from Hacker News

Scaling to 100M: MySQL Is a Better NoSQL

by andreyvit on 5/24/16, 5:03 PM with 175 comments

by koolba on 5/24/16, 7:15 PM
So much to disagree with here ...
> Locks limit access to the table, so on a high throughput use case it may limit our performance.
Then use a proper database that implements MVCC.
> Do not use transactions, which introduce locks. Instead, use applicative transactions.
Or just use a database that handle transactions more efficiently.
> `site_id` varchar(50) NOT NULL,
Why varchar(50)? UUIDs are 16-bytes. The best way to store them would be the binary bytes (which is how postgres stores them). If it's hex without dashes, it'll be varchar(32). If it's hex with dashes, it'll be varchar(36). Why did they pick 50? Future growth? Smart keys? Schema designer doesn't know what a UUID actually is?
> Do not normalize.
Bullshit. Normalize as much as is practical and denormalize as necessary. It's much easier to denormalize and it greatly simplifies any transaction logic to deal with a normalized model.
> Fields only exist to be indexed. If a field is not needed for an index, store it in one blob/text field (such as JSON or XML).
This is terrible advice. Fields (in a table) exist to be read, filtered, and returned. If everything is in a BLOB then you have to deserialize that BLOB to do any of those. That doesn't mean you can't have JSON "meta" fields but if your entire schema id (id uuid, data json) you're probably doing it wrong. It's next to impossible to enforce proper data constraints and all your application logic becomes if/then/else/if/then/else... to deal with the N+1 possibilities of data. Oh and when you finally add a new one, you have to update the code in M+1 places.
by peter_d_sherman on 5/24/16, 7:33 PM
Aside from the MySQL vs. Other DB debate (which I refuse to take part of, although I'm willing to ascribe good points to all camps), this article is absolutely excellent with respect to acting as a guide for people who want to use MySQL as a Key/Value store. Absolutely stellar article! All of the points are dead-on. I applaud the author for putting together so much specific information about tuning MySQL for Key/Value in one place, and the ridiculous speed and scalability you can get if you do it correctly. (That being said, NoSQL Key/Value databases are good too.)
by TheGuyWhoCodes on 5/24/16, 5:54 PM
This is basically "We made it work, easy, all the rest are wrong". Wix is 10 years old, they probably started with MySql and stuck with it, is it wrong? Maybe, maybe not. If they were to start today would they have used Mysql aswell or gone with another solution? Did they spend the last 10 years building tools to help them scale MySql (at which point it's easy for them to operate) rather than use a tool that had multi server or multi DC in-mind.
Oh and citing statistics without details is plain lying, how many server, how much RAM, SSD based or HDD....
by tiffanyh on 5/24/16, 5:42 PM
Since Wix is using MySQL as a key-store ... I wonder why they didn't look at using Postgres HStore [1].
HStore is a key value store built directly in the RDBMS of Postgres.
[1] http://www.postgresql.org/docs/9.6/static/hstore.html
by ph33t on 5/24/16, 6:46 PM
I hate these stupid "my db is better than whatever db" articles. 1) What db to be used depends on the situation AND MORE IMPORTANTLY 2) what experience your staff has
I can say that 10 years go, I would have chosen M$SQL over MySQL and it would have been the correct choice. At the time I had almost 10 years experience with M$SQL and almost none with MySQL. Now I have more than 10 years of MySQL under my belt. AND the MySQL experience is more current. Right now I could choose between the two based on specific features and performance characteristics. For me to pick Posgresql because of a specific feature would be insane because I don't experience with it. No knock on Posgresql ... maybe I'll spend time with it and pick it for some down-the-road project.
I have implemented couchdb as a caching solution. I know how to manage, backup, and restore the database server. I have managed a 5 node cluster. If you ask me to implement NoSQL, it would be my choice for 2 reasons: 1. It can do the job. 2. I have experience making it do the job.
I'm sure there are 10 million people out there would would choose mongo in the same situation. The would not be wrong and they may come up with a superior solution. For me to implement Mongo today would be wrong - I would almost certainly come up with an inferior solution. for them, it would be stupid not to.
I'm not saying "don't learn anything new". I'm saying "don't gamble your business on technology with which you're not familiar".
Its a bit like backups ... the most important thing about a backup is not the technology you use, but whether you are capable of restoring and maintaining the backups.
by jjoe on 5/24/16, 6:30 PM
Scalability is like an abstract painting. It's unique to one's infrastructure. Its writing or sometimes postmortem makes good brain fertilizer. Not so much more. Beyond that I wouldn't rush to implement scalability du jour.
A setup that works for a certain service won't necessarily work for another unless yours is a very close replica. Based on my experience in this area, and I'm a performance seeking nut, each platform, and even each traffic pattern, needs its own thinking hat.
That's what makes it so fun!
by fapjacks on 5/24/16, 5:58 PM
This may be an unpopular perspective, but here goes. For many years I ran a business doing web development. I had many clients approach me who were using Wix, and who I could not help, because Wix had effectively taken hostage their images. Because of those years of bad experiences (telling clients that they are screwed unless they keep paying Wix), I do not trust Wix, and so I do not trust this post. Should those clients have trusted that Wix would make their data available in the future? No, totally not. But that is the cost of doing shady things. Everything with your name on it now gets taken with a grain of salt.
by xrstf on 5/24/16, 5:52 PM
I wonder if using the memcached plugin for InnoDB[1] would speed things up even more, at the expense of not having flexible queries (and thereby introducing multiple roundtrips) anymore. Presumably, they are using simple "SELECT * FROM table WHERE id = ?" in most places anyway, so that could be an okay tradeoff to make.
[1] https://dev.mysql.com/doc/refman/5.6/en/innodb-memcached.htm...
by BinaryIdiot on 5/24/16, 6:19 PM
> Use client-generated unique keys. We use GUIDs.
Minor note but wouldn't UUIDs be better since they're time based? Sure it's really unlikely to hit an already used GUID but an UUID makes it impossible.
In fact is there a use case where it's better to use GUIDs over UUIDs? I couldn't think of one but I could be omitting something from my thinking so I'm curious.
Edit: apparently GUID and UUID are the same thing and GUID is simply Microsoft's original implementation of UUID. All this time I had no idea...
by jlas on 5/24/16, 5:46 PM
> When someone clicks a link to a Wix site... That server has to resolve the requested site from the site address by performing a key/value lookup URL to a site.
So Wix uses MySQL to resolve site routes internally? Is this the best way to do it? Would it be possible to use internal domain names and rely on DNS to resolve everything?
by gshulegaard on 5/24/16, 7:27 PM
For those interested in NoSQL (particularly MongoDB) you may find this an interesting read:
https://www.linkedin.com/pulse/mongodb-32-now-powered-postgr...
But over time I finding less and less reason to _not_ use PostgreSQL when contemplating a NoSQL document store.
by rantanplan on 5/24/16, 5:38 PM
And PostgreSQL is a better MySQL so... all is settled?
by wefarrell on 5/24/16, 5:49 PM
I was expecting a comparison but they only presented one side.
by mh-cx on 5/24/16, 8:43 PM
I have not heard about "Wix" before, but maybe they should have done some more research before picking this name. To a German this sounds like "wichsen" which means, well, "wank"[1].
[1] http://dict.leo.org/ende/index_de.html#/search=wichsen
by sjwright on 5/25/16, 1:32 AM
I've been running a rather large website with MySQL for the past fifteen years. There was a period when I regretted that choice and used something else. Today I'm using MariaDB and the TokuDB storage engine, and I'm so thankful that I never migrated to Postgres.
Like many people I investigated the NoSQL movement for potential applicability, and almost swallowed the hype. As I investigated more, I realised:
1. There are some specific instances where a NoSQL engine makes good sense. They're a valid option and should be considered depending on the application. In my experience though, well formed RDBMS structures are the better option in the vast majority of applications.
2. Most of the hype and growth came from people who (a) were using the abomination known as ORMs which are the canonical example of a round peg in a square hole; and/or (b) didn't know how to build performant RDBMS schemas. For these people, the NoSQL engine was fast because it was the first engine they actually learned how to optimise correctly.
by electrotype on 5/24/16, 8:42 PM
A little bit off topic, but I would like to hear more about using Solr [1] instead of any "real" NoSQL databases.
I don't have experience with MongoDb and such, but I've always asked myself why someone wouldn't use Solr as a distributed NoSQL database... Am I wrong or, with Solr, you get that key/value scalable storage AND you get advanced search features as an extra?
Why would I want to use MongoDb instead of Solr? What killer feature Solr doesn't have?
[1] http://lucene.apache.org/solr/
by jamesblonde on 5/25/16, 5:35 AM
I thought this article would be about the true MySQL NoSQL system: MySQL Cluster (or NDB). It scales to 200m transactional reads per second - per second! http://highscalability.com/blog/2015/5/18/how-mysql-is-able-... We have got 16m read/sec on our commodity rack with MySQl Cluster, so it's not a fantasy result.
by Illniyar on 5/25/16, 8:24 AM
Two things that are sorely missing in this comparison to NoSql is:
How are they performing horizontal scaling, I'm guessing they aren't, without addressing the issue of sharding and scaling they can't really compare the solution to NoSql - it is the number 1 feature that NoSql has over RDBMS.
If they are achieving 1ms response time , then they almost certainly have the entire table in memory cache. What happens when the data grows beyond the size of the memory and it's not financially feasible to get a larger memory instance.
by fleaflicker on 5/24/16, 11:06 PM
From 7 years ago, by Bret Taylor (who went on to become CTO at Facebook after acquisition):
How FriendFeed uses MySQL to store schema-less data https://backchannel.org/blog/friendfeed-schemaless-mysql
Edit to add the HN discussion at the time: https://news.ycombinator.com/item?id=496946
by EGreg on 5/25/16, 5:58 AM
Here is my (albeit limited experience) advice:
1. Use PostgreSQL, or MySQL with InnodDB for row level locking
2. Huge tables should be sharded with the shard key being a prefix of the primary key.
If you need to access the same data via different indexes then denormalize and duplicate the index data in one or more "index" tables.
3. Do not use global locks. Generate random strings for unique ids (attempt INSERT and regenerate until it succeeds) instead of autoincrement.
4. Avoid JOINs across shards. If you use these, you won't be able to shard your app layer anymore.
5. For reads, feel free to put caches in front of the database, with the keys same as the PK. Invalidate the caches for rows being written to.
It's actually pretty easy to model. You have the fields for the data. Then you think by which index will it be requested? Shard by that.
Note that this will still lead you to a huge centralized datacenter!! Because your authentication happens at the webserver level and then you just have all the servers trust each other. While it is a nice horizontal architecture, it leads to crazy power imbalances like we have today. Consider instead making it a distributed architecture, where the shards turn into domains, and each user on each domain has to auth with every other domain. But your network can then be distributed without a single point of failure. What's more, local area networks will be able to host your app and be quick without the signal bouncing halfway around the world.
by hifier on 5/24/16, 8:27 PM
It seems to me that one of the core differences between MySQL/Postgres and distributed stores like Cassandra / Hbase is that with the former your data and your write workload have to fit onto a single host. If either one cannot fit then you have to partition at the application level or use a real distributed data store. Partitioning at the app level is an operational burden and complexity that would be best avoided, but there are always exceptions.
by cmenge on 5/24/16, 11:23 PM
So MySQL is great if you use none of its features, but then its really hardly different from all the other databases. So it's not the implementation, but the very promises that databases make which can't be held, but if you know that, you are just fine. Great insight, and pretty much the definition of NoSQL...
by chucky_z on 5/25/16, 3:47 AM
Everyone should try PostgreSQL with hstore (and JSONB now, too!).
This is a key/value store inside an RDBMS that just works, and it works great!
I converted a crappy sloppy super messy 1000+ column main table in a ~800GB database to use hstore, it was, in real world benchmarks, between 7x and 10,000x (yes, really, ten thousand times) faster.
The CEO of the company who had a technical say in everything, and was very proud of his schema "wasn't excited" and it never happened in any production instance.
I've left since then, and the company has made very little advancement, especially when it comes to their database.
Really, just use hstore. Try it out. The syntax is goofy, but... I mean, SQL itself is a little bit goofy, right?
by adenverd on 5/25/16, 2:23 AM
100M? Of course you'd scale an RDBMS for that, especially if you want searchability and analytics. It's way easier than a Hadoop -> Elasticsearch pipeline (or pick your flavor).
NoSQL databases are for BIG data. As in, billions of rows big.
by krosaen on 5/24/16, 6:48 PM
related: friendfeed used a similar approach https://backchannel.org/blog/friendfeed-schemaless-mysql
by mrmrcoleman on 5/24/16, 6:40 PM
Very interesting post. I seem to remember that you were featured in a MongoDB 'success story' last year but they seem to have removed it now.
Does that mean you've stopped using Mongo altogether?
by languagehacker on 5/25/16, 12:15 AM
Just about as wrong as the day it was posted here in December
by neeleshs on 5/24/16, 7:56 PM
Awesome! Now lets do it for 1B rows, and then 10 Billion and then some. It is well known that for small datasets NoSQL is no better , if not worse, than an RDBMS.
by stevesun21 on 5/24/16, 11:02 PM
System design 101: keep business logic into the layer above database layer rather than relying on specific db system to implement them. In this way to design an system, there shouldn't have any different between using MySQL of using NoSQL, their role is just storage engine. So, you don't need to follow the relational database practice, like for example, foreign key, constrains, normalization anymore.
by okigan on 5/24/16, 5:46 PM
>An active-active-active setup across three data centers.
Any info how "active-active-active" (I assume 3 aws regions) is accomplished?
by ai_ja_nai on 5/25/16, 8:40 AM
MySQL looks great when used as K-V because it avoids the bad planner (when you have a primary key as the only searching key, a planner is useless) and denormalizing avoids expensive JOIN ops.
But there is the awkward replication model, the lack of native data structures as column type and the lack of sharding support.
by bechampion on 5/25/16, 7:25 AM
I love mysql , saved my ass many times , but this article doesn't mean anything .. it just says that you can use subqueries and joins to do "nosql"... we all know that .. you can also use a text file. I'd like if mysql copies what postgres has done with hstore.
by jamiequint on 5/24/16, 7:18 PM
Wouldn't Aerospike be a cheaper, lower maintenance, and more robust solution to this problem?
by jondubois on 5/25/16, 3:25 AM
The problem with SQL DBs is that they just weren't designed for distributed computation to begin with. SQL doesn't take into account CAP theorem - So it lets you write queries which work on a single machine but which cannot scale to multiple machines.
On the other hand, many NoSQL databases like MongoDB and RethinkDB have a query language which was designed to run on both single-machines and distributed infrastructure (in a homogeneous way); the same queries which work on a single machine will also work at scale on multiple machines - No need to rewrite your queries as your app grows.
You CAN scale with SQL but you have to know what queries to avoid (E.g. table joins, nested queries...) but with NoSQL, you don't have to avoid any queries; if it's in the Docs, it's safe to use.
Finally, a major difference between SQL vs NoSQL is the typed vs untyped structure. Most SQL databases were designed in a time when statically typed languages were mainstream; so it made sense for SQL databases to enforce static typing on their data.
On the other hand, NoSQL was designed in a time when dynamically typed languages where popular and gaining more popularity (E.g. Ruby, Python, JavaScript); when using these languages, having to add SQL-specific types to data feels like an unnecessary step. With NoSQL you can still enforce a schema in the application code but your schema logic doesn't have to abide by any type constraints from DB layer - Your schema is the ultimate authority on typing of your DB - If gives you the flexibility to be lazy with type-checking in the areas which are low-importance (where errors are tolerable) and strict where data type consistency is paramount.
Generally, NoSQL DBs impose constraints to query expressiveness in order to free you from architectural constraints. SQL DBs impose few constraints on query expressiveness but because of this, they add architectural constraints to your system.
by ecolak on 5/24/16, 9:11 PM
"Many developers look at NoSQL engines—such as MongoDB, Cassandra, Redis, or Hadoop" Noone uses Hadoop as a database. On the other hand, HBase which uses HDFS as underlying storage is a great NoSQL database that we use in production.
by return0 on 5/24/16, 7:14 PM
This seems specific to their use case. He shows an example with a subquery. I wonder why they don't break that to two queries. The should be fast enough if cached, and would prevent the need for both tables to be unlocked during query.
by eblanshey on 5/25/16, 5:09 AM
> Do not perform table alter commands. Table alter commands introduce locks and downtimes. Instead, use live migrations.
Care to elaborate more on this? What do you mean by live migrations?
by 0n34n7 on 5/25/16, 6:10 PM
Mongo comes with geospatial indexing baked right in. Never mind map / reduce. It comes down to the data structures of our times, which are increasingly not relational.
by Sarki on 5/25/16, 8:11 AM
So if I got it straight the message is: "Don't fall for the sirens of hype but instead make sure that your choice of technologies suits your needs"?
by tacone on 5/24/16, 7:58 PM
The first thing that comes to mind is that they write about read throughput, when the write throughput is a big selling point of many NoSQLs.
by bchociej on 5/24/16, 5:42 PM
Kinda begging the question aren't we? I turn to nosql for things that aren't key-value, generally.
by HolyHaddock on 5/25/16, 4:30 PM
Does anyone know what they mean by `Instead, use applicative transactions.`?
by markhops on 5/26/16, 4:19 PM
Stupid question here: what are serial keys, and how do they impose locks?
by andradejr on 5/26/16, 10:32 AM
Anybody else thinks Postgres would have been a better comparison here?
by 1024core on 5/24/16, 8:32 PM
I'm surprised there's no mention of HandlerSockets.
by tomphoolery on 5/25/16, 2:30 AM
Does this mean PostgreSQL is an even better NoSQL? ;-)
by SliderUp on 5/27/16, 1:50 AM
Is a 100 million 'scaling up' these days?
by meshko on 5/24/16, 10:24 PM
Upon reading this i have three questions for them: 1) Do you do backups? 2) Do you use source control versioning system? and last but not least 3) Why do you kill so many kittens?
by return0 on 5/24/16, 8:53 PM
Glad to see the nosql hype blowing down to reasonable levels. Next up: imperative languages back in vogue.
by zzzcpan on 5/24/16, 6:07 PM
I thought nosql movement was about distributed systems, cap and all that. What does this "active-active-active" even mean? No consistency and no availability guaranties I presume?
by meeper16 on 5/24/16, 6:07 PM
I'm a purist and also need the absolute fastest lookups with out SQL overhead so I go straight for MDB (Sleepycat BerkleyDB) - faster than LevelDB or any others.
by rubenolivares on 5/25/16, 1:14 AM
Why can't autists just use whatever they want and stop trying to convince people that their way is best.