from Hacker News

On moving from CouchDB to Riak

by franckcuny on 3/7/11, 2:22 PM with 39 comments

  • by smanek on 3/7/11, 2:54 PM

    I went through a lengthy evaluation process of Riak recently, and came away with a generally positive impression.

    First of all, it's beautifully engineered, as long as you just need a KV store or a graph DB (I wasn't in love with the MapReduce stuff, but that's another story). None of the hassle Hadoop/HBase have with some nodes being special (HBase Master, HDFS NameNode, etc.). Also, no running multiple daemons per server (e.g., none of HBase's distinction between regionserver and datanode daemons). Easy config files, a simple HTTP API (so you can just throw an off-the-shelf load balancer like HAProxy in front of it), and lots of little things that just make life easier.
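    To give a flavour of that HTTP API, storing and fetching a key looks roughly like this (assuming a node on localhost:8098 and the /riak/<bucket>/<key> URL scheme of current releases; details may differ on your install):

      import requests  # any HTTP client works; the point is that it's plain HTTP

      BASE = "http://localhost:8098/riak"  # assumption: default Riak HTTP port

      # Store a JSON document under bucket "users", key "alice"
      requests.put(
          f"{BASE}/users/alice",
          json={"name": "Alice", "karma": 42},
      )

      # Fetch it back; any off-the-shelf load balancer can sit in front of this
      resp = requests.get(f"{BASE}/users/alice")
      print(resp.status_code, resp.json())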

    I also really like how explicit it is about the CAP tradeoffs it makes, with powerful conflict resolution available for when consistency has been sacrificed (instead of trying to sweep the issue under the rug, like many other distributed DBs do).
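    To make "conflict resolution" concrete: when consistency has been sacrificed, a read can hand the application several sibling values instead of one, and the application merges them. A toy sketch of that merge step (not Riak's API, just the shape of the idea), where each sibling happens to be a set so a union is a safe merge:

      # Toy application-level conflict resolution: given sibling values produced
      # by concurrent writes, merge them into one value and write it back.
      # Here each sibling is a set of followed user ids, so union is a safe merge;
      # real applications pick a merge rule that fits their data.
      def resolve_siblings(siblings: list[set[str]]) -> set[str]:
          merged = set()
          for sibling in siblings:
              merged |= sibling
          return merged

      siblings = [{"bob", "carol"}, {"bob", "dave"}]   # two concurrent writes
      print(resolve_siblings(siblings))                # {'bob', 'carol', 'dave'}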

    However, there are a few downsides.

    First, as mentioned in the article, with the default backend (a KV store called Bitcask) all the keys per node have to fit in memory (and each key has on the order of 20 bytes of overhead, IIRC). Annoyingly, this fact isn't noted anywhere in the Riak documentation that I saw (although there is a workaround: using the InnoDB backend instead). This won't matter too much for many use cases, but it can be pretty painful if you're not expecting it.
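    Back-of-the-envelope for that constraint (the per-key overhead and key size here are my assumptions; the point is only that RAM scales with key count):

      # Rough keydir sizing sketch: every key lives in RAM on the node that owns it.
      keys_per_node    = 500_000_000     # assumption: half a billion keys per node
      avg_key_size     = 30              # bytes, assumption
      overhead_per_key = 40              # bytes of in-memory bookkeeping, assumption

      ram_needed = keys_per_node * (avg_key_size + overhead_per_key)
      print(f"~{ram_needed / 2**30:.1f} GiB of RAM just for the keys")
      # ~32.6 GiB -- easy to blow past a small node's RAM before storing much data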

    Second, you can guard against data loss by specifying that N copies of each piece of data be stored on your cluster. However, under some conditions (https://issues.basho.com/show_bug.cgi?id=228), the data may be replicated on fewer than N distinct nodes, so the failure of fewer than N nodes could result in data loss.
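    For reference, N is just a bucket property, set with something like the request below (assuming the /riak/<bucket> properties endpoint); the catch in the bug above is that N replicas on the ring are not guaranteed to be N distinct physical machines:

      import requests

      # Ask Riak to keep 3 replicas of everything in the "users" bucket.
      # Note: this is the replica count on the ring, which (see bug 228)
      # is not always the same as 3 distinct physical nodes.
      requests.put(
          "http://localhost:8098/riak/users",
          json={"props": {"n_val": 3}},
      )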

    Finally, to the best of my knowledge, the largest Riak clusters in production are around 60 nodes, while Hadoop has 2000-node clusters running in production (e.g., at Facebook). Perfectly acceptable for most users, but it's one more thing to worry about if you're planning to roll out a large cluster (more potential for 'unknown unknowns', so to speak).

  • by rnewson on 3/7/11, 5:13 PM

    Describing Bigcouch as a bit of a hack while admitting you've not used it seems a little unfair. That said, it's very valuable to hear how other people see our product.

    Without doing a full-on sales pitch (I am not a salesman, nor do I play one on TV), I should say that Bigcouch adds a lot of desired features to CouchDB, notably scaling beyond a single machine and sharding (which supports very large databases). We run several clusters of Bigcouch nodes in production and have done so for some time; it's a thrill for me personally to see a distributed database take a real pounding and keep on trucking.

    I've been meaning to try Riak myself, so you've inspired me to finally pull down the code and give it a proper examination.

  • by siculars on 3/7/11, 4:19 PM

    Re. File size growth in Couch:

    Couch files are written in an append-only fashion, so every operation, including an update, is an append. The main upside is durability: the file is always readable and never left in an odd state. As noted, however, this has the downside of requiring compaction to reclaim disk space. You should note that if you are using Bitcask, the default backend for Riak, you will have the same problem you had with Couch, as Bitcask is also an append-only log. Bitcask will also need compaction, but as of the latest release there are various knobs to tweak regarding when that is queued up and when it is executed.
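    If the append-only idea is unfamiliar, here is a toy sketch of the mechanics (nothing to do with either codebase, just the general shape): every write, including an update, goes on the end of the file, and compaction rewrites the file keeping only the latest record per key.

      import json, os

      LOG = "toy.log"

      def put(key, value):
          # Every write, including an update, is an append; old records stay on disk.
          with open(LOG, "a") as f:
              f.write(json.dumps({"k": key, "v": value}) + "\n")

      def compact():
          # Reclaim space by rewriting the log with only the newest value per key.
          latest = {}
          with open(LOG) as f:
              for line in f:
                  rec = json.loads(line)
                  latest[rec["k"]] = rec["v"]
          with open(LOG + ".tmp", "w") as f:
              for k, v in latest.items():
                  f.write(json.dumps({"k": k, "v": v}) + "\n")
          os.replace(LOG + ".tmp", LOG)   # atomic swap; old file's space is reclaimed

      put("a", 1); put("a", 2); put("b", 3)   # log now holds 3 records
      compact()                               # log now holds 2 records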

    Re. database being one file:

    I'm not exactly sure why Couch uses one file, but I suspect it is due to the use of B-trees internally. I'd love to hear more from someone more experienced with Couch. Riak splits its data amongst many files: specifically, one file (or 'cask', when using the default Bitcask backend) per 'vnode' (virtual node; 64 by default). The drawback here is that you need to ensure that your environment has enough open file descriptors to service requests.
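    For a rough picture of how a key ends up in one of those per-vnode casks (treat this as an approximation of the Dynamo-style ring, not Riak's exact hashing): the bucket/key is hashed onto a fixed-size ring, which is divided into 64 equal partitions, each owned by a vnode with its own file.

      import hashlib

      RING_SIZE  = 2 ** 160          # SHA-1 output space
      PARTITIONS = 64                # default number of vnodes

      def vnode_for(bucket: str, key: str) -> int:
          # Hash bucket+key onto the ring, then find which equal slice owns it.
          h = int.from_bytes(hashlib.sha1((bucket + key).encode()).digest(), "big")
          return h // (RING_SIZE // PARTITIONS)

      print(vnode_for("users", "alice"))   # some partition id in 0..63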

  • by philwise on 3/7/11, 6:25 PM

    "We store a lot of data... 2TB"

    Given that a pair of 2TB drives is less than $250 on ebuyer right now, 2TB of data is not 'big data'. You could comfortably stuff that in any decent database (SQL Server for example, I'm sure PostgreSQL would work too).

    Just because a tiny machine on slicehost isn't big enough doesn't mean that your data won't fit in a normal database.

  • by rch on 3/7/11, 4:01 PM

    Anyone planning to store significant quantities of data across multiple distinct nodes should have a look at the Disco Distributed File System:

    http://discoproject.org/doc/start/ddfs.html

    It handles 'data distribution, replication, persistence, addressing and access' quite well.

  • by electrum on 3/7/11, 6:01 PM

    > We rolled out a cluster of 5 nodes and 1024 partitions. Each node has 8gig of memories, 3x1Tb disk in RAID5.

    This hardware configuration doesn't make sense. 8 gigabytes of memory is pathetically small today. 32 gigs is basically standard, with 64 gigs costing little extra. RAID 5 with only three disks will have horrible performance. A single machine with 32/64 gigs and 12-16 disks in a better RAID configuration should perform better at a lower operational cost.

  • by davidhollander on 3/7/11, 6:01 PM

    What I disliked about Riak is that although at first glance it appears you can namespace key/values into multiple buckets, you can't. The whole database is really only a single bucket, and if you want to run a map/reduce it's always over every key in the database!
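    For context, a bucket-wide job over the HTTP interface looks roughly like this (assuming the standard /mapred endpoint and the built-in Riak.mapValuesJson JavaScript function); using a bucket name as the input is what forces the walk over every key:

      import requests

      # A bucket-wide MapReduce over HTTP: giving the bucket name as "inputs"
      # means Riak has to list and feed in every key in that bucket.
      job = {
          "inputs": "users",
          "query": [
              {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}}
          ],
      }
      resp = requests.post("http://localhost:8098/mapred", json=job)
      print(resp.json())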

    Now, you can simply increase complexity and create multiple Riak clusters for different data schemes and treat it as a replication tool. However, their in-memory Bitcask map-reduce was actually slower than a hard-drive map-reduce over text JSON files in a folder, where each file was individually opened, read from disk, decoded, and closed in a loop (with nothing held in memory). Which was rather scary!

    Their replication scheme does appear to be a pretty faithful copy of Amazon Dynamo, though (consistent hashing ring, quorums, Merkle trees), which is nice.

  • by rb2k_ on 3/7/11, 4:09 PM

    It's great to see that I'm not the only person having problems with CouchDB's compaction failing when dealing with larger datasets.

    I had the exact same experience during my master's thesis (hn: http://news.ycombinator.com/item?id=2022192 ), and this was one of the reasons why CouchDB didn't seem like a good solution for a dataset with a high number of updates.

  • by tillk on 3/8/11, 11:29 PM

    I'm a little let down by the level of detail in this blog post.

    For anyone who cares: we had 6x as many documents (as linkfluence) in a single CouchDB instance before we moved to Cloudant. We now have over 360 million documents on Cloudant.

    Our database is very write-intensive: lots of new documents, few updates to the actual documents, and a lot of reads of the documents in between. We also have a couple of larger views to aggregate the data for our users.

    The ride with CouchDB has been bumpy at times, but not at the 'scale' where linkfluence is at.

  • by karmi on 3/7/11, 9:12 PM

    > Each modification generates a new version of the document. (...) we don’t need it, and there’s no way to disable it (...)

    Note that there's a `_revs_limit` setting available: <http://wiki.apache.org/couchdb/HTTP_database_API>. It's very beneficial for a use case like yours.
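    Setting it is a single request against the database (the sketch below assumes a stock install on port 5984 and a database named "mydb"):

      import requests

      # Track at most 10 revisions' worth of metadata per document instead of
      # the default 1000; revision info beyond the limit gets pruned.
      requests.put("http://localhost:5984/mydb/_revs_limit", data="10")

      print(requests.get("http://localhost:5984/mydb/_revs_limit").text)   # -> 10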

    (Though I've seen CouchDB perform rather poorly under really heavy read/write load, or when compacting big datasets, on occasion.)

  • by davidw on 3/7/11, 4:54 PM

    Hrm... with N "nosql" things, there's great potential for writing. Redis vs Riak. Cassandra to CouchDB. Why I gave up on Mongo and moved to Mnesia. And so on and so forth;-)