from Hacker News

Storage engine design

by chaotic-good on 5/2/17, 11:55 AM with 28 comments

by lima on 5/2/17, 2:01 PM
It looks like Yandex recently open-sourced their Graphite engine they built on top of Clickhouse:
https://github.com/yandex/graphouse
Looks really interesting for Graphite-like use cases.

by eis on 5/2/17, 2:20 PM

I don't think you want to measure your timeseries datastructure against LSM-trees because the latter is inherently a pretty bad structure to use for timeseries (which are mostly append-only) as a few projects painfully found out.

Anyways, I'm interested in timeseries so I read the article and tried to understand the datastructure but to be honest it opens up more questions than it answers. I applaud people though who try to describe their datastructures that are core to the app. Thanks for that.

  1. What is the exact datastructure of a leaf? You mention a leaf node can hold 1000 datapoints.
  2. Why is using a WAL impossible? That should be possible for any datastructure.
  3. In your example if level 3 is full and everything gets merged into level 4, there are no levels 1-3. How does one query the datastructure? Does it maintain a pointer from the root to level 4?
  4. Related to above: if I want to query all data from last month until now which happens to be the start of level 4, it will first go to root, then to the level 2 Leaf, from there to the level 2 SBlock, from there to the level 3 Leaf, then level 3 SBlock, then level 4 Leaf, then level 4 Sblock, then the first level 4 leaf? That seems a lot of random access. How many iops does a lookup need?
  5. SBlocks need to be kept in memory. If I have ten million timeseries (not uncommon, can be much more), each with 3 levels, then upon startup the app has to load 30M SBlocks into memory?
  6. You say that several trees can be interleaved in a single file, how does that not break the linear IO advantage of columnar layouts?
  7. How do you store information in the "inner nodes" (SBlocks?) to speed up aggregation? Do you store every possible aggregation in there? E.g. sum, avg, min, max, stddev, ...
  8. Storage format of an individual time series is only a part of what a TSBD needs to do, another part is how to find all the timeseries for a particular query. How does that work?

And in general I think you can't have all three:

  A) good write performance
  B) no write amplification
  C) low RAM usage

... because you have to either write data out immediately (and get either 2x+ write amplification or lots of random writes/reads) or buffer it in RAM to batch linear writes.

I think there are some interesting ideas in this structure, it looks to me more like a Linked List of one level deep B+Trees, not a big overall tree.

by menegattig on 5/2/17, 1:53 PM
Good content for creating time-series database engines that was just posted on other HN thread:
https://medium.com/slicingdice-com-blog/want-to-build-a-new-...
https://news.ycombinator.com/item?id=14246189
by evdev on 5/2/17, 8:02 PM
What I've found matters in this area is the mismatch in locality between elements in read batches and elements in write batches. It'd be nice if the emerging DBs that deal with these issues put at least a gloss on the information model and write -> read lifecycle they're targeting.
Otherwise, a lot of these "actually you need X for time series!" are just talking past each other because "time series" means any number of actual patterns.
by tianyicui on 5/3/17, 12:57 PM
It seems the kind of query for N time series at a specific timestamp or across a small span will be inherently slow because of O(N) block read? Is there some way to support this kind of query efficiently?
by stuff4ben on 5/2/17, 12:42 PM
Fascinating stuff! How does this compare to other TSDB's like Influx (which uses LSM-tree) and Druid?
by batisteo on 5/2/17, 2:49 PM
Yet another project with an Esperanto name