by arjunnarayan on 7/14/20, 2:11 PM with 49 comments
by kqr on 7/14/20, 2:47 PM
Eventual consistency embraces this philosophy of accepting a lack of consistency in computer systems too, on the basis that maintaining actual consistency would be too expensive, complex, or slow, which is frequently the case.
This can of course, in principle, lead to ever-degrading consistency, and since you can't assume everything is consistent, you also can't really verify consistency in any way other than heuristically, as another commenter suggested.
Eventual consistency is a design choice driven by practical needs; it is never a path to complete data purity.
And this applies to streaming and batch tasks alike.
by asdfasgasdgasdg on 7/14/20, 2:32 PM
by cs702 on 7/14/20, 2:43 PM
> Existing computational models for processing continuously changing input data are unable to efficiently support iterative queries except in limited special cases. This makes it difficult to perform complex tasks, such as social-graph analysis on changing data at interactive timescales, which would greatly benefit those analyzing the behavior of services like Twitter. In this paper we introduce a new model called differential computation, which extends traditional incremental computation to allow arbitrarily nested iteration, and explain—with reference to a publicly available prototype system called Naiad—how differential computation can be efficiently implemented in the context of a declarative data-parallel dataflow language. The resulting system makes it easy to program previously intractable algorithms such as incrementally updated strongly connected components, and integrate them with data transformation operations to obtain practically relevant insights from real data streams.
See also this friendlier (and lengthier) online book: https://timelydataflow.github.io/differential-dataflow/
by alextheparrot on 7/14/20, 2:49 PM
I'm familiar with streaming, as a concept, from the likes of Beam, Spark, Flink, and Samza: they do computations over data, producing intermediate results consistent with the data seen so far. These results are, of course, not necessarily consistent with the larger world, because there could be unprocessed or late events in a stream, but they are consistent with the part of the world seen so far.
The advantage of streaming is the ability to compute and expose intermediate snapshots of the world without relying on the stream closing (many streams found in reality are unbounded, so intermediate results are the only realizable result set). These intermediate results can have value, but that depends on the problem statement.
To examine one of the examples, let's use example 2; it aligns with the idea that we don't actually have a traditional streaming problem. The question being asked is "What is the key that contains the maximum value?" There is a difference between asking "What is the maximum so far today?" and "What was the maximum result today?" The tense change is important: in the former, the user cares about the results as they exist at the present moment, whereas in the latter they care about a view of the world over a time frame that is complete. It seems like the idea of "consistent" is being conflated with "complete", where "complete" is not a guaranteed feature of an input stream.
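To make the distinction concrete, here is a minimal sketch in Python (the function names and the (key, value) event shape are my own, not from the article):

    from typing import Iterable, Iterator, Tuple

    def running_max(events: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
        # Unbounded view: emit the current (key, max) after every event,
        # i.e. an answer consistent with the data seen so far.
        best_key, best_val = None, float("-inf")
        for key, val in events:
            if val > best_val:
                best_key, best_val = key, val
            yield best_key, best_val

    def final_max(events: Iterable[Tuple[str, int]]) -> Tuple[str, int]:
        # Bounded view: one answer, meaningful only once the input is complete.
        return max(events, key=lambda kv: kv[1])

The first function answers the present-tense question at every moment; the second can only be asked of a stream that closes.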
Could anyone clarify why the examples here aren't just a case of expecting bounded vs. unbounded streams?
by nikhilsimha on 7/14/20, 6:20 PM
Pushing a timestamp in along with the max/variance change stream[1], and then using that timestamp to synchronize the join[2], would naturally produce a consistent output stream.
I quoted Flink because they have the best docs around, but it should be possible in most streaming systems. Disclaimer: I used to work for the FB streaming group and have collaborated with the Flink team very briefly.
[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/t...
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.11...
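For what it's worth, here is a rough, engine-agnostic sketch of what that timestamp synchronization means (Flink's time-based joins give you this natively; the function and stream names here are just illustrative):

    import heapq

    def synchronized_join(maxes, variances):
        # maxes, variances: iterables of (timestamp, value), each already
        # ordered by timestamp. Emit (t, max, var) after every change, with
        # the two inputs aligned on the timestamp rather than arrival order.
        merged = heapq.merge(
            ((t, "max", v) for t, v in maxes),
            ((t, "var", v) for t, v in variances),
        )
        current = {"max": None, "var": None}
        for t, side, v in merged:
            current[side] = v
            if current["max"] is not None and current["var"] is not None:
                yield t, current["max"], current["var"]

Because the merge is driven by the event timestamps rather than by whichever side happens to arrive first, every emitted triple reflects a single consistent point in stream time.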
by dekimir on 7/14/20, 4:14 PM
Isn't this a core feature of distributed systems? How can you be "consistent" if there's a network failure between some writer and the stream? How can you tell a network failure from a network delay? How can you tell a network delay from any other delay?
And finally, how can you even talk about "up-to-date" data if the reader doesn't provide their "date" (i.e., a logical timestamp)?
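A toy sketch of that last point, assuming a replica fed by a replication log (all names here are invented): a read is only "up to date" relative to a logical timestamp the reader supplies.

    class VersionedStore:
        def __init__(self):
            self.applied = 0   # highest log offset applied so far
            self.data = {}

        def apply(self, offset, key, value):
            # Apply a write from the replication log.
            self.data[key] = value
            self.applied = offset

        def read(self, key, min_offset):
            # The reader states its "date"; without it, "up to date"
            # is not even well defined.
            if self.applied < min_offset:
                raise RuntimeError("replica is stale; retry or redirect")
            return self.data.get(key)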
by anonymousDan on 7/14/20, 6:31 PM
by satyrnein on 7/15/20, 2:17 AM
by DevKoala on 7/14/20, 5:39 PM
If the title were something more honest, such as “How product X solves for Y”, I'd feel more compelled to trust that the analysis is objective.
by tlarkworthy on 7/14/20, 2:55 PM
by andrekandre on 7/14/20, 9:50 PM
Is that a correct interpretation?
by erikerikson on 7/14/20, 5:22 PM
This article reads as though the author hasn't shifted mindset from "the database will solve it for me" to "I'm taking on the relevant subset of problems in my use case". This seems off given that they're trying to sell a streaming product. They claim their product avoids problems by offering "always correct" answers, which requires a footnote at the very least, but none was given.
Point of note: the consistency guarantee is that, upon processing to the same offset in the log, and given that you have taken no other non-constant input, you will have the same computational result as all other processes executing semantically equivalent code.
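In other words (a toy sketch with an invented log; nothing here comes from the article):

    log = [("put", "a", 1), ("put", "b", 2), ("del", "a", None), ("put", "b", 5)]

    def replay(log, up_to_offset):
        state = {}
        for op, key, val in log[:up_to_offset]:
            if op == "put":
                state[key] = val
            else:  # "del"
                state.pop(key, None)
        return state

    # Same offset, same deterministic logic -> same state, on any process.
    assert replay(log, 3) == replay(log, 3)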
I take this sort of comment as abusive of the reader:
> What does a naive application of eventual consistency have to say about
>
>     -- count the records in `data`
>     select count(*) from data
>
> It’s not really clear, is it?
A naive application of eventual consistency declares that, along some equivalent of a Lamport timestamp across the offsets of shards in the stream, the system will calculate a count of the records in `data` as of that offset. Given the ongoing transmission of events that can alter the set `data`, that value will continue changing as appropriate, and in a manner consistent with the data it processes. A new answer will be given when the query is run again, or the system may even issue an ongoing stream of updates to that value.
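Concretely, something like this (shard contents invented for illustration):

    shards = {
        0: ["insert", "insert", "delete"],
        1: ["insert", "insert"],
    }

    def count_as_of(frontier):
        # select count(*) from data, evaluated at a specific frontier
        # of per-shard offsets.
        count = 0
        for shard, offset in frontier.items():
            for op in shards[shard][:offset]:
                count += 1 if op == "insert" else -1
        return count

    print(count_as_of({0: 2, 1: 1}))  # 3: the answer at this frontier
    print(count_as_of({0: 3, 1: 2}))  # 3: a later, equally well-defined answer

Each frontier names a definite point along the stream, so each answer is well defined even though the stream never ends.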
Maybe it got better as the article went on...
by ecopoesis on 7/14/20, 4:04 PM
It's great that your DB is ACID and anyone who queries it gets the latest and greatest, but in reality you also have out-of-date caches, ORM models that haven't been persisted, apps where users modify data that hasn't been pushed back to the server, and a million other examples.
I'm sure it's possible to create a consistent system, but I'm also sure it's not practical. No one does it.
Instead of constantly fighting eventual consistency, learn to embrace it and its shortcomings. Design systems and write code that are resilient to splits in HEAD, and provide easy methods to merge back to a single truth.
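One common shape for that merge-back step is state that merges deterministically, CRDT-style; a last-writer-wins register is the simplest (if bluntest) example. A toy sketch, with invented names:

    from typing import NamedTuple

    class LWWRegister(NamedTuple):
        value: str
        stamp: int  # logical clock, not wall time

    def merge(a: LWWRegister, b: LWWRegister) -> LWWRegister:
        # Commutative, associative, idempotent: replicas that have seen
        # the same writes converge, regardless of merge order.
        return a if (a.stamp, a.value) >= (b.stamp, b.value) else b

    head_a = LWWRegister("draft-2", 7)  # one split of HEAD
    head_b = LWWRegister("draft-3", 9)  # the other
    assert merge(head_a, head_b) == merge(head_b, head_a)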