by arjunnarayan on 7/14/20, 2:11 PM with 49 comments
by kqr on 7/14/20, 2:47 PM
Eventual consistency embraces this philosophy of accepting a lack of consistency in computer systems too, on the basis that maintaining actual consistency would be too expensive, complex, or slow, which is frequently the case.
This can of course, in principle, lead to ever-degrading consistency, and since you can't assume everything is consistent, you also can't really verify consistency in any way other than heuristically, as another commenter suggested.
Eventual consistency is a design choice driven by practical needs; it is never a path to complete data purity.
And this applies to streaming and batch tasks alike.
by asdfasgasdgasdg on 7/14/20, 2:32 PM
by cs702 on 7/14/20, 2:43 PM
> Existing computational models for processing continuously changing input data are unable to efficiently support iterative queries except in limited special cases. This makes it difficult to perform complex tasks, such as social-graph analysis on changing data at interactive timescales, which would greatly benefit those analyzing the behavior of services like Twitter. In this paper we introduce a new model called differential computation, which extends traditional incremental computation to allow arbitrarily nested iteration, and explain—with reference to a publicly available prototype system called Naiad—how differential computation can be efficiently implemented in the context of a declarative data-parallel dataflow language. The resulting system makes it easy to program previously intractable algorithms such as incrementally updated strongly connected components, and integrate them with data transformation operations to obtain practically relevant insights from real data streams.
See also this friendlier (and lengthier) online book: https://timelydataflow.github.io/differential-dataflow/
by alextheparrot on 7/14/20, 2:49 PM
I'm familiar with streaming, as a concept, from the likes of Beam, Spark, Flink, and Samza: they do computations over data, producing intermediate results consistent with the data seen so far. These results are, of course, not necessarily consistent with the larger world, because there could be unprocessed or late events in a stream, but they are consistent with the part of the world seen so far.
The advantage of streaming is the ability to compute and expose intermediate snapshots of the world without relying on the stream closing (many streams found in reality are unbounded, so intermediate results are the only realizable result set). These intermediate results can have value, but that depends on the problem statement.
To examine one of the examples, let's use example 2; it aligns with the idea that we don't actually have a traditional streaming problem. The question being asked is "What is the key that contains the maximum value?" There is a difference between asking "What is the maximum so far today?" and "What was the maximum result today?" The tense change is important: in the former, the user cares about the results as they exist at the present moment, whereas in the latter they care about a view of the world over a time frame that is complete. It seems like the idea of "consistent" is being conflated with "complete", where "complete" is not a guaranteed feature of an input stream.
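To make the distinction concrete, here is a minimal sketch in Python (the function names and the (key, value) event shape are my own, not from the article):

    from typing import Iterable, Iterator, Tuple

    def running_max(events: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, int]]:
        # Unbounded view: emit the current (key, max) after every event,
        # i.e. an answer consistent with the data seen so far.
        best_key, best_val = None, float("-inf")
        for key, val in events:
            if val > best_val:
                best_key, best_val = key, val
            yield best_key, best_val

    def final_max(events: Iterable[Tuple[str, int]]) -> Tuple[str, int]:
        # Bounded view: one answer, meaningful only once the input is complete.
        return max(events, key=lambda kv: kv[1])

The first function answers the present-tense question at every moment; the second can only be asked of a stream that closes.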
Could anyone clarify why the examples here aren't just a case of expecting bounded vs. unbounded streams?
by nikhilsimha on 7/14/20, 6:20 PM
Pushing a timestamp in along with the max/variance change stream[1], and then using that timestamp to synchronize the join[2], would naturally produce a consistent output stream.
I quoted Flink because they have the best docs around, but it should be possible in most streaming systems. Disclaimer: I used to work for the FB streaming group and have collaborated with the Flink team very briefly.
[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/t...
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.11...
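For what it's worth, here is a rough, engine-agnostic sketch of what that timestamp synchronization means (Flink's time-based joins give you this natively; the function and stream names here are just illustrative):

    import heapq

    def synchronized_join(maxes, variances):
        # maxes, variances: iterables of (timestamp, value), each already
        # ordered by timestamp. Emit (t, max, var) after every change, with
        # the two inputs aligned on the timestamp rather than arrival order.
        merged = heapq.merge(
            ((t, "max", v) for t, v in maxes),
            ((t, "var", v) for t, v in variances),
        )
        current = {"max": None, "var": None}
        for t, side, v in merged:
            current[side] = v
            if current["max"] is not None and current["var"] is not None:
                yield t, current["max"], current["var"]

Because the merge is driven by the event timestamps rather than by whichever side happens to arrive first, every emitted triple reflects a single consistent point in stream time.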
by dekimir on 7/14/20, 4:14 PM
Isn't this a core feature of distributed systems? How can you be "consistent" if there's a network failure between some writer and the stream? How can you tell a network failure from a network delay? How can you tell a network delay from any other delay?
And finally, how can you even talk about "up-to-date" data if the reader doesn't provide their "date" (i.e., a logical timestamp)?
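A toy sketch of that last point, assuming a replica fed by a replication log (all names here are invented): a read is only "up to date" relative to a logical timestamp the reader supplies.

    class VersionedStore:
        def __init__(self):
            self.applied = 0   # highest log offset applied so far
            self.data = {}

        def apply(self, offset, key, value):
            # Apply a write from the replication log.
            self.data[key] = value
            self.applied = offset

        def read(self, key, min_offset):
            # The reader states its "date"; without it, "up to date"
            # is not even well defined.
            if self.applied < min_offset:
                raise RuntimeError("replica is stale; retry or redirect")
            return self.data.get(key)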
by anonymousDan on 7/14/20, 6:31 PM
by satyrnein on 7/15/20, 2:17 AM
by DevKoala on 7/14/20, 5:39 PM
If the title were something more honest, such as “How product X solves for Y”, I'd feel more compelled to trust that the analysis is objective.
by tlarkworthy on 7/14/20, 2:55 PM
by andrekandre on 7/14/20, 9:50 PM
Is that a correct interpretation?
by erikerikson on 7/14/20, 5:22 PM
This article reads as though the author hasn't shifted mindset from "the database will solve it for me" to "I'm taking on the relevant subset of problems in my use case". This seems off given that they're trying to sell a streaming product. They claim their product avoids problems by offering "always correct" answers, which requires a footnote at the very least, but none was given.
Point of note: the consistency guarantee is that, upon processing to the same offset in the log, and given that you have taken no other non-constant input, you will have the same computational result as all other processes executing semantically equivalent code.
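In other words (a toy sketch with an invented log; nothing here comes from the article):

    log = [("put", "a", 1), ("put", "b", 2), ("del", "a", None), ("put", "b", 5)]

    def replay(log, up_to_offset):
        state = {}
        for op, key, val in log[:up_to_offset]:
            if op == "put":
                state[key] = val
            else:  # "del"
                state.pop(key, None)
        return state

    # Same offset, same deterministic logic -> same state, on any process.
    assert replay(log, 3) == replay(log, 3)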
I take this sort of comment as abusive of the reader:
> What does a naive application of eventual consistency have to say about
>
>     -- count the records in `data`
>     select count(*) from data
>
> It’s not really clear, is it?
A naive application of eventual consistency declares that, along some equivalent of a Lamport timestamp across the offsets of shards in the stream, the system will calculate a count of the records in `data` as of that offset. Given the ongoing transmission of events that can alter the set `data`, that value will continue changing as appropriate, and in a manner consistent with the data it processes. A new answer will be given when the query is run again, or the system may even issue an ongoing stream of updates to that value.
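Concretely, something like this (shard contents invented for illustration):

    shards = {
        0: ["insert", "insert", "delete"],
        1: ["insert", "insert"],
    }

    def count_as_of(frontier):
        # select count(*) from data, evaluated at a specific frontier
        # of per-shard offsets.
        count = 0
        for shard, offset in frontier.items():
            for op in shards[shard][:offset]:
                count += 1 if op == "insert" else -1
        return count

    print(count_as_of({0: 2, 1: 1}))  # 3: the answer at this frontier
    print(count_as_of({0: 3, 1: 2}))  # 3: a later, equally well-defined answer

Each frontier names a definite point along the stream, so each answer is well defined even though the stream never ends.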
Maybe it got better as the article went on...
by ecopoesis on 7/14/20, 4:04 PM
It's great that your DB is ACID and anyone who queries it gets the latest and greatest, but in reality you also have out-of-date caches, ORM models that haven't been persisted, apps where users modify data that hasn't been pushed back to the server, and a million other examples.
I'm sure it's possible to create a consistent system, but I'm also sure it's not practical. No one does it.
Instead of constantly fighting eventual consistency, learn to embrace it and its shortcomings. Design systems and write code that are resilient to splits in HEAD, and provide easy methods to merge back to a single truth.
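One common shape for that merge-back step is state that merges deterministically, CRDT-style; a last-writer-wins register is the simplest (if bluntest) example. A toy sketch, with invented names:

    from typing import NamedTuple

    class LWWRegister(NamedTuple):
        value: str
        stamp: int  # logical clock, not wall time

    def merge(a: LWWRegister, b: LWWRegister) -> LWWRegister:
        # Commutative, associative, idempotent: replicas that have seen
        # the same writes converge, regardless of merge order.
        return a if (a.stamp, a.value) >= (b.stamp, b.value) else b

    head_a = LWWRegister("draft-2", 7)  # one split of HEAD
    head_b = LWWRegister("draft-3", 9)  # the other
    assert merge(head_a, head_b) == merge(head_b, head_a)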