by remolacha on 5/10/24, 5:45 AM with 3 comments
1. Required: Write a query that joins an event stream with a historical table in Snowflake (sketched below)
2. Required: Executes in near-real time (< 5s), even if a query involves 300M rows
3. Highly desired: Gives me a way of doing dbt-like DAGs, where I can execute a DAG of actions (including external API calls) based on the results of the query
4. Highly desired: Allows me to write queries in standard SQL
5. Desired: True real time (big queries executing with subsecond latency)
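To make (1) concrete, here's roughly the shape of query I have in mind, sketched in Flink SQL. The stream, table, and column names are illustrative only, and I'm assuming the Snowflake table has somehow been made visible to the stream processor:

    -- Illustrative sketch only: enrich a Kafka-backed event stream
    -- with a historical table (assumed replicated out of Snowflake).
    SELECT
      o.order_id,
      o.amount,
      c.lifetime_value
    FROM orders_stream AS o      -- event stream (e.g. a Kafka source)
    JOIN customers_hist AS c     -- historical table from Snowflake
      ON o.customer_id = c.customer_id;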
What are the best options? Apache Flink seems to enable this, but a number of other projects may cover some or all of what I'm describing, including:
- ksqlDB
- Arroyo
- Proton
- Kafka Streams
- Snowflake's Snowpipe Streaming
- Benthos
- RisingWave
- Spark Streaming
- Apache Beam
- Timely Dataflow and derivatives (Materialize, Bytewax, etc.)
Any recommendations on the best tool for the job? Are there interesting alternatives that I haven't named?
by chucklarrieu on 5/10/24, 5:26 PM
What is the business requirement? What are the technical specifications required to meet the need? From there, we can start to consider architecture solutions.
Storage is relatively cheap compared to compute. Most stream processors require, or at least strongly encourage, giving them direct access to the input data rather than having them call out to an external system per event.
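As a hedged sketch of what "direct access" looks like in practice (Flink SQL, assuming a hypothetical CDC topic customers_cdc that mirrors the Snowflake table):

    -- Sketch only: register the historical data as a table the
    -- processor ingests directly (here an upsert Kafka CDC topic),
    -- so joins run locally instead of calling Snowflake per event.
    CREATE TABLE customers_hist (
      customer_id    BIGINT,
      lifetime_value DOUBLE,
      PRIMARY KEY (customer_id) NOT ENFORCED
    ) WITH (
      'connector' = 'upsert-kafka',
      'topic'     = 'customers_cdc',
      'properties.bootstrap.servers' = 'broker:9092',
      'key.format'   = 'json',
      'value.format' = 'json'
    );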
by amath on 5/10/24, 6:11 PM
Most of the tools listed can meet your first two requirements. Further down your list, the combination of standard SQL and a DAG-style representation narrows the field to only a few; I'm not sure many of the tools you named provide both.
If you relax the SQL constraint, more of them become applicable, such as Bytewax and Kafka Streams.
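For the DAG side in pure SQL, one pattern worth noting: in engines like Materialize or RisingWave, stacking materialized views already forms a dependency DAG, dbt-style. A sketch with illustrative names (the external-API steps in your requirement 3 would still have to live outside SQL):

    -- Sketch: each view depends on the one before it, so the engine
    -- maintains a DAG of incrementally updated computations.
    CREATE MATERIALIZED VIEW enriched AS
    SELECT o.order_id, o.amount, c.segment
    FROM orders_stream AS o
    JOIN customers_hist AS c ON o.customer_id = c.customer_id;

    CREATE MATERIALIZED VIEW segment_totals AS
    SELECT segment, SUM(amount) AS total_amount
    FROM enriched
    GROUP BY segment;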
by zX41ZdbW on 5/10/24, 6:56 AM