from Hacker News

Ask HN: Stream processing engine that joins to historical data in Snowflake?

by remolacha on 5/10/24, 5:45 AM with 3 comments

Hey, I'm looking for advice on stream processing engines that allow me to do the following:

1. Required: Write a query that joins an event stream with a historical table in Snowflake 2. Required: Executes in near-real time < 5s even if a query involves 300M rows 3. Highly desired: Gives me a way of doing dbt-like DAGs, where I can execute a DAG of actions (including external api calls) based on results of the query 4. Highly desired: allows me to write queries in standard SQL 5. Desired: true real time (big queries executing w/ subsecond latency)

What are the best options out there? It seems like Apache Flink enables this, but there also seem to be a number of other projects out there that may enable some or all of what I'm describing, including:

- kSQL - Arroyo - Proton - Kafka Streams - Snowflake's Snowpipe Streaming - Benthos - RisingWave - Spark Streaming - Apache Beam - Timely Dataflow and derivatives (Materialize, Bytewax, etc.)

Any recommendations on the best tool for the job? Are there interesting alternatives that I haven't named?

by chucklarrieu on 5/10/24, 5:26 PM
That’s not how stream processing works? It’s absurdly costly to query snowflake for every record that streams in. So then you try to cache the snowflake dataset. But then the cache becomes stale. Honestly this sounds like an XY problem — the problem is stated as X but you’re really trying to solve something else Y.
What is the business requirement? What are the technical specifications required to meet the need? From there, we can start to consider architecture solutions.
Storage is relatively cheap compared to compute. Most stream processors require or at least highly encourage you to provide them direct access to the input data instead of calling an external system.
by amath on 5/10/24, 6:11 PM
Sounds like you are writing a pipeline to enrich a stream of data with historical data from Snowflake. This is a fairly common pattern. If the data in snowflake is not changing often, you would want to cache that somewhere for use with the stream processor to avoid that query overhead and speed things up.
Most of those listed can meet your first 2 requirements. Looking further down the list, your requirement of SQL and a DAG type of representation will limit the list to only a few. I don't know if many of those listed provide both of those capabilities.
If you relax the SQL constraint, more of them are applicable like Bytewax and Kafka-streams.
by zX41ZdbW on 5/10/24, 6:56 AM
Is it possible to pull historical data out of Snowflake? If it is an option - you can place it in ClickHouse and use a materialized view to JOIN with the historical table during insertion into the real-time table.