by solidangle on 4/24/19, 5:30 PM with 54 comments
by georgewfraser on 4/24/19, 8:52 PM
1. A data lake, where all data is stored in its native format (CSV, JSON, ...), in an object store (S3, GCS, ...), with the schema defined on read (Hive, Presto, ...).
2. A data warehouse, where all the data is organized in highly structured tables (star schema) in a commercial database (Snowflake, Redshift, ...).
This is a false choice! Modern data warehouses, particularly Snowflake and BigQuery, are fully capable of storing semi-structured data.
Furthermore, you do not need to curate your data into a star schema before loading it. The ideal way to set up a modern data warehouse is to establish a "staging" schema that matches the source, and then transform that data into a star schema or data marts using SQL. In this scenario, your "data lake" and "data warehouse" are just two different schemas within the same database.
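To make the staging-then-transform idea concrete, here is a minimal sketch. It is not taken from any particular warehouse: SQLite (via Python's standard `sqlite3` module) stands in for the database, and the `staging` and `marts` schema names, the `orders` table, and all column names are hypothetical. It loads source rows unchanged into a staging schema, then builds a small dimension table and fact table from them using only SQL.

```python
import sqlite3

# Assumption: SQLite stands in for a real warehouse. The "staging" and "marts"
# schema names and every table/column name below are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS staging")
conn.execute("ATTACH DATABASE ':memory:' AS marts")

# Staging mirrors the source system: one wide table, loaded as-is.
conn.execute(
    "CREATE TABLE staging.orders ("
    "order_id INTEGER, customer_email TEXT, customer_name TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO staging.orders VALUES (?, ?, ?, ?)",
    [
        (1, "ann@example.com", "Ann", 20.0),
        (2, "bob@example.com", "Bob", 15.5),
        (3, "ann@example.com", "Ann", 4.5),
    ],
)

# Transform with plain SQL: a customer dimension plus an order fact table.
conn.executescript(
    """
    CREATE TABLE marts.dim_customer (
        customer_key INTEGER PRIMARY KEY,
        email TEXT UNIQUE,
        name TEXT
    );
    INSERT INTO marts.dim_customer (email, name)
    SELECT DISTINCT customer_email, customer_name FROM staging.orders;

    CREATE TABLE marts.fact_order AS
    SELECT o.order_id, c.customer_key, o.amount
    FROM staging.orders AS o
    JOIN marts.dim_customer AS c ON c.email = o.customer_email;
    """
)

# Query the star schema: total order amount per customer.
totals = conn.execute(
    "SELECT c.name, SUM(f.amount) FROM marts.fact_order AS f "
    "JOIN marts.dim_customer AS c ON c.customer_key = f.customer_key "
    "GROUP BY c.name ORDER BY c.name"
).fetchall()
print(totals)  # [('Ann', 24.5), ('Bob', 15.5)]
```

The raw rows stay queryable in `staging` after the transform, so both layers live in the same database, and the star schema can be rebuilt from them if the model changes.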
There are still some scenarios where it makes sense to build a data lake in addition to a data warehouse, primarily future-proofing. I wrote a blog post where I tried to outline these scenarios: https://fivetran.com/blog/when-to-adopt-a-data-lake
by ekzhu on 4/24/19, 9:35 PM
Many challenges remain though, and we would like to explore some of the more pertinent ones. In fact, we are conducting a survey to understand the current state of data lakes in industry and the challenges experienced. If you're interested in learning more, see what we came up with here: https://www.surveymonkey.com/r/R7MYXSJ - would love to see what the HN community thinks about the current state of data lakes.
[1] http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf
[2] http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf
[3] http://www.cs.toronto.edu/~ekzhu/papers/josie.pdf
by alexchamberlain on 4/24/19, 6:22 PM
by mobileexpert on 4/24/19, 5:51 PM
https://github.com/apache/incubator-iceberg
https://github.com/apache/incubator-hudi
Happy to see Delta go open source.
by MrPowers on 4/24/19, 6:11 PM
by mmrezaie on 4/24/19, 6:16 PM
by tlrobinson on 4/24/19, 6:33 PM
by dikei on 4/25/19, 1:25 AM
When you use an ACID storage layer, you're kinda locked into one solution for both ETL and query, which is not nice.
by playing_colours on 4/24/19, 8:47 PM
There is a lot of information in articles and blogs, but I prefer books as a solid source of structured and aggregated information.
Surprisingly, I found just a single proper book on the topic: https://www.amazon.com/Enterprise-Big-Data-Lake-Delivering/d...
by iblaine on 4/24/19, 6:07 PM
by huac on 4/24/19, 6:56 PM
by FridgeSeal on 4/25/19, 1:25 AM
by 5874-4b22-a4e0 on 4/24/19, 9:13 PM