from Hacker News

Databricks open-sources Delta Lake to make data lakes more reliable

by solidangle on 4/24/19, 5:30 PM with 54 comments

  • by georgewfraser on 4/24/19, 8:52 PM

    There's a lot of confusion around data lakes. One source of confusion is that "data lake" versus "data warehouse" is often presented as a choice, where you can have either:

    1. A data lake, where all data is stored in its native format (CSV, JSON, ...), in an object store (S3, GCS, ...), with the schema defined on read (Hive, Presto, ...).

    2. A data warehouse, where all the data is organized in highly structured tables (star schema) in a commercial database (Snowflake, Redshift, ...).

    This is a false choice! Modern data warehouses, particularly Snowflake and BigQuery, are fully capable of storing semi-structured data.

    Furthermore, you do not need to curate your data into a star schema before loading it. The ideal way to set up a modern data warehouse is to establish a "staging" schema that matches the source, and then transform that data into a star schema or data marts using SQL. In this scenario, your "data lake" and "data warehouse" are just two different schemas within the same database.
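
    A minimal sketch of the staging-then-transform pattern described above, using Python's built-in sqlite3 module as a stand-in for a warehouse. The table names, columns, and rows are hypothetical, and because SQLite has no separate schemas, the `staging_` and `dim_`/`fact_` prefixes play the role of the two schemas:

```python
import sqlite3

# In-memory database standing in for the warehouse (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "Data lake" layer: a staging table that mirrors the source, no curation.
    CREATE TABLE staging_orders (
        order_id INTEGER, customer_name TEXT, customer_city TEXT, amount REAL
    );
    INSERT INTO staging_orders VALUES
        (1, 'Ada', 'London', 10.0),
        (2, 'Ada', 'London', 5.0),
        (3, 'Grace', 'New York', 7.5);

    -- "Data warehouse" layer: a small star schema derived with plain SQL.
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT
    );
    INSERT INTO dim_customer (name, city)
        SELECT DISTINCT customer_name, customer_city
        FROM staging_orders
        ORDER BY customer_name;

    CREATE TABLE fact_orders AS
        SELECT s.order_id, d.customer_id, s.amount
        FROM staging_orders s
        JOIN dim_customer d
          ON d.name = s.customer_name AND d.city = s.customer_city;
""")

# Query the star schema: total order amount per customer.
totals = conn.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_orders f
    JOIN dim_customer d USING (customer_id)
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()
print(totals)
```

    The raw rows stay queryable in the staging table, so the transformation can be rerun or changed later without reloading from the source.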

    There are still some scenarios where it makes sense to build a data lake in addition to a data warehouse, primarily future-proofing. I wrote a blog post where I tried to outline these scenarios: https://fivetran.com/blog/when-to-adopt-a-data-lake

  • by ekzhu on 4/24/19, 9:35 PM

    We (data curation lab at Univ of Toronto) are doing research on data lake discovery problems. One of the problems we are looking at is how to efficiently discover joinable and unionable tables. For example, find all the rental listings from various sources to create a master list (union); or find tables such as rental listings and school districts that can be used to augment each other (join). The technical challenges in finding joinable and unionable tables in data lakes include the following: (1) the data schema is often inconsistent and poorly managed, so we can’t simply rely on that schema; and (2) the scale of data lakes can be on the order of hundreds of thousands of tables, making a content-based search algorithm expensive. We came up with some solutions that are based on data sketches, with several published papers [1,2,3]. The Python library “datasketch” was a byproduct of this work.
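
    To illustrate the general idea of a data sketch for finding joinable columns, here is a minimal MinHash sketch in plain Python. It is an assumption-laden illustration, not the method from the cited papers and not the datasketch API: the function names, the choice of BLAKE2b as the hash, and the sample column values are all hypothetical.

```python
import hashlib

def minhash_signature(values, num_perm=128):
    """Build a MinHash signature for a non-empty set of column values.

    For each of num_perm differently salted hash functions, keep the
    minimum hash value seen over the whole set.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salt per simulated hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(str(v).encode("utf-8"),
                                digest_size=8, salt=salt).digest(),
                "big",
            )
            for v in values
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Hypothetical columns from two tables that might be joinable on postal code.
listings_zip = {"M5S", "M4Y", "M6G", "M5V", "M4K"}
schools_zip = {"M5S", "M4Y", "M6G", "M1B", "M9W"}
unrelated = {"red", "green", "blue"}

sig_listings = minhash_signature(listings_zip)
sig_schools = minhash_signature(schools_zip)
sig_unrelated = minhash_signature(unrelated)

# The true Jaccard similarity of the two postal-code columns is 3/7.
print(estimate_jaccard(sig_listings, sig_schools))
print(estimate_jaccard(sig_listings, sig_unrelated))
```

    Each signature has a fixed size regardless of how many values the column holds, which is what makes comparing very large numbers of columns cheaper than comparing their raw contents.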

    Many challenges remain though, and we would like to explore some of the more pertinent ones. In fact, we are conducting a survey to understand the current state of data lakes in industry and the challenges experienced. If you're interested in learning more, see what we came up with here: https://www.surveymonkey.com/r/R7MYXSJ - would love to see what the HN community thinks about the current state of data lakes.

    [1] http://www.vldb.org/pvldb/vol9/p1185-zhu.pdf [2] http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf [3] http://www.cs.toronto.edu/~ekzhu/papers/josie.pdf

  • by alexchamberlain on 4/24/19, 6:22 PM

    I don't really understand the concept of a data lake, and Wikipedia isn't helping much... is it just a buzzword for a collection of data stores?
  • by mobileexpert on 4/24/19, 5:51 PM

    Other efforts to improve the world of Parquet datasets on cloud storage:

    https://github.com/apache/incubator-iceberg

    https://github.com/apache/incubator-hudi

    Happy to see Delta go open source.

  • by MrPowers on 4/24/19, 6:11 PM

    They tried to keep it closed and sell it as a premium service, but it looks like they need help from the open source community to make the product better. Great to see. Databricks has its roots in open source (the founder created Spark) and it's great that they're still making a lot of open source code rather than making everything private.
  • by mmrezaie on 4/24/19, 6:16 PM

    What are the other alternatives for data lakes that can be used (both open source and closed)?
  • by tlrobinson on 4/24/19, 6:33 PM

    I appreciate the thought TechCrunch put into the image representing ACID-compliant data lakes.
  • by dikei on 4/25/19, 1:25 AM

    What I don't like about these ACID storage layers is that they reduce compatibility between different query engines. For example, Spark cannot read Hive ACID tables natively and Hive cannot read Spark Delta tables either. Then there are other tools such as Presto or Drill which can read neither.

    When you use an ACID storage layer, you're kinda locked into one solution for both ETL and query, which is not nice.

  • by playing_colours on 4/24/19, 8:47 PM

    Recently I wanted to learn more about data lakes: how to design and maintain them.

    There is a lot of information in articles and blogs, but I prefer books as a solid source of structured and aggregated information.

    Surprisingly, I found just a single proper book on the topic: https://www.amazon.com/Enterprise-Big-Data-Lake-Delivering/d...

  • by iblaine on 4/24/19, 6:07 PM

    What's the difference between a Delta Lake and Change Data Capture? Seems like in both cases you're creating a type 2 dimension against a source table.
  • by huac on 4/24/19, 6:56 PM

    Are there other ways of implementing ACID transactions on Spark tables?
  • by FridgeSeal on 4/25/19, 1:25 AM

    Sounds cool, but then I'd have to use Spark...
  • by 5874-4b22-a4e0 on 4/24/19, 9:13 PM

    Cloud -> Data lake -> Data stream -> Data Ocean -> Cloud