from Hacker News

A modern data stack for startups (2022)

by olestr on 12/30/23, 1:28 AM with 26 comments

  • by lawrjone on 12/30/23, 8:17 AM

    This is an article from Jan 2022, when we were a company of 10; we are now a company of ~80.

    Some observations worth adding:

    - We're still using Fivetran for the EL stages. Costs are much more significant than they were before and we're looking (for the high volume sources) into options like DataStream as cost savers, but it's not unmanageable.

    - dbt is still working great, even if we've invested a lot in it, having now built a 5-person data team (BI, DA, DE) around it.

    - Still use Metabase but have some frustrations and are considering other options.

    - We no longer use Stitch :tada:

    There's a post that followed this on improvements we made to our setup that may be interesting: https://incident.io/blog/updated-data-stack

    The OP is still full of relevant, useful information, though (imo, of course).

  • by davedx on 12/30/23, 9:54 AM

    What's the business justification for spending this much effort (money) on data warehousing as a startup?

    I've not worked at any startups that did data warehousing. The one place I worked where we were /starting/ to get it set up had 300+ employees and $100M+/year revenue.

  • by 1letterunixname on 12/30/23, 11:59 AM

    Meta does it another way. Instead of one giant data warehouse or various DW silos, build a data platform API stack that supports heterogeneous storage adapters, privacy policies, regional locality policies, and retention policies underneath, and heterogeneous D*L operations on top. This sidesteps duplicating and denormalizing data and allows for maximum data discovery, reporting, and reuse. And while GraphQL can't be all things to all people, it's pretty damn good. If you need {MySQL,PostgreSQL,{{other_thing}}}-compatible or REST APIs, build them similarly.

    ETL should be minimized (except for external data, which is a bad sign of data owned or managed by a third party) and replaced with the equivalent of dynamic or materialized "views". Prefer to create hygienic "views" of the original data rather than mutating and destroying it with destructive transformations.
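
    A minimal sketch of the "views over original data" idea, using Python's built-in sqlite3 module. The table, view, and column names are hypothetical and chosen only for illustration; this is not a description of Meta's platform. The raw rows stay untouched, and the cleanup lives in a view that is computed on read.

```python
import sqlite3


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database with a raw table and a cleaning view."""
    conn = sqlite3.connect(":memory:")
    # Original data is stored exactly as it arrived (hypothetical schema).
    conn.execute(
        "CREATE TABLE raw_events ("
        "id INTEGER PRIMARY KEY, email TEXT, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO raw_events (email, amount_cents) VALUES (?, ?)",
        [("  Ada@Example.com ", 1250), ("bob@example.com", 300), ("ADA@example.com", 750)],
    )
    # The "hygienic" transformation is a view, so nothing is overwritten.
    conn.execute(
        """
        CREATE VIEW clean_events AS
        SELECT id,
               lower(trim(email)) AS email,
               amount_cents / 100.0 AS amount
        FROM raw_events
        """
    )
    return conn


conn = build_demo()
totals = conn.execute(
    "SELECT email, SUM(amount) FROM clean_events GROUP BY email ORDER BY email"
).fetchall()
raw_emails = [row[0] for row in conn.execute("SELECT email FROM raw_events ORDER BY id")]
print(totals)
print(raw_emails)
```

    Reports query clean_events, while raw_events keeps the original values, so a mistaken cleaning rule can be fixed by redefining the view rather than by re-ingesting data.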

    Finally, have a deeply integrated, robust, enterprise-wide, fine-grained ACL system and privacy policy to keep everyone (and system users) from accessing anything without a specific business need and an approval audit record stored via some sort of blockchain-like tech.

  • by evtothedev on 12/30/23, 1:48 PM

    I’d be curious to know if you considered using something like Dagster for orchestrating these runs? Seems like a more natural choice over CircleCI for running what resembles a DAG. (And either way, thanks for sharing this.)
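
    To illustrate what "resembles a DAG" means here, below is a rough sketch using graphlib from the Python standard library. It is not Dagster's API and not the article's actual CircleCI setup; the step names are hypothetical. Each step lists the steps it depends on, and a topological sort yields a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps: each key maps to the set of steps it depends on.
PIPELINE = {
    "extract_load": set(),
    "dbt_run": {"extract_load"},
    "dbt_test": {"dbt_run"},
    "refresh_dashboards": {"dbt_test"},
}


def run_order(dag: dict[str, set[str]]) -> list[str]:
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(PIPELINE)
print(order)
```

    An orchestrator adds scheduling, retries, and observability on top of this ordering, which is roughly the trade-off the question is pointing at.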

  • by alberth on 12/30/23, 7:53 AM

    Interesting pricing strategy (for Incident.io)

    Plan A: $16 (month/user)

    Plan B: $10,000+ Call Us

    Plan C: Call Us

    Those are some of the steepest price cliffs I’ve ever come across.

    https://incident.io/pricing#plan-comparison

  • by rollulus on 12/30/23, 7:36 AM

    This is likely here now because https://news.ycombinator.com/item?id=38797640 is at the top of the front page and references it.