from Hacker News

Common data model mistakes made by startups

by ReginaDeiPirati on 5/22/21, 4:34 PM with 130 comments

by rm999 on 5/22/21, 7:43 PM
>Soft deletes
This section is totally wrong IMO. What is the alternative? "Hard" deleting records from a table is usually a bad idea (unless it is for legal reasons), especially if that table's primary key is a foreign key in another table - imagine deleting a user and then having no idea who made an order. Setting a deleted/inactive flag is by far the least of two evils.
>when multiplied across all the analytics queries that you’ll run, this exclusion quickly starts to become a serious drag
I disagree, modern analytics databases filter cheaply and easily. I have scaled data orgs 10-50x and never seen this become an issue. And if this is really an issue, you can remove these records in a transform layer before it hits your analytics team, e.g. in your data warehouse.
>soft deletes introduce yet another place where different users can make different assumptions
Again, you can transform these records out.
by pyrophane on 5/22/21, 6:13 PM
I think the biggest mistake some startups make wrt their data model is not really thinking about it at all. The data model winds up being the byproduct of all the features they've implemented and the framework and the libraries they've used, rather than something that was deliberately designed.
by rectang on 5/22/21, 6:00 PM
Metabase provides business analytics, and this list of "common mistakes" is weighted towards "choices which get in the way of business analytics".
For example:
> 1. Polluting your database with test or fake data
> [...] By polluting your database with test data, you’ve introduced a tax on all analytics (and internal tool building) at your company.
by handrous on 5/22/21, 5:54 PM
> 5. The “right database for the job” syndrome
I once saw something a little similar to this, except with one flavor of DB rather than several. A company you've likely heard of went hard for a certain Java graph database product, due to a combination of an internal advocate who seemed determined to be The GraphDB Guy and an engineering manager who was weirdly susceptible to marketing material. This because some of their data could be represented as graphs, so clearly a graph database is a good idea.
However: the data for most of their products was tiny, rarely written, not even read that much really, even less commonly written concurrently, and was naturally sharded (with hard boundaries) among clients. Their use of that graph database product was plainly contributing to bugginess, operational pain, mediocre performance (it was reasonably fast... as long as you didn't want to both traverse a graph and fetch data related to that graph, then it was laughably slow) and low development velocity on multiple projects.
Meanwhile, the best DB to deliver the features they wanted quickly & with some nice built-in "free" features for them (ability to control access via existing file sharing tools they had, for instance) was probably... SQLite.
by konha on 5/22/21, 6:20 PM
> On the flip side, soft deletes require every single read query to exclude deleted records.
You can use partial indexes to only index non-deleted rows. If you are worried about having to remember to exclude deleted rows from queries: Use a view to abstract away the implementation detail from your analytics queries.
by ridaj on 5/22/21, 6:26 PM
I would personally add:
- Having informal metrics and dimension definitions: you throw together something quick and dirty and then realize there's something semantically broken about your data definitions or unevenness. For example your Android app and iOS apps report "countries" differently, or they have meaningfully different notions of "active users"
- Not anticipating backfill/restatement needs. Bugs in logging and analytics stacks happen as much as anywhere else, so it's important to plan for backfills. Without a plan, backfills can be major fire drills or impossible.
- Being over-attentive to ratio metrics (CTR, conversion rates) which are typically difficult to diagnose (step 1 figure out whether the numerator or the denominator is the problem). Ratio metrics can be useful to rank N alternatives (eg campaign keywords) but absolute metrics are usually more useful for overall day to day monitoring.
- Overlooking the usefulness of very simple basic alerting. It's common for bugs to cause a metric to go to zero, or to be double counted, or to not be updated with recent data, but often times even these highly obvious problems don't get detected until manual inspection.
by brylie on 5/22/21, 6:46 PM
If your company has a subscription business model, keep a history of user's subscriptions. They change over time and it is likely you will need to measure popularity and profitability of product offerings over time. Please don't force your analytics team to rely on event logs to reconstruct a subscription history.
by FriedrichN on 5/23/21, 8:21 AM
I have seen so many people argue against soft deletes over the years. But I have also had so many instances where users 'accidentally' deleted a bunch of items and then call support to ask if there are any backups. And then I'll have to reconstruct the data from yesterday's backup plus today's changes. A soft delete will take care of this.
And no amount of "are you really really really sure you want to delete this?" confirmations are going to fix this. You could require the whole Spongebob Squarepants ravioli ravioli give me the formuoli song and dance and people will still delete hundreds or thousands of records by accident.
by ineedasername on 5/22/21, 10:40 PM
Polluting your database with test or fake data
Maybe I've been spoiled, but isn't it common to have dev, test, and prod instances? Possibly multiples of the former 2?
by dugmartin on 5/22/21, 7:08 PM
I would add to their semi structured data fields section a suggestion to add a version or type key. Otherwise your code consuming those field may grow over time to a bunch of conditionals to figure what is in the json.
by worik on 5/23/21, 1:41 AM
In my experience I would add: Building systems out of "lego blocks".
It is possible to get all the pieces that are needed to build a data server for a enterprise pre built form cloud providers. Then plumb them together so the mostly work.
When the heat comes on and peopel are using it for real and it must scale (even a little) it blows up horribly.
The "leggo bricks" save a lot of time and money, and mean that people with only half a clue can build large impressive looking systems, but in the end people like ,e are picking up the pieced
by nivertech on 5/23/21, 7:26 AM
There are advantages for soft deletes for CRUD architecture, but are there any for CQRS/ES (Event Sourcing)?
I guess if your read model is based on RDBMS then it makes sense, otherwise it depends on the database system in question (i.e. some NoSQL databases like C*[1] and Riak[2] are implementing deletes by writing special tombstone values, which is kind of soft-delete but on the implementation level - but you can't easily restore the data like in case of RDBMS).
[1] https://thelastpickle.com/blog/2016/07/27/about-deletes-and-...
[2] https://docs.riak.com/riak/kv/latest/using/reference/object-...
by jasonhansel on 5/22/21, 5:32 PM
> Typically semi-structured data have schemas that are only enforced by convention
Technically, in Postgres you can (kind of) enforce arbitrary schemas for semi-structured data using CHECK constraints. Unfortunately this isn't well-documented and NoSQL DBs often don't support similar mechanisms.
by jayd16 on 5/22/21, 7:12 PM
Whats the best way to construct a session?
>The exact definition of what comprises a session typically changes as the app itself changes.
Isn't this an argument for post-hoc reconstruction? You can consistently re-run your analytics. If the definition changes in code, your persisted data becomes inconsistent, no?
by giovannibonetti on 5/22/21, 10:07 PM
> Queries for business metrics are usually scattered, written by many people, and generally much less controlled. So do what you can to make it easy for your business to get the metrics it needs to make better decisions.
A simple but useful thing is setting the database default time zone match the one where most of your team is (instead of UTC). This reduces the chance your metrics are wrong because you forgot to set the time zone when extracting the date of a timestamp.
by elchief on 5/23/21, 12:58 AM
Some enterprise data model links here:
https://dba.stackexchange.com/questions/12991/ready-to-use-d...
Instead of soft deletes, move records to a history table
I agree w session issue. Had to rebuild sessions before and is a pita compared to just recording them at source
by jerrysievert on 5/23/21, 1:35 AM
the one that is missing for me, that is my personal pet peeve:
an index for every column in the database. then wondering why inserts are slow.
seriously?
by eterm on 5/22/21, 7:16 PM
A more common thing I think is just trying to collect and hoard too much data.
Most of even these worries such as soft deletes disappear if you're not trying to keep every scrap of data you can.
Focus on the core business requirements and competencies and you likely don't need to store the minutae of every interaction forever.
by cjfd on 5/23/21, 6:32 AM
It sounds like quite a few of the problems that are mentioned here can be ameliorated using views.
by Pxtl on 5/22/21, 6:54 PM
How do you reconcile the first bullet point (polluting data with test data) vs Test In Production being the modern trend? Those sound irreconcilable.
by intricatedetail on 5/23/21, 10:26 AM
I am happy that on so many projects we rejected the kool aid and just used postgres and redis. Can't remember ever troubleshooting these.