by ReginaDeiPirati on 5/22/21, 4:34 PM with 130 comments
by rm999 on 5/22/21, 7:43 PM
This section is totally wrong IMO. What is the alternative? "Hard" deleting records from a table is usually a bad idea (unless it is for legal reasons), especially if that table's primary key is a foreign key in another table - imagine deleting a user and then having no idea who made an order. Setting a deleted/inactive flag is by far the lesser of two evils.
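A minimal sketch of the flag approach, Postgres flavor (the users/orders tables and deleted_at column are made up):

    -- orders.user_id references users(id), so rows can't simply vanish
    ALTER TABLE users ADD COLUMN deleted_at timestamptz;

    -- "deleting" a user leaves every order's foreign key intact
    UPDATE users SET deleted_at = now() WHERE id = 42;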
>when multiplied across all the analytics queries that you’ll run, this exclusion quickly starts to become a serious drag
I disagree, modern analytics databases filter cheaply and easily. I have scaled data orgs 10-50x and never seen this become an issue. And if this is really an issue, you can remove these records in a transform layer before it hits your analytics team, e.g. in your data warehouse.
>soft deletes introduce yet another place where different users can make different assumptions
Again, you can transform these records out.
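Roughly like this, assuming a raw landing schema and a deleted_at flag as above (schema names are illustrative):

    -- analysts query the view and never see soft-deleted rows,
    -- so nobody has to remember the filter
    CREATE VIEW analytics.users AS
    SELECT *
    FROM raw.users
    WHERE deleted_at IS NULL;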
by pyrophane on 5/22/21, 6:13 PM
by rectang on 5/22/21, 6:00 PM
For example:
> 1. Polluting your database with test or fake data
> [...] By polluting your database with test data, you’ve introduced a tax on all analytics (and internal tool building) at your company.
by handrous on 5/22/21, 5:54 PM
I once saw something a little similar to this, except with one flavor of DB rather than several. A company you've likely heard of went hard for a certain Java graph database product, due to a combination of an internal advocate who seemed determined to be The GraphDB Guy and an engineering manager who was weirdly susceptible to marketing material. This was because some of their data could be represented as graphs, so clearly a graph database was a good idea.
However: the data for most of their products was tiny, rarely written, not even read that much really, even less commonly written concurrently, and was naturally sharded (with hard boundaries) among clients. Their use of that graph database product was plainly contributing to bugginess, operational pain, mediocre performance (it was reasonably fast... as long as you didn't want to both traverse a graph and fetch data related to that graph, at which point it was laughably slow), and low development velocity on multiple projects.
Meanwhile, the best DB to deliver the features they wanted quickly & with some nice built-in "free" features for them (ability to control access via existing file sharing tools they had, for instance) was probably... SQLite.
by konha on 5/22/21, 6:20 PM
You can use partial indexes to only index non-deleted rows. If you are worried about having to remember to exclude deleted rows from queries: Use a view to abstract away the implementation detail from your analytics queries.
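In Postgres that might look like this (index, table, and column names made up):

    -- the index covers only live rows, so it stays small, and queries
    -- that filter on deleted_at IS NULL can still use it
    CREATE INDEX users_email_active_idx
        ON users (email)
        WHERE deleted_at IS NULL;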
by ridaj on 5/22/21, 6:26 PM
- Having informal metric and dimension definitions: you throw together something quick and dirty and then realize there's something semantically broken or uneven about your data definitions. For example, your Android and iOS apps report "countries" differently, or have meaningfully different notions of "active users".
- Not anticipating backfill/restatement needs. Bugs in logging and analytics stacks happen as much as anywhere else, so it's important to plan for backfills. Without a plan, backfills can be major fire drills or impossible.
- Being over-attentive to ratio metrics (CTR, conversion rates), which are typically difficult to diagnose (step 1: figure out whether the numerator or the denominator is the problem). Ratio metrics can be useful for ranking N alternatives (e.g. campaign keywords), but absolute metrics are usually more useful for overall day-to-day monitoring.
- Overlooking the usefulness of very simple, basic alerting. It's common for bugs to cause a metric to go to zero, to be double counted, or to not be updated with recent data, but oftentimes even these highly obvious problems don't get detected until manual inspection.
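Even a single scheduled query can catch the zero/stale cases; a rough Postgres-flavored sketch (the daily_metrics table is made up):

    -- alert on any metric that stopped loading or dropped to zero
    SELECT metric_name,
           max(metric_date) AS last_loaded
    FROM daily_metrics
    GROUP BY metric_name
    HAVING max(metric_date) < current_date - 1
        OR sum(metric_value) FILTER (WHERE metric_date = current_date - 1) = 0;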
by brylie on 5/22/21, 6:46 PM
by FriedrichN on 5/23/21, 8:21 AM
And no amount of "are you really really really sure you want to delete this?" confirmations is going to fix this. You could require the whole SpongeBob SquarePants ravioli ravioli give me the formuoli song and dance and people would still delete hundreds or thousands of records by accident.
by ineedasername on 5/22/21, 10:40 PM
Maybe I've been spoiled, but isn't it common to have dev, test, and prod instances? Possibly multiples of the former two?
by dugmartin on 5/22/21, 7:08 PM
by worik on 5/23/21, 1:41 AM
It is possible to get all the pieces needed to build a data server for an enterprise pre-built from cloud providers, then plumb them together so they mostly work.
When the heat comes on and people are using it for real and it must scale (even a little), it blows up horribly.
The "Lego bricks" save a lot of time and money, and mean that people with only half a clue can build large, impressive-looking systems, but in the end people like me are picking up the pieces.
by nivertech on 5/23/21, 7:26 AM
I guess if your read model is based on an RDBMS then it makes sense; otherwise it depends on the database system in question (i.e. some NoSQL databases like C*[1] and Riak[2] implement deletes by writing special tombstone values, which is a kind of soft delete at the implementation level - but you can't easily restore the data the way you can in an RDBMS).
[1] https://thelastpickle.com/blog/2016/07/27/about-deletes-and-...
[2] https://docs.riak.com/riak/kv/latest/using/reference/object-...
by jasonhansel on 5/22/21, 5:32 PM
Technically, in Postgres you can (kind of) enforce arbitrary schemas for semi-structured data using CHECK constraints. Unfortunately this isn't well-documented and NoSQL DBs often don't support similar mechanisms.
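For example (a sketch; the events table and the required key are made up):

    -- enforce a minimal "schema" on a jsonb column
    CREATE TABLE events (
        id      bigserial PRIMARY KEY,
        payload jsonb NOT NULL,
        -- every payload must carry a string event_type
        CHECK (payload ? 'event_type'),
        CHECK (jsonb_typeof(payload -> 'event_type') = 'string')
    );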
by jayd16 on 5/22/21, 7:12 PM
>The exact definition of what comprises a session typically changes as the app itself changes.
Isn't this an argument for post-hoc reconstruction? You can consistently re-run your analytics. If the definition changes in code, your persisted data becomes inconsistent, no?
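Post-hoc reconstruction can be a single query over the raw events, which is what makes restating history under a new definition cheap. A sketch, assuming an events table and a 30-minute inactivity cutoff:

    -- gap-based sessionization: bump the session counter whenever
    -- a user has been idle for more than 30 minutes
    SELECT user_id,
           event_time,
           sum(is_new_session)
               OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
    FROM (
        SELECT user_id,
               event_time,
               CASE WHEN lag(event_time) OVER w IS NULL
                      OR event_time - lag(event_time) OVER w > interval '30 minutes'
                    THEN 1 ELSE 0 END AS is_new_session
        FROM events
        WINDOW w AS (PARTITION BY user_id ORDER BY event_time)
    ) t;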
by giovannibonetti on 5/22/21, 10:07 PM
A simple but useful thing is setting the database default time zone to match the one where most of your team is (instead of UTC). This reduces the chance your metrics are wrong because you forgot to set the time zone when extracting the date of a timestamp.
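In Postgres it's a one-liner (database name and zone are just examples):

    -- new connections now resolve ::date casts, date_trunc(), etc.
    -- in the team's local time instead of UTC
    ALTER DATABASE analytics SET timezone TO 'America/New_York';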
by elchief on 5/23/21, 12:58 AM
https://dba.stackexchange.com/questions/12991/ready-to-use-d...
Instead of soft deletes, move records to a history table
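A minimal Postgres sketch of that pattern (assumes a users table, and Postgres 11+ for EXECUTE FUNCTION):

    -- same shape as the live table; deleted rows land here
    CREATE TABLE users_history (LIKE users);

    CREATE FUNCTION archive_deleted_user() RETURNS trigger AS $$
    BEGIN
        INSERT INTO users_history SELECT OLD.*;
        RETURN OLD;  -- returning OLD lets the DELETE proceed
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER users_archive
        BEFORE DELETE ON users
        FOR EACH ROW EXECUTE FUNCTION archive_deleted_user();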
I agree with the session issue. Had to rebuild sessions before and it's a pain compared to just recording them at the source.
by jerrysievert on 5/23/21, 1:35 AM
An index for every column in the database, then wondering why inserts are slow.
Seriously?
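For anyone who hasn't hit this: every secondary index is another structure the database has to maintain on every write, so it adds up fast. A contrived sketch:

    CREATE TABLE orders (
        id          bigserial PRIMARY KEY,  -- index #1 (implicit)
        customer_id bigint,
        status      text,
        created_at  timestamptz
    );
    CREATE INDEX ON orders (customer_id);   -- index #2
    CREATE INDEX ON orders (status);        -- index #3
    CREATE INDEX ON orders (created_at);    -- index #4
    -- every INSERT now updates four btrees instead of one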
by eterm on 5/22/21, 7:16 PM
Even most of these worries, such as soft deletes, disappear if you're not trying to keep every scrap of data you can.
Focus on the core business requirements and competencies, and you likely don't need to store the minutiae of every interaction forever.
by cjfd on 5/23/21, 6:32 AM
by Pxtl on 5/22/21, 6:54 PM
by intricatedetail on 5/23/21, 10:26 AM