from Hacker News

Databases and why their complexity is now unnecessary

by adamfeldman on 1/9/24, 6:01 PM with 352 comments

by davedx on 1/9/24, 7:01 PM
> The better approach, as we’ll get to later in this post, is event sourcing plus materialized views.
Right, so the solution is more complexity? Of course it is. Sigh
by pgaddict on 1/9/24, 7:46 PM
Did I miss something, or does that post completely omit concepts like concurrency, isolation, constraints and such? And are they really suggesting "query topologies" (which seem very non-declarative and essentially making query planning/optimization responsibility of the person writing them) are a superior developer environment?
by bob1029 on 1/9/24, 6:39 PM
> No single data model can support all use cases.
In theory, there is no domain (or finite set of domains) that cannot be accurately modeled using tuples of things and their relations.
Practically speaking, the scope of a given database/schema is generally restricted to one business or problem area, but even this doesn't matter as long as the types aren't aliasing inappropriately. You could put a web retailer and an insurance company in the same schema and it would totally work if you are careful with naming things.
Putting everything into exactly one database is a superpower. The #1 reason I push for this is to avoid the need to conduct distributed transactions across multiple datastores. If all business happens in one transactional system, your semantics are dramatically simplified.
by russdpale on 1/9/24, 7:47 PM
Seems like a bunch of buzzwords and such. I've been working with databases for years for one of the largest companies in the world and no one has ever said "topology" before.
Any time I would save with this is wasted on learning java and this framework.
There isn't anything wrong with databases.
by shay_ker on 1/9/24, 6:57 PM
What's an ELI5 of Rama? I found the docs confusing as well: https://redplanetlabs.com/docs/~/index.html
Please no buzzwords like "paradigm shift" or "platform". If diagrams are necessary, I'd love to read a post that explains clearer.
by danscan on 1/10/24, 4:29 PM
I did a year long project to build a flexible engine for materialized views onto 1-10TB live event datasets, and our architecture was roughly converging toward this idea of "ship the code to where the indexes are" before we moved onto a different project
I'm very compelled by Rama, but unfortunately won't adopt it due to JVM for totally irrational reasons (just don't like Java/JVM). Would love to see this architecture ported!
by kgeist on 1/10/24, 8:52 PM
>The solution is to treat these two concepts separately. One subsystem should be used for representing the source of truth, and another should be used for materializing any number of indexed stores off of that source of truth. Once again, this is event sourcing plus materialized views.
At work we decouple the read model from the write model: the write model ("source of truth") consists of traditional relational domain models with invariants/costraints and all (which, I think, is not difficult to reason about for most devs who are already used to ORM's), and almost every command also produces an event which is published to the shared domain event queue(s). The read model(s) are constructed by workers consuming events and building views however they fit (and they can be rebuilt, too). For example, we have a service which manages users ("source of truth" service), and another service is just a view service (to show a complex UI) which builds its own read model/index based on the events of the user service (and other services). Without it, we'd have tons of joins or slow cross-service API calls.
Technically we can replay events (in fact, we accidentally once did it due to a bug in our platform code when we started replaying ALL events for the last 3 years) but I don't think we ever really needed it. Sometimes we need to rebuild views due to bugs, but we usually do it programmatically in an ad hoc manner (special scripts, or a SQL migration). I don't know how our architecture is properly called (I never heard anyone call it "event sourcing").
It's just good old MySQL + RabbitMQ and a bit of glue on top (although not super-trivial to do properly I admit: things like transactional outboxes, at least once delivery guarantee, eventual consistency, maintaining correct event processing order, event data batching, DB management, what to do if an event handler crashes? etc.) So I wonder, what we're missing without Rama with this setup, what problems it solves and how (from the list above) provided we already have our battle-tested setup and it's language-agnostic (we have producers/consumers both in PHP and Go) while Rama seems to be more geared towards Java.
by avereveard on 1/10/24, 7:58 PM
Eh materializing data upon mutation can bring you some gains if your product does like one thing and needs to do it very fast. But as soon as you get complex transactions with things that need to be updated in a atomic write or you want to add a new feature that needs data organized in a different way then you're in trouble.
Also deeply unsatisfied of "just slap an index on it" that was lightly trow around on the part about building an application. The index is a global state, it was just moved one step further down the layer.
by ram_rar on 1/9/24, 7:12 PM
Even after reading this doc [1], I am not clear on who is the target audience and what are you trying to solve? It would be helpful to take a real world example and translate how easy /efficient it would be to do this via RAMA.
[1] https://redplanetlabs.com/docs/~/why-use-rama.html#gsc.tab=0
by ecshafer on 1/10/24, 8:27 PM
I don't see how you can claim this is proved by a "twitter scale mastodon client" unless you are actually running a 40m daily user website. Simulating a real environment, and the accompanying code and infra changes, real users, network usage, etc is impossible.
by brianmcc on 1/9/24, 7:11 PM
We do go in circles/cycles quite a lot as an industry. I wonder if the trend is back towards SQL, right now, too many teams been burned by Event Sourcing when they just needed a decent SQL DB? Just idle conjecture....
by kopos on 1/10/24, 8:25 AM
The comments here are needlessly pessimistic and dismissive of a new data flow paradigm. In fact, this looks like the best NoSQL experience there is. SQL while is a standard now, had to prove itself many times over and also was a result of a massive push by few big tech backers.
Rama still looks like it needs some starter examples - that is all.
From what i could gather reading the documentation over few weeks... Rama is an engine supporting Stored Procedure over NoSQL systems. That point alone is worth a million bucks. I hope it lives up to the promise.
Now back to my coding :D
by bccdee on 1/10/24, 7:01 PM
Reminds me a lot of "Turning the Database Inside-Out"[1], but I think Red Planet Labs is overstating their point a little. TtDIO is a lot more careful with its argument, and it doesn't claim to have some sort of silver bullet to sell me.
[1]: https://www.confluent.io/blog/turning-the-database-inside-ou...
by chrisjc on 1/10/24, 6:03 PM
I haven't read through all of the documentation and while I actually love Java, I'm surprised that there isn't some kind of declarative language (DDL but more than just the "data" in Data Description Language) even if that means relying on non-standard SQL objects/conventions.
```
    CREATE OR REPLACE MODULE MY_MOD ...
    CREATE OR REPLACE PSTATE MY_MOD.LOCATION_UPDATE (USER_ID NUMBER, LOC...
    CREATE PACKAGE MY_PACKAGE USING MY_MOD
    DEPLOY OR REDEPLOY MY_PACKAGE TASKS = 64 THREADS=16 ...
```
Perhaps the same could be said for DML (Data Manipulation Language). I can imagine most DML operations (insert/update/delete/merge) could be used, while event-source occurs behind the scenes with the caller being none the wiser. Might there be an expressive way to define the serialization of parts of the DML (columns) down to the underlying PState. After all, if the materialized version of the PStates is based on expressions to the underlying data, then surely the reverse expression would be enough to understand how to mutate said underlying data. Or at least a way for Rama to derive the respective event-sourcing processes and handle it behind the scenes? Serialization/deserialization could also defined in SQL-like expressions as part of the schema/module.
I say all of this while being acutely aware that there is undoubtedly as many people out there that dislike SQL as there are that dislike Java, or maybe more.
I really like this:
> Every backend that’s ever been built has been an instance of this model, though not formulated explicitly like this. Usually different tools are used for the different components of this model: data, function(data), indexes, and function(indexes).
by the_duke on 1/9/24, 7:20 PM
Every time I tried to use event sourcing I have regretted it, outside of some narrow and focused use cases.
In theory ES is brilliant and offers a lot of great functionality like replaying history to find bugs, going back to any arbitrary point in history, being able to restore just from the event log, diverse and use case tailored projections, scalability, ...
In practice it increases the complexity to the point were it's a pointless chore.
Problems:
* the need for events, aggregates and projections increases the boilerplate tremendously. You end up with lots of types and related code representing the same thing. Adding a single field can lead to a 200+ LOC diff
* a simple thing like having a unique index becomes a complex architectural decision and problem ... do you have an in-memory aggregate? That doesn't scale. Do you use a projection with an external database? well, how do you keep that change ACID? etc
* you need to keep support for old event versions forever, and either need code to cast older event versions into newer ones, or have a event log rewrite flow that removes old events before you can remove them from code
* if you have bugs in you can end up needing fixup events / event types that only exist to clean up , and as above, you have to keep that around for a long time
* similarly, bugs in projection code can mess up the target databases and require cumbersome cleanup / rebuilding the whole projection
* regulation like GDPR requires deleting user data, but often you can't / don't want to just delete everything, so you need an anonimizing rewrite flow. it can also become quite hard to figure out where the data actually is
* the majority of use cases will make little to no use of the actual benefits
A lot of the above could be fixed with proper tooling. A powerful ES database that handles event schemas, schema migrations, projections, indexes, etc, maybe with a declarative system that also allows providing custom code where necessary.
I'll take a look at Rama I guess.
by cmrdporcupine on 1/9/24, 7:09 PM
Data models are restrictive
That's kind of the point. Model your data. Think about it. Don't (mis)treat your database as a "persistence layer" -- it's not. It's a knowledge base. The "restriction" in the relational model is making you think about knowledge, facts, data, and then structure them in a way that is then more universal and less restrictive for the future.
Relations are very expressive and done right is far more flexible than the others named there. That was Codd's entire point:
https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
"Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation) ..." and then goes on to explain how the predicate-logic based relational data model is a more universal and flexible model that protects users/developers from the static impositions of tree-structured/network structure models.
All the other stuff in this article is getting stuck in the technical minutiae of how SQL RDBMSs are implemented (author seems obsessed with indexes). But that's somewhat beside the point. A purely relational database that jettisons SQL doesn't have to have the limitations the author is poking at.
It's so frustrating we're still going over this stuff decades later. This was a painful read. People developing databases should already be schooled in this stuff.
by specialist on 1/9/24, 8:09 PM
https://blog.redplanetlabs.com/2024/01/09/everything-wrong-w...
> It’s common to instead use adapter libraries that map a domain representation to a database representation, such as ORMs. However, such an abstraction frequently leaks and causes issues. ...
FWIW, I'm creating a tool (strategy) that is neither an ORM or an abstraction layer (eg JOOQ) or template-based (eg myBatis). Just type safe adapters for normal SQL statements.
Will be announcing an alpha release "Any Week Now".
If anyone has an idea for how to monetize yet another database client library, I'm all ears. I just need to eat, pay rent, and buy dog kibble.
by manicennui on 1/9/24, 7:35 PM
A lot of these same problems were solved in a similar way with Datomic and xtdb.
https://www.datomic.com/ https://xtdb.com/
by kaba0 on 1/9/24, 7:46 PM
> However, storing data normalized can increase the work to perform queries by requiring more joins. Oftentimes, that extra work is so much you’re forced to denormalize the database to improve performance.
Databases have materialized views though, that solves this problem.
by w10-1 on 1/10/24, 5:39 PM
I was in favor of doubling the complexity by prefixing RDB with event logs, for retrospective QA/analysis and prospective client segregation.
Databases now are a snapshot of the data modeling and usage at a particular point in application lifecycle. We manage to migrate data as it evolves, but you can't go back in time.
Why go back? In our case, our interpretation of events (as we stuffed data into the DB) was hiding the data we actually needed to discover problems with our (bioinformatics and factory) workflow - the difference between expected and actual output that results from e.g., bad batches of reagent or a broken picker tip. We only stored e.g., the expected blend of reagents because that's all we needed for planning. That meant we had no way to recover the actions leading to that blend for purposes of retrospective quality analysis.
So my proposal was to log all actions, derive models (of plate state) as usual for purpose of present applications, but still be able to run data analysis on the log to do QA when results were problematic.
Ha ha! They said, but still :)
Event prefixing might also help in the now/later design trade-off. Typically we design around requirements now, and make some accommodation for later if it's not too costly. Using an event log up front might work for future-proofing. It also permits "incompatible" schema to co-exist for different clients, as legacy applications read the legacy downstream DB, while new ones read the upcoming DB.
For a bio service provider, old clients validate a given version, and they don't want the new model or software, while new clients want the new stuff you're building for them. You end up maintaining different DB models and infrastructure -- yuck! But with event sourcing, you can at least isolate the deltas, so e.g., HIPAA controls and auditing live in the event layer, and don't apply to the in-memory bits.
TBH, a pitch like Rama's would play better in concert with existing DB's, to incrementally migrate the workflows that would benefit from it. Managers are often happy to let IT entrepreneurs experiment if it keeps them happy and away from messing with business-critical functions.
YMMV...
by MagicMoonlight on 1/10/24, 10:25 AM
If you gut all the features like persistence and rolling back errors then you can definitely make things less complex.
But then someone wants to access their email and it turns out the server restarted so it’s gone.
by jakozaur on 1/9/24, 7:34 PM
My biggest problem with databases is that they are very hard to evolve. They accumulate a history of decisions and are in a suboptimal state. Legacy is widespread in enterprises. Oracle is still milking $ 50B+ annually, and the databases are the primary driver of why you need Oracle and why they can upsell you other products after a compliance audit.
The schema changes are hard (e.g. try to normalize/denormalize data), production is the only environment when things go wrong, in-place changes with untested revert options are default, etc.
by dcow on 1/11/24, 1:58 AM
I get weird looks when I tell people we ran for 3.5 years on an s3api in front of bucket storage. It scaled to meet our needs and was especially appropriate for our app’s storage profile. And now that the startup doesn’t exist I’m glad that I never wasted time messing with “real” DBs. There’s definitely an industry bias toward using DBs.
by csours on 1/10/24, 4:23 PM
How about this: ACID RDBMS in many cases are sugar. That is, they provide very NICE features, but those features can be implemented in other ways. In the cloud world, the sugar may not be worth the costs.
I think the weak case is much stronger than the strong case - that is, you can refactor to remove RDBMS dependencies; but that moves the complexity elsewhere.
by nojvek on 1/10/24, 1:51 AM
Read the tutorial
https://redplanetlabs.com/docs/~/tutorial1.html#gsc.tab=0
This is quite complex compared to setting up Postgres or mysql and sending some sql over some port.
I’m not sure I get what they are selling.
by jrockway on 1/10/24, 6:20 PM
A few years ago I tried writing an application (something like Status Hero for internal use) with a non-traditional database. I used Badger, which is just a transactional k/v store, and stored each "row" as protobuf value and an ID number key. (message { int id = 1 }, query by type + ID, store anything with interface { GetId() int }.)
I had additional messages for indexes and per-message-type IDs. (I like auto-incrementing IDs, sue me.) A typical transaction would read indexes, retrieve rows, manipulate them, save the rows, save the indexes, and commit.
The purity in my mind before I wrote the application was impressive; this is all a relational database is doing under the hood (it has some bytes and some schema to tell it what the bytes mean, just like protos). But it was actually a ton of work that distracted me from writing the app. The code to handle all the machinery wasn't particularly large or anything, but the app also wasn't particularly large.
I would basically say, it wasn't worth it. I should have just used Postgres. The one ray of sunshine was how easy it is to ship a copy of the database to S3; the app just backed itself up every hour, which is a better experience than I've had with Postgres (where the cloud provider deletes your backups when you delete the instance... so you have to do your own crazy thing instead).
The article is on-point about managing the lifecycle of data. Database migrations are a stressful part of every deployment. The feature I want is to store a schema ID number in every row and teach the database how to run a v1 query against a v2 piece of data. Then you can migrate the data while the v1 app is running, then update the app to make v2 queries, then delete the v1 compatibility shim. If you store blobs in a K/V store, you can do this yourself. If you use a relational model, it's harder. You basically take the app down that knows v1 of your schema, upgrade all the rows to v2, and deploy the app that understands v2. The "upgrade all the rows to v2" step results in your app being unavailable. (The compromise I've seen, and used, which is horrible is "just let the app fail certain requests while the database is being migrated, and then have a giant mess to clean up when the migration fails". Tests lower the risk of a giant mess, and selective queries result in fewer requests that can't be handled by the being-migrated database, so in general people don't realize what a giant risk they're taking. But it can all go very wrong and you should be horrified when you do this.)
by kevsim on 1/9/24, 7:11 PM
Is this Rama solution similar to the kind of thing you can get with Kafka with KTables?
If so I'd be curious how they've solved making it in anyway operational less complex to manage then a database. It's been a few years since I've run Kafka but it used to kind of be a beast.
by cryptonector on 1/10/24, 8:57 PM
Event sourcing (+ materialized views and indices) != abandon your RDBMS. You can have both. Though you might find that traditional RDBMSes don't optimize well enough in the event sourcing (+ materialized views and indices) model.
by lambda_garden on 1/10/24, 6:07 PM
```
    indexes = function(data)
    query = function(indexes)
```
How does this model a classic banking app where you need to guarantee that transfers between accounts are atomic?
by estebarb on 1/10/24, 5:31 PM
I remembers me CouchDB + incremental map reduce. Except that in CouchDB you can mutate state. Idk, but keeping all the history doesn't take a lot of space?
by marcosdumay on 1/9/24, 7:42 PM
Well, I guess it's official. The belief on software architecture dogma is so strong that we can consider it a church.
I imagine that will bring great tax benefits to programing schools.
by skywhopper on 1/9/24, 7:27 PM
This is marketing spiel masquerading as a bad take. Rama may or may not be cool tech, but the idea that they are anywhere close to being able to get rid of structured database systems for complex systems is absolutely laughable to the point that it makes me uninterested in learning more about the tech. Please tone down the hyperbole if you want serious attention.
by strangattractor on 1/10/24, 6:58 PM
Seems similar datomic.
https://www.datomic.com/benefits.html
by pulse7 on 1/10/24, 4:32 PM
Could be generalized to "Everything wrong with <placeholder> and why their complexity is now unnecessary"...
by sigmonsays on 1/10/24, 5:02 PM
this seems like a classic bait and switch post selling a product called rama
The approach here seems drastically more complicated; for simple apps, you go for a well known master->slave setup. For complicated apps you scale (shard, cluster, etc).
Pick your database appropriately
by LispSporks22 on 1/9/24, 7:26 PM
What’s micro batch streaming?
by HackerThemAll on 1/10/24, 1:21 PM
A lame marketing mumbling to persuade people to buy a specific product.
by tpl on 1/10/24, 6:03 PM
What sort of cost increase can I expect using something like this?
by es7 on 1/10/24, 3:03 PM
Did anyone else think this was satire for the first few minutes of reading it?
Calling databases global state and arguing why they shouldn’t be used was ridiculous enough that I wanted to call Poe’s Law here.
But it does look like the author was sincere. Event Sourcing is one of those cool things that seem great in theory but in my experience I’ve never seen it actually help teams produce good software quickly or reliably.
by qaq on 1/10/24, 6:13 AM
Most SQL RDBMS are a materialised view over a transaction log.
by continuational on 1/9/24, 7:44 PM
Does it have ACID transactions?
Does the indexes have read after write guarantees?
by 0xbadcafebee on 1/9/24, 7:19 PM
strapping in for the clickbait blog post...
"Global mutable state is harmful" - well... yes, that's totally correct. "The better approach [..] is event sourcing plus materialized views." .....errr... that's one approach. we probably shouldn't hitch all our ponies to one post.
"Data models are restrictive" - well, yes, but that's not necessarily a bad thing, it's just "a thing". "If you can specify your indexes in terms of the simpler primitive of data structures, then your datastore can express any data model. Additionally, it can express infinite more by composing data structures in different ways" - perhaps the reader can see where this is a bad idea? by allowing infinite data structures, we now have infinite complexity. great. so rather than 4 restrictive data models, we'll have 10,000.
"There’s a fundamental tension between being a source of truth versus being an indexed store that answers queries quickly. The traditional RDBMS architecture conflates these two concepts into the same datastore." - well, the problem with looking at it this way is, there is no truth. if you give any system enough time to operate, grow and change, eventually the information that was "the truth" eventually receives information back from something that was "indexing" the truth. "truth" is relative. "The solution is to treat these two concepts separately. One subsystem should be used for representing the source of truth, and another should be used for materializing any number of indexed stores off of that source of truth." this will fail eventually when your source of truth isn't as truthy as you'd like it to be.
"The restrictiveness of database schemas forces you to twist your application to fit the database in undesirable ways." - it's a tool. it's not going to do everything you want, exactly as you want. the tradeoff is that it does one thing really specifically and well.
"The a la carte model exists because the software industry has operated without a cohesive model for constructing end-to-end application backends." - but right there you're conceding that there has to be a "backend" and "frontend" to software design. your models are restrictive because your paradigms are. "When you use tooling that is built under a truly cohesive model, the complexities of the a la carte model melt away, the opportunity for abstraction, automation, and reuse skyrockets, and the cost of software development drastically decreases." - but actually it's the opposite: a "cohesive model" just means "really opinionated". a-la-carte is actually a significant improvement over cohesion when it is simple and loosely-coupled. there will always be necessary complexity, but it can be managed easier when individual components maintain their own cohesion, and outside of those components, maintain an extremely simple, easy interface. that is what makes for more composable systems that are easier to think about, not cohesion between all of the components!
"A cohesive model for building application backends" - some really good thoughts in the article, but ultimately "cohesion" between system components is not going to win out over individual components that maintain their cohesion and join via loosely-coupled interfaces. if you don't believe me, look at the whole Internet.
by phartenfeller on 1/10/24, 10:06 AM
I have been working as a database consultant for a few years. I am, of course, in my bubble, but there are a few things I really don't enjoy reading.
> No single data model can support all use cases. This is a major reason why so many different databases exist with differing data models. So it’s common for companies to use multiple databases in order to handle their varying use cases.
I hate that this is a common way of communicating this nowadays. Relational has been the mother of all data models for decades. In my opinion, you need a good reason to use something different. And this is also not an XOR. In the relational world, you can do K/V tables, store and query documents, and use graph functions for some DBs. And relational has so many safety tools to enforce data quality (e.g. ref. integrity, constraints, transactions, and unique keys). Data quality is always important in the long run.
> Every programmer using relational databases eventually runs into the normalization versus denormalization problem. [...] Oftentimes, that extra work is so much you’re forced to denormalize the database to improve performance.
I was never forced to denormalize something. Almost always, poor SQL queries are a problem. I guess this can be true for web hyperscalers, but these are exceptions.
by morsecodist on 1/9/24, 7:08 PM
This is more an advertisement for a type of database than a statement that they are unnecessary.
From what I can tell in the article it seems their differentiator is Event Sourcing and having arbitrary complex index builders on top of the events. It seems similar to EventStoreDB[1].
I have always been interested by the concept of an event sourcing database with projections and I want to build one eventually so it is interesting to see how they have approached the problem.
Also they mention on their site:
> Rama is programmed entirely with a Java API – no custom languages or DSLs.
It makes sense why they have gone this route if they want a "Turing-complete dataflow API" but this can be a major barrier to adoption. This is a big challenge with implementing these databases in my opinion because you want to allow for any logic to build out your indexes/projections/views but then you are stuck between a new complicated DSL or using a particular language.
1: https://developers.eventstore.com/server/v23.10/#getting-sta...
by register on 1/10/24, 7:49 AM
Pure B...t. The title is deceiving and should be instead something along the lines of: How to architect an application at Mastodon scale without relying on databases. Also I would be very interested in seeing the actual technology rather than reading sensational claims about the unparalled level of scalability it supports. What does it provide in order to recover from failure and exceptions and to guarantee consistency of state?
Relational databases are and will always be necessary as they provide a convenient model for querying, aggregating, joining and reporting data.
Much of the value in a database lies in how it supports extracting value from business information rather what extreme scalability features it supports.
Try to create a decent business report from events and then we can speak again.
by benlivengood on 1/10/24, 6:04 AM
I didn't notice a mention of transactions in the article, nor of constraints. It's all fine to claim that you can compose arbitrary event source domains together and query them but IMHO the biggest power of RDBMS are transactions and constraints for data integrity. Maybe Rama comes with amazing composability features that ensure cross-domain constraints, but I would be really surprised if they can maintain globally consistent real-time transactions.
I've worked on huge ETL pipelines with materialized views (Photon, Ubiq, Mesa) and the business logic in Ubiq to materialize the view updates for Mesa was immense. None of it was transactional; everything was for aggregate statistics and so it worked well. Ads-DB and Payments used Spanner for very good reasons.
by jcrawfordor on 1/10/24, 7:36 PM
I feel like I went into this from a position of genuine interest, I'm always on the lookout for significant developments in backend architecture.
But when I hit the sentence "This can completely correct any sort of human error," I actually laughed out loud. Either the author is overconfident or they have had surprisingly little exposure to humans. More concretely, it seems to completely disregard the possibility of invalid/improper/inconsistent events being introduced by the writer... the way that things go wrong. And I don't see any justification for disregarding this possibility, it's just sort of waved away. That means waving way most of the actual complexity I see in this design, of having to construct your PState data models from your actual, problematic event history. Anyone that's worked with ETLs over a large volume of data will have spent many hours on this fight.
I think the concept is interesting, but the revolutionary zeal of this introduction seems unjustified. It's so confident in the superiority of Rama that I have a hard time believing any of the claims. I would like to see a much more balanced compare/contrast of Rama to a more conventional approach, and particularly I would like to see that for a much more complex application than a Twitter clone, which is probably just about the best possible case for demonstrating this architecture.
by fifticon on 1/10/24, 7:34 AM
I work in a shop with about 6 years of event-sourcing experience (as in, our production has run on eventsourcing since 2017). My view is, that 'humans are not mature enough for eventsourcing'. For eventsourcing to work sanely, it must be used responsibly. Reality is, that people make mistakes, and eventsourcing HURTS whenever your developers don't act maturely on the common history you have built. For us, it has meant a bungee jump of "move ALL the things to eventsourcing", followed by a long slow painful 'move everything that doesn't NEED eventsourcing out of eventsourcing again, into relational database, and only keep the relevant eventsourcing parts in the actual eventsource db'.
The main consequences for us have been 'consume a huge/expensive amount of resources' to do what we already did earlier, with vastly fewer resources, at the benefit of having some things easier to do, and a lot of other things suddenly complex. In particular, it was not a 'costless abstraction', instead it forced us to always consider the consequences for our eventsourcing.
by winrid on 1/10/24, 8:12 AM
I get what they're trying to do but I'm not sure this [0] syntax is the answer.
[0] https://github.com/redplanetlabs/twitter-scale-mastodon/blob...
by gwbas1c on 1/10/24, 11:49 AM
I spent a lot of time reading this yesterday, and started looking at Rama's docs.
I think the a database that encapsulates denormalization so that derived views (caches, aggregations) are automatic is a killer feature. But far too often awesome products and ideas fail for trivial reasons.
In this case, I just can't understand how Rama fits into an application. For example:
Every example is Java. Is Rama only for Java applications? Or, is there a way to expose my database as a REST API? (That doesn't require me to jump through a million hoops and become an expert in the Java ecosystem?)
Can I run Rama in Azure / AWS / Google cloud / Oracle cloud? Are there pre-built docker images I can use? Or is this a library that I have to suck into a Java application and use some kind of existing runtime? (The docs mention Zookeeper, but I have very little experience with it.)
IE: It's not clear where the boundary between my application (Java or not) and Rama is. Are the examples analogous to sprocs (run in the DB) or business logic (run in the application)?
The documentation is also very hard. It appears the author has every concept in their head, because they know Rama inside and out, yet can't emphasize with the reader and provide simpler bits of information that convey useful concepts. There's both "too much" (mixing of explaining pstates and the depot) and "too little" (where do I host it, what is the boundary between Rama and my application?)
Another thing I didn't see mentioned is tooling: every SQL database has at least one general SQL client. (MSSQL studio, Azure data studio,) that allows interacting with the database. (Viewing the schema, ad-hoc queries, ECT.) Does Rama have this, or is every query a custom application?
Anyway, seems like a cool idea, but it probably needs some well-chosen customers who ask tough questions so the docs become mature.
by keeganpoppen on 1/10/24, 6:22 PM
this comment section has gotta be in the absolute upper echelons of non-RTFA i have seen on HN in a long time. even for HN. i acknowledge my own bias, though: i’ve been an admirer of nathan marz’s work from afar for years now, and basically trust him implicitly. but… wow. what fraction of the comments even engage with the substance of the article in any way? it’s not like they didn’t put their money where their mouth(s) is/are: they feel strongly enough about the problem that they built an entire goddamn “dont call it a database” (and business) around it.
i’ve always been pretty sympathetic to code-/application- driven indexing, storage, etc.— it just seems intuitively more correct to me, if done appropriately. the biggest “feature” of databases, afaict, is that most people dont trust themselves to do this appropriately xD. and they distrust themselves in this regard so thoroughly that they deny the mere possibility of it being useful. some weird form of learned helplessness. you can keep cramming all of your variously-shaped blocks into tuple-shaped holes if you want, but it seems awfully closed-minded to deny the possibility of a better model on principle. what principle? the lindy effect?
by big_whack on 1/9/24, 7:43 PM
A lot of the commenters seem like database fans instinctively jumping to defend databases. The post is talking about contexts where you are dealing with petabytes of data. Building processing systems for petabytes has a separate set of problems from what most people have experienced. Having a single Postgres for your startup is probably fine, that's not the point here.
There is no option to just "put it all in a database". You need to compose a number of different systems. You use your individual databases as indexes, not as primary storage, and the primary storage is probably S3. The post is interesting and the author has been working on this stuff for a while. He wrote Apache Storm and used to promote some of these concepts as the "Lambda architecture" though I haven't seen that term in a while.
by saberience on 1/10/24, 7:20 PM
Such a poorly written article doesn't encourage me to use a brand new and untrusted database; if you can't write a clear article, why would I trust your database code?
This is a thinly veiled ad for Rama but the explanation for why it's so much "better" isn't clear and doesn't make much sense. I strongly urge the author to work with some who is a clear and concise technical writer to help with articles such as these.
by fipar on 1/10/24, 6:40 PM
I can't wrap my head around the way this solves the global mutable state problem.
First, here's what I do understand about databases and global state: compared to programming variables, I don't think databases are shared, mutable global state. Instead, I see them as private variables that can be changed through set/get methods (e.g., with SQL statements if on such a DB).
So I agree shared, global state is dangerous (I'm not sure I'd call it harmful) and the reason I like databases is that I assume a DB, being specialized at managing data, will do a better job at protecting the integrity of that global state than I'd do myself from my program.
With luck, there may even be a jepsen test of the DB I'm using that lets me know how good the DB is at doing this job.
In this post there's an example of a question we'd ask Rama: “What is Alice’s current location?”
How's that answered without global state?
Because of the mention of event sourcing, I'd guess there's some component that knows where Alice was when the system was started, and keeps a record of events every time she changes her place. If Alice were the LOGO turtle, this component would keep a log with entries such as "Left 90 degrees" or "Forward 10 steps".
If I want to know where Alice is now, I just need to access this log and replay everything and that'd be my answer.
Now, I'm certain my understanding here must be wrong, at least on the implementation side, because this wouldn't be able to scale to the mastodon demo mentioned in the post, which makes me very curious: how does Rama solve the problem of letting me know where Alice is without giving me access to her state?
by jmull on 1/9/24, 9:48 PM
Pretty interesting once you read past the marketing push.
I mostly like the approach, but there are a lot of questions/issues that spring to mind (not that some of them don't already have answers, but I didn't read everything). I'll list some of them:
* I'm pretty sure restrictive schemas are a feature not a bug, but I suppose you can add your own in your ETL "microbatch streaming" implementation (if I'm reading this right, this is where you transform the events/data that have been recorded to the indexed form your app wants to query). So you could, e.g., filter out any data with invalid schema, and/or record and error about the invalid data, etc. A pain, though, for it to be a separate thing to implement.
* I'm not that excited to have my data source and objects/entities be Java.
* The Rama business model and sustainability story seem like big question marks that would have to have strong, long-lasting answers/guarantees before anyone should invest too much in this. This is pretty different and sits at a fundamental level of abstraction. If you built on this for years (or decades) and then something happened you could be in serious trouble.
* Hosting/deployment/resources-needed is unclear (to me, anyway)
* Quibble on "Data models are restrictive": common databases are pretty flexible these days, supporting different models well.
* I'm thinking a lot of apps won't get too much value from keeping their events around forever, so that becomes a kind of anchor around the neck, a cost that apps using Rama have to pay whether they really want it or not. I have questions about how that scales over time. E.g., say my has depot has 20B events and I want to add an index to a p-state or a new value to an enum... do I need to ETL 20 billion events to do routine changes/additions? And obviously schema changes get a lot more complicated than that. I get that you could have granular pstates but then I start worrying about the distributed nature of this. I guess you would generally do migrations by creating new pstates with the new structure, take as much time as you need to populate them, then cut over as gradually as you need, and then retire the old pstates on whatever timeline you want.... But that's a lot of work you want to avoid doing routinely, I'd think.
I'm starting to think of more things, but I better stop (my build finished long ago!)
by coldtea on 1/10/24, 10:07 AM
"We announced Rama on August 15th with the tagline “the 100x development platform”."
Alternative to database company peddling their ware with misinformed rant.
by twotwotwo on 1/10/24, 5:35 PM
A very simple thing about this (and many systems!) is that if your whole thing is "log writes and do the real work later", you lose read-your-writes and with it the idea that your app has a big persistent memory to play in.
This doesn't only matter if you're doing balance transfers or such; "user does a thing and sees the effects in a response" is a common wish. (Of course, if you're saving data for analytics or such and really don't care, that's fine too.)
When people use eventually-consistent systems in domains where they have to layer on hacks to hide some of the inconsistency, it's often because that's the best path they had out of a scaling pickle, not because that's the easiest way to build an app more generally.
I guess the other big thing is, if you're going to add asynchrony, it's not obvious this is where you want to add it. If you think of ETLs, event buses, and queues as tools, there are a lot more ways to deploy them--different units of work than just rows, different backends, different amounts of asynchrony for different things (including none), etc. Why lock yourself down when you might be able to assemble something better knowing the specifics of your situation?
This company's thing is riding the attention they get by making goofy claims, so I'm a bit sorry to add to that. I do wonder what happens once they're talking to actual or potential customers, where you can't bluff indefinitely.
by igammarays on 1/10/24, 6:12 PM
As a simple soloprenuer full stack dev who's never worked on an application serving more than a few thousand users, I can understand and relate to all of the problems written about here (some very compelling arguments), found myself nodding most of the way through, but I simply don't understand the proposed solution. Even the Hello World example from the docs flew over my head. And I've been programming apps in production for 15 years, and I like Java.
This needs a simple pluggable adaptor for some popular frameworks (Django, Laravel, or Ruby on Rails) and then I can begin to have an idea how this actually be used in my project.
by thaanpaa on 1/10/24, 5:42 PM
Disk drives are also large global mutable states. So is RAM at the operating system level.
The article conflates the concept of data storage with best programming practices. Sure, you should not change the global state throughout your app because it becomes impossible to manage. The database is actually an answer to how to do it transactionally and centrally without messing up your data.
by keeganpoppen on 1/10/24, 6:04 PM
this post is actually extremely based
by intrasight on 1/9/24, 7:08 PM
"Everything wrong blogs that do not support reader view"
Why does anyone create a blog like that?