by CaptainJustin on 5/11/22, 4:42 PM with 38 comments
What are the options we have for handling state at the edge? What do you use in your business or service?
by powersurge360 on 5/11/22, 6:26 PM
You can do basically the same thing with any relational database: have a write leader somewhere and a bunch of read replicas that live close to the edge.
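At the application level, that split can be as simple as two connection pools; a rough node-postgres sketch (connection strings and table names are invented for illustration):

```typescript
import { Pool } from "pg";

// Writes always go to the single write leader.
const primary = new Pool({ connectionString: process.env.PRIMARY_DATABASE_URL });
// Reads go to whichever replica is closest to this edge location.
const replica = new Pool({ connectionString: process.env.NEAREST_REPLICA_URL });

export async function getProfile(userId: string) {
  // Read path: served from the nearby replica (may lag the leader slightly).
  const { rows } = await replica.query(
    "SELECT id, name FROM profiles WHERE id = $1",
    [userId]
  );
  return rows[0];
}

export async function renameProfile(userId: string, name: string) {
  // Write path: always routed back to the leader, wherever it lives.
  await primary.query("UPDATE profiles SET name = $2 WHERE id = $1", [userId, name]);
}
```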
There are also what you would call cloud-native data stores that purport to solve the same issue, but I don't know much about how they work because I much prefer working w/ relational databases and most of those stores are NoSQL. And I haven't had to actually solve the problem for work yet, so I also haven't made any compromises yet in how I explore it.
Another interesting way to go might be CockroachDB. It's wire compatible w/ PostgreSQL and supposedly clusters automatically and shares data within the cluster. I don't know very much about it, but it seems to be becoming more and more popular, and many ORMs seem to have an adapter to support it. It may also be worth looking into because, if it works as advertised, you get an RDBMS that you can deploy to an arbitrary number of places, configure the nodes to talk to one another, and not have to worry about replicating the data or routing correctly to write leaders and all that.
And again, I'm technical, but I haven't solved these problems so consider the above to be a jumping off point and take nothing as gospel.
by don-code on 5/11/22, 4:54 PM
While it's possible to distribute state to many AWS regions and select the closest one, I ended up going a different route: packaging state alongside the application. Most of the application's state was read-only, so I ended up packaging the application state up as JSON alongside the deployment bundle. At startup, it'd then statically read the JSON into memory - this performance penalty only happens at startup, and as long as the Lambda functions are being called often (in our case they are), requests are as fast as a memory read.
When the state does need to get updated, I just redeploy the application with the new state.
That strategy obviously won't work if you need "fast" turnaround on your state being in sync at all points of presence, or if users can update that state as part of your application's workflow.
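A minimal sketch of that pattern, with a made-up data/catalog.json file: the JSON is parsed once at module load, so warm invocations only pay for an in-memory lookup.

```typescript
import { readFileSync } from "fs";
import * as path from "path";

// Loaded once per cold start; the file ships inside the deployment bundle.
const catalog: Record<string, { price: number }> = JSON.parse(
  readFileSync(path.join(__dirname, "data", "catalog.json"), "utf8")
);

export const handler = async (event: { sku: string }) => {
  // Warm invocations hit this in-memory object; no network call involved.
  const item = catalog[event.sku];
  return item ?? { error: "unknown sku" };
};
```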
by lewisl9029 on 5/11/22, 8:53 PM
Most of the approaches mentioned here will give you fast reads everywhere, but writes are only fast if you're close to some arbitrarily chosen primary region.
A few technologies I've experimented with for doing fast, eventually consistently replicated writes: DynamoDB Global Tables, CosmosDB, Macrometa, KeyDB.
None of them are perfect, but in terms of write latency, active-active replicated KeyDB in my fly.io cluster has everything else beat. It's the only solution that offered _reliable_ sub-5ms latency writes (most are close to 1-2ms). Dynamo and Cosmos advertise sub-10ms, but in practice, while _most_ writes fall in that range, I've seen them fluctuate wildly to over 200ms (Cosmos was much worse than Dynamo IME), which is to be expected on the public internet with noisy neighbors.
Unfortunately, I got too wary of the operational complexity of running my own global persistent KeyDB cluster with potentially unbounded memory/storage requirements, and eventually migrated most app state over to use Dynamo as the source of truth, with the KeyDB cluster as an auto-replicating caching layer so I don't have to deal with perf/memory/storage scaling and backup. So far that has been working well, but I'm still pre-launch so it's not anywhere close to battle tested.
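For reference, the read-through/write-through shape of that setup might look roughly like this (ioredis talks to KeyDB since it's Redis-protocol compatible; the table and key names are invented):

```typescript
import Redis from "ioredis";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, PutCommand } from "@aws-sdk/lib-dynamodb";

const cache = new Redis(process.env.KEYDB_URL!); // nearest KeyDB node; writes replicate active-active
const dynamo = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export async function readItem(id: string) {
  const hit = await cache.get(`item:${id}`);
  if (hit) return JSON.parse(hit);

  // Cache miss: fall back to the source of truth and backfill the cache.
  const { Item } = await dynamo.send(new GetCommand({ TableName: "items", Key: { id } }));
  if (Item) await cache.set(`item:${id}`, JSON.stringify(Item), "EX", 300);
  return Item;
}

export async function writeItem(item: { id: string }) {
  // Dynamo is the durable source of truth; the cache write fans out via KeyDB replication.
  await dynamo.send(new PutCommand({ TableName: "items", Item: item }));
  await cache.set(`item:${item.id}`, JSON.stringify(item), "EX", 300);
}
```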
Would love to hear stories from other folks building systems with similar requirements/ambitions!
by kevsim on 5/11/22, 7:15 PM
by michaellperry71 on 5/11/22, 8:10 PM
If records are allowed to change, then you end up in situations where changes don't converge. But if you instead collect a history of unchanging events, then you can untangle these scenarios.
Event Sourcing is the most popular implementation of a history of immutable events. But I have found that a different model works better for data at the edge. An event store tends to be centrally located within your architecture. That is necessary because the event store determines the one true order of events. But if you relax that constraint and allow events to be partially ordered, then you can have a history at the edge. If you follow a few simple rules, then those histories are guaranteed to converge.
Rule number 1: A record is immutable. It cannot be modified or deleted.
Rule number 2: A record refers to its predecessors. If the order between events matters, then it is made explicit with this predecessor relationship. If there is no predecessor relationship, then the order doesn't matter. No timestamps.
Rule number 3: A record is identified only by its type, contents, and set of predecessors. If two records have the same stuff in them, then they are the same record. No surrogate keys.
Following these rules, analyze your problem domain and build up a model. The immutable records in that model form a directed acyclic graph, with arrows pointing toward the predecessors. Send those records to the edge nodes and let them make those millisecond decisions based only on the records that they have on hand. Record their decisions as new records in this graph, and send those records back.
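One way to satisfy rule 3 is to derive a record's identity from a hash of its type, contents, and predecessor hashes; a small sketch (the canonicalization here is deliberately naive, just sorted keys):

```typescript
import { createHash } from "crypto";

interface Fact {
  type: string;                      // e.g. "Order.Shipped"
  fields: Record<string, unknown>;   // the immutable contents
  predecessors: string[];            // hashes of the facts this one follows
}

// Identity is derived from the record itself, so identical records collapse
// into one node of the DAG and no surrogate keys are needed.
export function factHash(fact: Fact): string {
  const canonical = JSON.stringify({
    type: fact.type,
    fields: Object.fromEntries(Object.entries(fact.fields).sort()),
    predecessors: [...fact.predecessors].sort(),
  });
  return createHash("sha256").update(canonical).digest("hex");
}
```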
Jeff Doolittle and I talk about this system on a recent episode of Software Engineering Radio: https://www.se-radio.net/2021/02/episode-447-michael-perry-o...
No matter how you store it, treat data at the edge as if you could not update or delete records. Instead, accrue new records over time. Make decisions at the edge with autonomy, knowing that they will be honored within the growing partially-ordered history.
by deckard1 on 5/11/22, 6:35 PM
A number of people are talking about Lambda or loading files, SQLite, etc. These aren't likely to work on CF. CF uses isolated JavaScript sandboxes. You're not guaranteed to have two workers accessing the same memory space.
This is, in general, the problem with serverless. The model of computing is proprietary and very much about the fine print details.
edit: CF just announced their SQLite worker service/API today: https://blog.cloudflare.com/introducing-d1/
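For what it's worth, querying D1 from a Worker looks roughly like this (the DB binding name and schema are assumptions):

```typescript
export default {
  async fetch(request: Request, env: { DB: D1Database }): Promise<Response> {
    const id = new URL(request.url).searchParams.get("id");
    // D1 exposes a prepared-statement style API over SQLite.
    const row = await env.DB
      .prepare("SELECT id, name FROM users WHERE id = ?")
      .bind(id)
      .first();
    return Response.json(row ?? { error: "not found" });
  },
};
```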
by fwsgonzo on 5/11/22, 5:04 PM
If you want to go one step further you can build a VMOD for Varnish to run your workloads inside Varnish, even with Rust: https://github.com/gquintard/vmod_rs_template
by F117-DK on 5/11/22, 4:48 PM
by crawdog on 5/11/22, 5:12 PM
Have your process regularly update the CDB file from a blob store like S3. Any deltas can be pulled from S3 or you can use a message bus if the changes are small. Every so often pull the latest CDB down and start aggregating deltas again.
CDB performs great and can scale to multiple GBs.
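A rough sketch of that refresh loop; openCdb here is a hypothetical reader standing in for whatever CDB library you use, and the bucket/key names are placeholders:

```typescript
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";
import { writeFile } from "fs/promises";
import { openCdb, CdbReader } from "./cdb"; // hypothetical CDB reader, not a real package

const s3 = new S3Client({});
let snapshot: CdbReader;
const deltas = new Map<string, string>(); // changes received since the last snapshot

async function refreshSnapshot() {
  const res = await s3.send(new GetObjectCommand({ Bucket: "my-state", Key: "state.cdb" }));
  await writeFile("/tmp/state.cdb", Buffer.from(await res.Body!.transformToByteArray()));
  snapshot = await openCdb("/tmp/state.cdb");
  deltas.clear(); // start aggregating deltas against the fresh snapshot
}

export function lookup(key: string): string | undefined {
  // Deltas win over the (slightly stale) snapshot.
  return deltas.get(key) ?? snapshot.get(key);
}

setInterval(refreshSnapshot, 5 * 60 * 1000); // pull the latest CDB every few minutes
```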
by rektide on 5/11/22, 5:44 PM
Most data formats are thick formats that pack data into a single file. Part of the effort in switching to git would be a shift toward unpacking our data, to really make use of the file system to store fine-grained pieces of data.
It's been around for a while, but Irmin[1] (written in OCaml) is a decent-enough almost-example of these kinds of practices. It lacks the version control aspect, but 9p is certainly another inspiration, as it encouraged state of all things to be held & stored in fine-grained files. Git, I think, is a superpower, but just as much: having data which can be scripted, which speaks the lingua franca of computing, is also a superpower.
[1] https://irmin.org/ https://news.ycombinator.com/item?id=8053687 (147 points, 8 years ago, 25 comments)
by Joel_Mckay on 5/12/22, 1:50 AM
YMMV, I just discovered my favorite game on my phone was intended for cats. ;-)
Cheers, J
by adam_arthur on 5/11/22, 10:50 PM
Cloudflare KV can store most of what you need in JSON form, while DurableObjects let you model updates with transactional guarantees.
My app is particularly read heavy though, and backing data is mostly static (but gets updated daily).
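Roughly what that division of labor looks like in a Worker (the CONFIG and COUNTER binding names are made up):

```typescript
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method === "GET") {
      // Reads: KV is replicated to every PoP, eventually consistent.
      const value = await env.CONFIG.get("settings", "json");
      return Response.json(value);
    }
    // Writes: route to a single Durable Object instance for transactional updates.
    const id = env.COUNTER.idFromName("settings");
    return env.COUNTER.get(id).fetch(request);
  },
};

export class Counter {
  constructor(private state: DurableObjectState) {}
  async fetch(request: Request): Promise<Response> {
    // Storage operations within one event are effectively transactional.
    const n = ((await this.state.storage.get<number>("writes")) ?? 0) + 1;
    await this.state.storage.put("writes", n);
    return Response.json({ writes: n });
  }
}

interface Env {
  CONFIG: KVNamespace;
  COUNTER: DurableObjectNamespace;
}
```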
Honestly, after using Cloudflare I feel like they will easily become the go-to cloud for building small/quick apps. Everything is integrated much better than AWS, and it's way more user friendly from a docs and dev-experience perspective. Also, their dev velocity on new features is pretty insane.
Honestly, I didn't think that much of them until I started digging into these things.
Edit: And just today their S3 competitor entered open beta https://blog.cloudflare.com/r2-open-beta/
by efitz on 5/11/22, 7:24 PM
In my application, I had a central worker process that would ingest state updates and would periodically serialize the data to a MySQL database file, adding indexes and so forth and then uploading a versioned file to S3.
My Lambda workers would check for updates to the database, downloading the latest version to the local temp directory if there was not a local copy or if the local copy was out of date.
Then the work of checking state was just a database query.
You can tune timings etc to whatever your app can tolerate.
In my case the problem was fairly easy since state updates only occurred centrally; I could publish and pull updates at my leisure.
If I had needed distributed state updates I would have just made the change locally without bumping version, and then send a message (SNS or SQS) to the central state maintainer for commit and let the publication process handle versioning and distribution.
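The download-and-cache part of that pattern might look something like this; for illustration it assumes the versioned file is SQLite (swap in whatever embedded reader matches your file format), and the bucket/key names are placeholders:

```typescript
import { S3Client, HeadObjectCommand, GetObjectCommand } from "@aws-sdk/client-s3";
import { writeFile } from "fs/promises";
import Database from "better-sqlite3"; // illustrative choice of local reader

const s3 = new S3Client({});
const BUCKET = "my-state-bucket"; // placeholder
const KEY = "state/current.db";   // placeholder
let cachedEtag: string | undefined; // survives across warm invocations

async function ensureLocalCopy(): Promise<string> {
  const head = await s3.send(new HeadObjectCommand({ Bucket: BUCKET, Key: KEY }));
  if (head.ETag !== cachedEtag) {
    // Cold start or stale copy: pull the latest versioned file into /tmp.
    const obj = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: KEY }));
    await writeFile("/tmp/state.db", Buffer.from(await obj.Body!.transformToByteArray()));
    cachedEtag = head.ETag;
  }
  return "/tmp/state.db";
}

export const handler = async (event: { id: string }) => {
  const db = new Database(await ensureLocalCopy(), { readonly: true });
  try {
    // Checking state is now just a local query against the downloaded file.
    return db.prepare("SELECT * FROM state WHERE id = ?").get(event.id);
  } finally {
    db.close();
  }
};
```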
by ccouzens on 5/11/22, 9:52 PM
I can share a blog post about this if there is interest.
It gives us very good performance (p95 under 1ms) as the function doesn't need to call an external service.
by tra3 on 5/11/22, 5:53 PM
by jFriedensreich on 5/11/22, 9:45 PM
cloudflare kv store is great if the supported write pattern fits
if you need something with more consistency between pops, durable objects should be on your radar
i also found that cloudant/couchdb is a perfect fit for a lot of use cases with heavy caching in the cf worker. it's also possible to have multi-master replication with each couchdb cluster close to its local users, so you don't have to wait for writes to reach a single master on the other side of the world
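roughly, wiring that up is just a replication document per direction in each cluster's _replicator database, something like this (hosts, database name, and credentials are placeholders):

```typescript
// Tell each CouchDB cluster to continuously pull from the other, so writes
// land locally and converge via multi-master replication.
const authHeader = "Basic " + Buffer.from("admin:password").toString("base64"); // placeholder

async function linkClusters(local: string, remote: string) {
  await fetch(`${local}/_replicator`, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: authHeader },
    body: JSON.stringify({
      source: `${remote}/app`, // remote copy of the "app" database
      target: `${local}/app`,  // local copy
      continuous: true,        // keep replicating as new writes arrive
    }),
  });
}

// Run once per direction (hosts are placeholders).
await linkClusters("https://couch-eu.example.com", "https://couch-us.example.com");
await linkClusters("https://couch-us.example.com", "https://couch-eu.example.com");
```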
by Elof on 5/11/22, 7:57 PM
by marzoeva on 5/11/22, 11:53 PM
by innerzeal on 5/12/22, 7:45 AM
by asdf1asdf on 5/11/22, 5:21 PM
Now on to developing the actual application that will host/serve your data to said cache layer.
If you learn basic application architecture concepts, you won't be fooled by salespeople's lies again.
by rad_gruchalski on 5/11/22, 10:51 PM
by weatherlight on 5/11/22, 10:44 PM