from Hacker News

Citus 12: Schema-based sharding for PostgreSQL

by mulander on 7/18/23, 6:22 PM with 45 comments

  • by mslot on 7/18/23, 7:32 PM

    Nice to see this on HN :)

    The high-level is: You enable a setting and every CREATE SCHEMA creates a new shard. All the tables in the schema will be co-located so you can have efficient joins & foreign keys between the tables.

    On top of that, you can also have reference tables that are replicated to all nodes, again for fast joins & foreign keys with all schemas.

    Everything else is about making every PostgreSQL feature work as seamlessly as if there was no sharding. You can still do things like transactions across schemas, create and use custom types, access controls, work with other extensions, use procedures, etc.

  • by metadat on 7/18/23, 7:46 PM

    Does Oracle support anything like this? Or any other DBMS in widespread use, for that matter?

    The promise behind this approach to DB sharding has great potential. Simultaneously impressive, novel, and badass.

    I wish this had been available ten years ago at a few of my startups!

  • by pickledish on 7/18/23, 7:21 PM

    Hm, question for people a bit more familiar with Postgres -- what is meant by "schema" here?

    My definition is "the columns and column types of a table", but, that doesn't seem to make sense with what they're talking about here ("large" and "small" schemas probably aren't referring to wide and narrow tables for example, and I don't see how sharding by my definition of "schema" could even make sense anyways)

  • by skunkworker on 7/18/23, 10:26 PM

    Having used schema based sharding in postgres before, I hope citus has a way around connection pooling and pgbouncer, as if you use pgbouncer transaction pooling your search_path could disappear at any time.
  • by stevefan1999 on 7/18/23, 11:33 PM

    What happens if one node lost their shards due to external event? (e.g. Disk corruption, physcially destroyed like OVH Cloud) I do understand we still have to actively backup but I rather not serve any data than serving wrong data at the time of severe outage. Also I want to see any forward error correction code (FECC) would be implemented in Citus so we can do this on the fly rather than relying on RAID, e.g. RAID10, RAIDZ1, RAIDZ2
  • by asah on 7/18/23, 7:40 PM

    love citus! this is useful.

    one gotcha: schemas are a weird old thing in SQL that are kinda the worst of all worlds, basically more like prefixing your SQL object names than a real level of indirection. Schema objects can't be manipulated as a batch, they don't provide the isolation / access control of databases, and can't be manipulated in DML and require stored procedures to manipulate.

  • by binwiederhier on 7/20/23, 4:05 AM

    It is worth noting that Citus was acquired by Microsoft a few years ago (not a secret, but may not be obvious to everyone), and they have since shifted heavily towards the Azure side of the world.

    Back when it happened we tried to buy their product and they were not sure if the Citus standalone product was even going to exist, and they refused to demo it even. Odd timing possibly, but it's a data point.

    On the positive side of things, the shard rebalancing was not open source back then iirc, which made the open source version pretty useless. Now it seems to be open source: https://www.citusdata.com/product/comparison -- pretty cool.

    I'd still be careful to bank on it as a Citus only customer or open source user.

  • by mariocesar on 7/18/23, 8:48 PM

    I'm having trouble getting CitusDB to work with RDS, even though I'd really like to use it with AWS. Whenever I try to research how to make it happen, I get stuck in a lot of challenges and end up concluding that it might not be possible. While I could use Azure instead, I'm hesitant because I have a lot of resources and infrastructure in AWS and it wouldn't make sense to move the database layer. The idea of having to do DBA and Ops work on my own EC2 instances is overwhelming. I would appreciate any guidance on how to use RDS with Citus on their documentation. It's concerning that Citus might be withholding information about the possibility of using RDS, given that Microsoft is its partner.
  • by __s on 7/18/23, 10:55 PM

    What's advantage over having tenant id as distribution column? Seems like you make schema name the distribution column. Maybe gross setups where same name function definition varies between schemas (been there done that, don't want to do it again)

    Seems like article only offers ease of use. Guess I've never used microservices enough to consider that use case

    Couldn't the microservice case be handled by having distributed tables with no distribution column? ie today I'd create a distribution column & fill it with the same value on every row

    Can one have a reference schema which can be efficiently used alongside every other schema? Guess that's public schema with create_reference_table/create_distributed_function

  • by potamic on 7/19/23, 6:30 PM

    Doesn't sharding imply horizontal partitioning? Some of their examples show vertical partitioning use cases. Not sure how I feel about that. I can't imagine why you would do microservices and host their databases as schemas on the same instance.