by mccanne on 5/17/22, 2:19 PM with 46 comments
by simonw on 5/17/22, 9:52 PM
I've been experimenting with this approach against SQLite for a few years now, and I really like it.
My sqlite-utils package does exactly this. Try running this on the command line:
brew install sqlite-utils
echo '[
{"id": 1, "name": "Cleo"},
{"id": 2, "name": "Azy", "age": 1.5}
]' | sqlite-utils insert /tmp/demo.db creatures - --pk id
sqlite-utils schema /tmp/demo.db
It outputs the generated schema:

  CREATE TABLE [creatures] (
     [id] INTEGER PRIMARY KEY,
     [name] TEXT,
     [age] FLOAT
  );
When you insert more data you can use the --alter flag to have it automatically create any missing columns. Full documentation here: https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...
It's also available as a Python library: https://sqlite-utils.datasette.io/en/stable/python-api.html
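The Python library offers the same flow in code. A minimal sketch using sqlite_utils (file path and table name taken from the CLI example above; alter=True mirrors the CLI's --alter flag):

  import sqlite_utils

  # Open (or create) the database file.
  db = sqlite_utils.Database("/tmp/demo.db")

  # Insert rows; the table and its schema are inferred on the fly.
  # alter=True adds columns automatically when later inserts carry
  # keys the table hasn't seen yet.
  db["creatures"].insert_all(
      [
          {"id": 1, "name": "Cleo"},
          {"id": 2, "name": "Azy", "age": 1.5},
      ],
      pk="id",
      alter=True,
  )

  # Prints the generated CREATE TABLE statement.
  print(db["creatures"].schema)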
by mamcx on 5/17/22, 4:07 PM
Despite the claims, SQL is NOT "schema-fixed".
You can 100% create new schemas, alter them and modify them.
What actually happens is that if you have a CENTRAL repository of data (aka a "source of truth"), then you bet you wanna "freeze" your schemas (because it is like an API, where you need to fulfill contracts).
--
SQL has limitations in its lack of composability, and the biggest reason "NoSQL" works is this: a JSON value is composable. A "stringy" SQL statement is not. If SQL were really built around "relations, tuples" like (stealing from my project, TablaM):
[Customer id:i32, name:Str; 1, "Jhon"]
then developers would have less reason to go elsewhere.
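(A rough Python illustration of the composability point, mine rather than the comment's: a JSON value nests inside another value directly, while composing SQL means splicing strings.)

  import json

  # Composable: one value embeds in another as an ordinary value.
  customer = {"id": 1, "name": "Jhon"}
  order = {"order_id": 99, "customer": customer}
  print(json.dumps(order))
  # {"order_id": 99, "customer": {"id": 1, "name": "Jhon"}}

  # Not composable: a "stringy" subquery is spliced in by hand,
  # and nothing checks that the result is still well-formed SQL.
  inner = "SELECT id FROM customer WHERE name = 'Jhon'"
  outer = f"SELECT * FROM orders WHERE customer_id IN ({inner})"
  print(outer)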
by CharlesW on 5/17/22, 4:49 PM
Instead of those words I'd suggest something like "schema on write" vs. "schema on read", or "persisted structured" vs. "persisted unstructured". "Document" vs. "relational" doesn't quite capture it, since unstructured data can have late-binding relations applied at read time, and structured data doesn't have to be relational.
And of course, modern relational databases can store unstructured data as easily as structured data.
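(A quick sketch of that last point, assuming a SQLite build with the JSON1 functions, which most modern builds ship: raw documents live in an ordinary relational table and structure is applied at read time.)

  import sqlite3

  con = sqlite3.connect(":memory:")
  con.execute("CREATE TABLE docs (body TEXT)")  # schema-less payload column
  con.execute("INSERT INTO docs VALUES (?)", ('{"id": 1, "name": "Cleo"}',))

  # "Schema on read": structure is imposed by the query, not the table.
  row = con.execute("SELECT json_extract(body, '$.name') FROM docs").fetchone()
  print(row[0])  # Cleo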
by anentropic on 5/17/22, 6:21 PM
Eventually we get to the meat:
> For example, the JSON value
{"s":"foo","a":[1,"bar"]}
> would traditionally be called “schema-less” and in fact is said to have the vague type “object” in the world of JavaScript or “dict” in the world of Python. However, the super-structured interpretation of this value’s type is instead:
> type record with field s of type string and field a of type array of type union of types integer and string
> We call the former style of typing a “shallow” type system and the latter style of typing a “deep” type system. The hierarchy of a shallow-typed value must be traversed to determine its structure whereas the structure of a deeply-typed value is determined directly from its type.
This is a bit confusing, since JSON data commonly has an implicit schema, or "deep type system" as this post calls it, and if you consume data in any statically-typed language you will materialise the implicit "deep" types in your host language.
So it seems that ZSON is sort of like a TypeScript-ified version of JSON, where the implicit types are made explicit.
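(To make that concrete, here is my own sketch, not the article's code, of how a "deep" type can be inferred mechanically from any JSON value:)

  import json

  def deep_type(value):
      """Infer a ZSON-style "deep" type description for a decoded JSON value."""
      if isinstance(value, bool):  # check before int: bool subclasses int
          return "bool"
      if isinstance(value, int):
          return "integer"
      if isinstance(value, float):
          return "float"
      if isinstance(value, str):
          return "string"
      if value is None:
          return "null"
      if isinstance(value, list):
          elems = sorted({deep_type(v) for v in value})
          inner = elems[0] if len(elems) == 1 else "union[" + ",".join(elems) + "]"
          return "array[" + inner + "]"
      if isinstance(value, dict):
          fields = ",".join(k + ":" + deep_type(v) for k, v in value.items())
          return "record{" + fields + "}"
      raise TypeError(value)

  print(deep_type(json.loads('{"s":"foo","a":[1,"bar"]}')))
  # record{s:string,a:array[union[integer,string]]}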
It seems the point is not to have an external schema that documents must comply with, so I guess at the end of the day it has a similar aim to other "self-describing" message formats like https://amzn.github.io/ion-docs/ ? i.e. each message has its own schema
So the interesting part is perhaps the new data tools to work with large collections of self-describing messages?
by troelsSteegin on 5/17/22, 2:59 PM
[0] https://zed.brimdata.io/docs/language/overview/ [1] https://docs.confluent.io/platform/current/schema-registry/i...
by kmerroll on 5/17/22, 8:27 PM
Suggest looking into JSON-LD, which was intended to solve many of these type and schema validation use-cases.
by natemcintosh on 5/17/22, 7:18 PM
And it seems like the newer "zed lake" format is like a large blob managed by a server. Can you also convert data to and from the file formats and the lake format? What is the lake's main use case?
by bthomas on 5/17/22, 4:45 PM
> EdgeDB is essentially a new data silo whose type system cannot be used to serialize data external to the system.
I think this implies that serializing external data to ZSON is easier than writing an INSERT into EdgeDB, but I'm not sure why that would be.
by ccleve on 5/17/22, 3:12 PM
Ok, fine. But I'm not sure how this helps if you have six different systems with six different definitions of a customer, and more importantly, different relationships between customers and other objects like orders or transactions or locations or communications.
I don't see their approach as ground-breaking, but it is definitely worthy of discussion.
by feoren on 5/17/22, 10:24 PM
Anyway this article is crap and gets everything wrong, just like all of you do. Whatever, nothing to see here I guess.