by ctenb on 4/26/25, 5:52 PM with 61 comments
by ilyagr on 4/27/25, 3:25 AM
I think it can encode anything except for something matching the regex `(\t+\|)+` at the end of cells (*Update:* Maybe `\n?(\t+\|)+`, but that doesn't change my point much) including newlines and even newlines followed by `\` (with the newline extension, of course).
For a cell containing `cell<newline>\`, you'd have:
|cell<tab>|
\\<tab >|
(where `<tab >` represents a single tab character regardless of the number of spaces)
Moreover, if you really needed it, you could add another extension to specify tabs or pipes at the end of cells. For a POC, two cells with contents `a<tab>|` and `b<tab>|` could be represented as:
|a<tab ><tab>|b
~tab pipe<tab>|tab pipe
(with literal words "tab" and "pipe"). Something nicer might also be possible.
*Update:* Though, if the focus is on humans reading it, it might also make sense to allow a single row of the table to wrap and span multiple lines in the file, perhaps as another extension.
by aidenn0 on 4/27/25, 2:55 AM
by karmakaze on 4/27/25, 12:37 AM
We have YAML, but it's too complex. JSON is rather verbose with all the repeated keys and quoting, XML even more so. I'd also like to see a 'schema tree' corresponding to a header row in TSV/CSV. I'd even be fine with a binary format with standard decoding to see the plain-text contents. Something for XML like what MessagePack does for JSON would work, since we already have schema specifications.
by TheTaytay on 4/27/25, 5:38 AM
It has some nice properties: 1) it’s many fewer tokens than JSON. 2) it’s easier to edit prompts and examples in something like Google sheets, where the default format of a copied group of cells is in TSV. 3) have I mentioned how many fewer tokens it is? It’s faster, cheaper, and less brittle than a format that requires the redefinition of every column name for every row.
Obviously this breaks down for nested object hierarchies or other data that is not easily represented as a 2d table, but otherwise we’ve been quite happy. I think this format solves some other things I’ve wanted, including header comments, inline comments, better alignment, and markdown support.
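To make the size argument concrete, here is a rough sketch that serializes the same flat records as JSON and as TSV and compares lengths. The sample records and the `to_tsv` helper are made up for illustration, and character count is only a crude stand-in for token count, which depends on the tokenizer.

```python
import csv
import io
import json

# Hypothetical sample data, not from the thread.
records = [
    {"name": "Ada", "lang": "Python", "year": 1991},
    {"name": "Bob", "lang": "Rust", "year": 2015},
    {"name": "Cy", "lang": "Go", "year": 2009},
]


def to_tsv(rows):
    """Write the rows as TSV with one header row naming the columns once."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(rows[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


as_json = json.dumps(records)
as_tsv = to_tsv(records)

# JSON repeats every key on every row; TSV names each column once.
print(len(as_json), len(as_tsv))
```

The gap grows with the number of rows, since the per-row cost of the repeated keys and quoting is what TSV avoids.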
by Rhapso on 4/27/25, 3:48 AM
by Hackbraten on 4/26/25, 6:01 PM
by DrillShopper on 4/26/25, 11:39 PM
ASCII (and through it, Unicode) has these values specifically for this purpose.
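Assuming the values meant here are the ASCII separator control characters (unit separator 0x1F and record separator 0x1E), a minimal sketch of using them as delimiters could look like this. The `encode` and `decode` names are just for illustration.

```python
UNIT_SEP = "\x1f"    # ASCII US: separates fields within a record
RECORD_SEP = "\x1e"  # ASCII RS: separates records


def encode(rows):
    """Join cells with US and rows with RS; no quoting or escaping needed
    as long as the data itself never contains these two characters."""
    return RECORD_SEP.join(UNIT_SEP.join(row) for row in rows)


def decode(text):
    """Inverse of encode for non-empty input."""
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


rows = [["a\tb", "c|d"], ["x\ny", "z"]]
assert decode(encode(rows)) == rows
```

Tabs, pipes, and newlines inside cells survive untouched, though the result is not pleasant to read or type in a plain text editor.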
by helix278 on 4/26/25, 7:14 PM
by montroser on 4/26/25, 10:29 PM
> A cell starts with | and ends with one or more tabs.
|one\t|two|three
How many cells is this? Seems like just one, with garbage at the end, since there are no closing tabs after the first cell? Should this line count as a valid row?
> A line that starts with a cell is a row. Any other lines are ignored.
Well, I guess it counts. Either way, how should one encode a value containing a tab followed by a pipe?
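To make the ambiguity concrete, here is a minimal sketch of one possible reading of the two quoted rules. It assumes a cell boundary is a run of tabs followed by `|` and that the last cell may omit its trailing tabs. That is a guess, not the spec's reference behaviour, and `parse_row` and `parse_table` are hypothetical names.

```python
import re

# Assumption: one or more tabs followed by a pipe ends one cell and starts the next.
CELL_BOUNDARY = re.compile(r"\t+\|")


def parse_row(line):
    """Return the list of cells, or None if the line is not a row."""
    if not line.startswith("|"):
        return None  # "Any other lines are ignored."
    body = line[1:].rstrip("\t")  # drop the leading pipe and closing tabs
    return CELL_BOUNDARY.split(body)


def parse_table(text):
    """Collect every line that parses as a row; skip everything else."""
    rows = []
    for line in text.splitlines():
        row = parse_row(line)
        if row is not None:
            rows.append(row)
    return rows


print(parse_row("|one\t|two|three"))  # two cells under this reading
print(parse_row("|a\t|\t|b"))         # intended cells: "a<tab>|" and "b"
```

Under this reading the example line yields two cells, `one` and `two|three`, and a value ending in a tab followed by a pipe cannot round-trip: the second call returns three cells instead of two.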
by imtringued on 4/27/25, 10:56 AM
TPSV solves none of that and makes things worse.
by Hashex129542 on 4/26/25, 10:32 PM
by stevage on 4/26/25, 11:23 PM
Also it doesn't seem to say anything about the header row?
by CJefferson on 4/27/25, 12:20 AM
It instinctively feels horrible, but it’s easy to create and parse in basically every language, easy to fully specify, recovers well from one broken line in large datasets, chops up and concatenates easily.
by bvrmn on 4/26/25, 10:43 PM
by AstroJetson on 4/26/25, 10:48 PM
Ummm, how do you figure out which row has too many cells? Can't all the rows before this one have too few cells?