Why isn’t there a decent file format for tabular data?

Tabular data is everywhere. I support reading and writing tabular data in various formats in all 3 of my software applications. It is an important part of my data transformation software. But all th…

Read in full here:

This thread was posted by one of our members via one of our news source trackers.

Not sure if another file format is a good idea.

CSV/TSV seems fine? And what on earth is up with this quote:

One quote in the wrong place and the file is invalid. It is difficult to parse efficiently using multiple cores, due to the quoting (you can’t start parsing from part way through a file).

Uh, yeah, if your file is broken then your file is broken; not much you can do about that other than fix your file. Same thing in, well, any format.

And you can parse from multiple cores: split on newlines (assuming your fields have escaped newlines, as they should), which is basically what you would need to do with any format.
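Something like this in Python, say (a rough sketch of my own, not from the article; it assumes embedded newlines really are escaped so every physical line starts a row, and it reads the whole file up front to keep things short):

```python
import csv
import io
from concurrent.futures import ProcessPoolExecutor

def parse_chunk(chunk: str) -> list:
    # Each chunk is made of whole lines, so a plain csv.reader can handle it.
    return list(csv.reader(io.StringIO(chunk)))

def parallel_parse(path: str, workers: int = 4) -> list:
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines(keepends=True)
    # Carve the lines into one contiguous chunk per worker.
    size = max(1, len(lines) // workers)
    chunks = ["".join(lines[i:i + size]) for i in range(0, len(lines), size)]
    rows = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for parsed in pool.map(parse_chunk, chunks):
            rows.extend(parsed)
    return rows
```

A real implementation would split on byte offsets and seek forward to the next newline rather than reading everything first, but the point stands: line boundaries give you safe split points.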

Don’t even get me started on Excel’s proprietary, ghastly binary format.

Modern Excel (.xlsx) is zipped XML, not some ghastly binary format (just a ghastly textual format, lol).

Encoding is always UTF-8

Sure CSV can be.

Values stored in row-major order (row 1, row 2, etc.)

As CSV already is.

Columns are separated by \u001F (ASCII unit separator)

Sure, make a CSV with that as the separator (almost every CSV parser I’ve seen lets you specify whatever separator character you want). Call it SSV or so?
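E.g. with Python’s stock csv module (just a throwaway sketch of mine; the file name and data are made up):

```python
import csv

rows = [["name", "notes"], ["Alice", "likes, commas, a lot"]]

# Write and read back using the ASCII unit separator (\x1f) as the delimiter.
with open("data.ssv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter="\x1f").writerows(rows)

with open("data.ssv", newline="", encoding="utf-8") as f:
    print(list(csv.reader(f, delimiter="\x1f")))
```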

Rows are separated by \u001E (ASCII record separator)

Loving all these unreadable character codes yet complaining about binary files, eh? Lol. But sure, you can do that with CSV as well; fewer libraries support it, but some still let you specify the row separator.
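Or skip the CSV library entirely; with no quoting involved, reading it is just two splits (again a rough sketch of mine, assuming the data genuinely never contains \u001E or \u001F):

```python
def read_rs_us(path: str) -> list:
    # \x1e separates rows (records), \x1f separates columns (fields).
    with open(path, encoding="utf-8") as f:
        text = f.read()
    records = text.split("\x1e")
    if records and records[-1] == "":
        records.pop()  # tolerate a trailing record separator
    return [record.split("\x1f") for record in records]
```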

No escaping. If you want to put \u001F or \u001E in your data – tough you can’t. Use a different format.

NO! BAD! A storage format that can’t handle anything and everything that might appear in it is broken, just outright broken.

A quick fix for this would be to make the record delimiter a \u001E character followed by an LF character. Any LF that comes immediately after an \u001E would be ignored when parsing. Any LF not immediately after an \u001E is part of the data. I don’t know about other editors, but it is easy to view and edit in Notepad++.

Oh yay, now you need to hold more state!
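For what it’s worth, that rule is simple enough if you parse the whole blob in one go (my sketch, not the poster’s); a streaming parser is where you’d have to carry the “was the previous character \u001E?” state:

```python
def parse_rs_lf(text: str) -> list:
    records = text.split("\x1e")
    if records and records[-1] in ("", "\n"):
        records.pop()  # trailing record separator, optionally followed by LF
    rows = []
    for i, record in enumerate(records):
        # An LF that comes immediately after a record separator is cosmetic
        # and gets dropped; any other LF stays in the data.
        if i > 0 and record.startswith("\n"):
            record = record[1:]
        rows.append(record.split("\x1f"))
    return rows
```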
