Does anybody have experience or knowledge about versioning of file formats? With XML, we have been able to auto-sense the file version from the DTD header and use the correct reader for it. We are looking for the same functionality in file formats that are less heavyweight. What we seem to have found so far:
json has a schema, which would allow versioning (as far as we understand)
sqlite has a schema, which would allow versioning (as far as we understand)
csv does not have a schema!!!! (Maybe there is some library or setup that enforces similar functionality; e.g. PTV input files are csv, and they have some standardized comment lines that give information about version, language, etc.)
Does anybody have experience with any of these? If so, could you point us to relevant material? In particular, with sqlite it seems like the user_version pragma might be the thing to use; can anybody confirm that?
Thanks a lot ...
json has a schema
Indeed, although I have yet to see a real-world use of a JSON schema. Basically all JSON files I encounter come without one. In the end, it just boils down to having an $id and a $schema field in the top-level JSON object. (I was wrong: $id and $schema are not part of the JSON data, but only of the JSON schema.) This could be generalized for MATSim to have a type or version field at the top level without an official schema, in case that's too much overhead. So, although a schema description is available for JSON data, the data object itself does not contain any reference to its schema, unlike the DTD/schema definition inside XML root tags.
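To illustrate, a minimal sketch of such a convention in Python (the field names "type" and "version" are an assumption here, not an official MATSim convention):

```python
import json

# Hypothetical convention: "type" and "version" fields at the top level.
raw = '{"type": "network", "version": 2, "nodes": [], "links": []}'

data = json.loads(raw)

# Branch on the embedded version information, analogous to
# dispatching on a DTD header in XML.
if data.get("type") == "network" and data.get("version") == 2:
    parsed = {"nodes": data["nodes"], "links": data["links"]}
else:
    raise ValueError(f"unsupported format: {data.get('type')} v{data.get('version')}")
```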
Also, for MATSim’s data sizes, I see more and more people starting to adopt NDJSON (newline-delimited JSON). Most JSON parsers load a full JSON object at once (like a DOM parser for XML), which is impractical for large data sets. So instead, people use files which contain a large number of small JSON objects, each object on one line, so the file can be parsed line by line. Where would the schema information go in such a case? In each object? That would be quite some overhead.
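For illustration, a sketch of such line-by-line parsing (the event field names are made up, just to mimic a MATSim events file):

```python
import io
import json

# Three small objects, one per line (NDJSON). In an events file this
# would be one event per line (field names here are hypothetical).
ndjson = io.StringIO(
    '{"time": 1, "type": "actend"}\n'
    '{"time": 2, "type": "departure"}\n'
    '{"time": 3, "type": "arrival"}\n'
)

# Parse line by line instead of loading the whole file at once,
# so memory usage stays constant regardless of file size.
events = [json.loads(line) for line in ndjson if line.strip()]
```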
sqlite has a schema
sqlite is a simple SQL database file. Each SQL database has a schema, namely its tables, which must be well defined. Often one has a special table named system_data, version, metadata, or something generic like that, which contains a few key-value pairs, including fields for the data type and the version of the data. Example: the MBTiles specification; look for the description of the metadata table.
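A minimal sketch of both approaches with Python's built-in sqlite3 module (the table and key names are just an illustrative convention, not an official format; the user_version pragma is the one asked about above):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Convention sketch: a key-value metadata table, similar to the one
# described in the MBTiles specification.
con.execute("CREATE TABLE metadata (name TEXT PRIMARY KEY, value TEXT)")
con.executemany(
    "INSERT INTO metadata VALUES (?, ?)",
    [("datatype", "network"), ("version", "2")],
)

# Alternatively (or additionally), SQLite's built-in user_version pragma
# stores a single application-defined integer in the database header.
con.execute("PRAGMA user_version = 2")

meta = dict(con.execute("SELECT name, value FROM metadata"))
(user_version,) = con.execute("PRAGMA user_version").fetchone()
```

A reader would first query the metadata table (or the pragma) and then branch to a version-specific parser, just like with the DTD header in XML.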
Parquet files always have a strict schema.
True, similar to how sqlite has a schema.
I think the main point is not only that it has a well-defined schema, but that the data type and schema-version are part of the actual file, as they are in XML.
In XML, if the file has a dtd or schema set, I can start with a very generic parser and first read the dtd/schema version, and then branch based on that information to parse the rest of the file.
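That generic first pass can be as simple as peeking at the file header before committing to a reader; a sketch (the DTD URL follows the MATSim naming scheme, and the reader names are placeholders):

```python
import re

# Peek at the file header, extract the DTD system identifier,
# then dispatch to a version-specific reader.
header = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<!DOCTYPE network SYSTEM "http://www.matsim.org/files/dtd/network_v2.dtd">\n'
    '<network>...</network>'
)

match = re.search(r'SYSTEM\s+"[^"]*/(\w+)_v(\d+)\.dtd"', header)
doctype, version = match.group(1), int(match.group(2))

readers = {
    ("network", 1): "parse_network_v1",  # placeholders for real reader functions
    ("network", 2): "parse_network_v2",
}
reader = readers[(doctype, version)]
```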
Something similar could be done with JSON, by including special fields in the root object, and also with sqlite (with a metadata table), but it would just be a MATSim convention. But I don’t see how this could be done with parquet: how would you figure out whether a file is in format network_v1 or in format network_v2 if you just have the file? Or can you also store multiple tables in a parquet file and thus have a metadata section? The same problem arises with other serialization libraries, where one often already has to know which exact schema the data uses in order to parse it correctly.
So it often boils down to having some metadata in the file which provides this information, if it is not specified externally (e.g. by filename convention).
If going towards JSON: For most of MATSim’s data sizes, I would really suggest looking into NDJSON, as it would make parsing so much easier and produce much less memory-overhead. In such a case, one could specify that the first object is always a metadata object, at least containing a datatype and version information, along with additional data depending on the datatype (e.g. one could store nodeCount and linkCount for a network file, so one knows how many node objects one has to read in before the links start).
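A sketch of that metadata-first convention (all field names here, including nodeCount and linkCount, are hypothetical, following the suggestion above):

```python
import io
import json

# First line is a metadata object, followed by nodeCount node objects,
# then linkCount link objects.
f = io.StringIO(
    '{"datatype": "network", "version": 1, "nodeCount": 2, "linkCount": 1}\n'
    '{"id": "n1", "x": 0.0, "y": 0.0}\n'
    '{"id": "n2", "x": 1.0, "y": 0.0}\n'
    '{"id": "l1", "from": "n1", "to": "n2"}\n'
)

meta = json.loads(f.readline())
assert meta["datatype"] == "network"  # dispatch point for other data types/versions

nodes = [json.loads(f.readline()) for _ in range(meta["nodeCount"])]
links = [json.loads(f.readline()) for _ in range(meta["linkCount"])]
```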
Yes!! NDJSON is much better for our large-file use cases. I’ve already used it quite a bit for visualization tasks here, e.g. postprocessing an events file into a “trips.json” that can be read in and filtered by time of day.
But I haven’t been “versioning” that content at all; I have pre-existing knowledge of what I want in the file. Not great for long-term storage.
I really like the idea of playing with SQLITE for some of these tasks. A metadata table with version and schema info is easy to include, the resulting file is fast to filter, and it is implemented everywhere and optimized.
BUT… we lose the ability to grep and view in vim. NDJSON wins there.
Regarding SQLite: In my limited experience, the files seemed rather large. Not sure if SQLite uses any kind of compression internally, or if it could be enabled. But having all events uncompressed in a SQLite database, along with additional lookup tables/indices… that needs to be checked. Maybe this could be a good topic for the code sprint in February: come up with and test different alternative file formats?
Yes, that’s true (that SQLite files are big). That’s one nice thing about Parquet: you can enable very fast compression/decompression on the fly.
current intuition is to read events natively into python
consider writing events in protobuf (pbf)
coordinate this with the “compressed output” ticket
think about re-using the existing protobuf contrib