File format

The file format datadings uses is simple. A dataset is made up of six files:

  • .msgpack main data file

  • .msgpack.offsets sample start offset file

  • .msgpack.keys key file

  • .msgpack.key_hashes key hash file

  • .msgpack.filter Bloom filter file

  • .msgpack.md5 for integrity checks

Data file

Each sample is a key-value map with a string key that is unique for the dataset. In Python notation:

{"key": <unique key>, ... }

Each sample is represented by one msgpack message. The main dataset file has the extension .msgpack and contains a sequence of these messages:

<sample 1><sample 2><sample 3> ...

Note that no additional data is stored in this file. The msgpack format does not require the length of the message to be known for unpacking, so this single file is sufficient for sequential access.
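As a minimal sketch of such sequential access, the snippet below streams messages from a file-like object with the `msgpack` Python package. The helper name `iter_samples` and the in-memory test data are illustrative assumptions, not part of datadings.

```python
import io

import msgpack


def iter_samples(stream):
    """Yield samples from a binary stream of concatenated msgpack messages."""
    # Unpacker reads from the file-like object in chunks and detects
    # message boundaries itself, so no length prefix or index is needed.
    unpacker = msgpack.Unpacker(stream, raw=False)
    for sample in unpacker:
        yield sample


# Build a tiny in-memory stand-in for a .msgpack data file.
buffer = io.BytesIO()
for i in range(3):
    sample = {"key": f"sample {i}", "value": i}
    buffer.write(msgpack.packb(sample, use_bin_type=True))
buffer.seek(0)

keys = [sample["key"] for sample in iter_samples(buffer)]
```

Samples that contain numpy arrays would additionally need the msgpack-numpy decoding hook; this sketch only covers plain msgpack types.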

Arrays & complex numbers

msgpack on its own does not support densely packed arrays or complex numbers. While it may be sufficient to use lists for heterogeneous data types or few values, datadings uses msgpack-numpy to support storing arbitrary numpy arrays efficiently. This introduces a limitation on the keys that can be present in samples.

Reserved keys

The following keys are reserved by datadings for internal use and thus cannot be used in samples:

  • "key": used to uniquely identify samples in a dataset

  • "nd": used by msgpack-numpy for array decoding

  • "complex": used by msgpack-numpy for complex number decoding

Using these keys results in undefined behavior.
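A writer could guard against accidental use of reserved keys before packing a sample. The following is a hedged sketch: `check_sample` is a hypothetical helper, not a datadings function, and it only inspects the top level of the sample.

```python
# Keys that msgpack-numpy interprets specially during decoding.
RESERVED_FOR_DECODING = {"nd", "complex"}


def check_sample(sample):
    """Raise ValueError if a sample dict violates the reserved-key rules."""
    # "key" is reserved for the unique sample identifier, which is a string.
    if not isinstance(sample.get("key"), str):
        raise ValueError('sample needs a string "key" entry')
    clashes = RESERVED_FOR_DECODING & set(sample)
    if clashes:
        raise ValueError(f"reserved keys used in sample: {sorted(clashes)}")
    return sample
```

Nested maps are not checked here; a stricter variant could walk the whole sample recursively.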

Index

Datasets are indexed to enable fast sequential and random access. Previous versions of datadings created a monolithic index file that contained both keys and read offsets of samples. New-style indexes are made up of 4 separate files:

  1. .msgpack.offsets: uint64 start offsets for samples in the data file stored in network byte order, where offset[i] corresponds to the ith sample.

  2. .msgpack.keys: msgpacked list of keys.

  3. .msgpack.key_hashes: 8 byte salt, followed by 8 byte blake2s hashes of all keys. The salt is chosen to avoid hash collisions.

  4. .msgpack.filter: A Bloom filter for all keys in simplebloom format. It is set up to provide very low false-positive probabilities.
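To illustrate how the offsets file enables random access, here is a small sketch using only the standard library. The function names are hypothetical, and the code assumes that a sample ends where the next one starts (or at the end of the data file for the last sample).

```python
import struct


def decode_offsets(raw):
    """Decode the raw bytes of an offsets file into a tuple of ints."""
    if len(raw) % 8:
        raise ValueError("offsets data must be a multiple of 8 bytes")
    count = len(raw) // 8
    # ">" selects big-endian (network) byte order, "Q" is an unsigned 64-bit int.
    return struct.unpack(f">{count}Q", raw)


def sample_bytes(data, offsets, i):
    """Return the packed bytes of the i-th sample from the data file contents."""
    start = offsets[i]
    # Assumption: the sample ends at the next offset, or at the end of the data.
    end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
    return data[start:end]
```

For a real dataset one would seek to `offsets[i]` in the open data file instead of holding all of it in memory.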

The advantage of this new style of index is that it allows for fast and lazy loading of elements as they are required. For typical datasets the keys file is several times larger than both the offsets and the key hashes, and both are several times larger than the Bloom filter. To check whether the dataset contains a key, only the filter and key hashes are required. The larger keys file itself is only loaded whenever a method returns the sample keys. Thus, upon initialization, the reader only checks for the presence of index files and warns if they are missing.
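A key lookup that avoids the keys file might look roughly like the sketch below. It rests on several assumptions that the description above does not confirm: keys are UTF-8 encoded, the salt is passed as the `salt` parameter of `hashlib.blake2s`, and hashes are stored in sample order. The Bloom filter step is left out, and all function names are hypothetical.

```python
from hashlib import blake2s


def hash_key(key, salt):
    """Return an 8-byte salted blake2s digest of a key (assumed scheme)."""
    return blake2s(key.encode("utf-8"), digest_size=8, salt=salt).digest()


def build_key_hashes(keys, salt):
    """Build key-hashes file contents: 8-byte salt, then one hash per key."""
    return salt + b"".join(hash_key(key, salt) for key in keys)


def find_index(raw, key):
    """Return the position of key in the hash list, or None if absent."""
    salt, body = raw[:8], raw[8:]
    target = hash_key(key, salt)
    # Linear scan over fixed-size 8-byte records; fine for a sketch.
    for pos in range(0, len(body), 8):
        if body[pos:pos + 8] == target:
            return pos // 8
    return None
```

A real reader would first consult the Bloom filter to reject most absent keys cheaply, and would use a faster search structure than a linear scan.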

MD5 file

Finally, every dataset comes with a .msgpack.md5 file with hashes for the data and index files, so their integrity can be verified.
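Verification could be sketched as follows. The layout of the .msgpack.md5 file is not specified here, so this example assumes md5sum-style lines of the form `<hex digest>  <file name>` with names relative to the directory of the MD5 file; `md5_of_file` and `verify` are hypothetical helpers.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it whole."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(md5_path):
    """Return the names of listed files whose hash does not match."""
    root = os.path.dirname(md5_path)
    mismatched = []
    with open(md5_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            # Assumed md5sum-style layout: "<hex digest>  <file name>".
            expected, name = line.split(maxsplit=1)
            name = name.strip().lstrip("*")
            if md5_of_file(os.path.join(root, name)) != expected:
                mismatched.append(name)
    return mismatched
```

An empty result means every listed file matched its recorded hash.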

Limitations

Since msgpack is used, datadings inherits its limitations.

  • Maps and lists cannot have more than 2^32-1 entries.

  • Strings and binary data cannot be longer than 2^32-1 bytes.

  • Integers (signed or unsigned) are limited to 64 bits.

  • Floats are limited to single or double precision.

This means each dataset is limited to less than 2^32 samples (since the index uses a map) and around 2^64 bytes total file size (the largest possible byte offset is 2^64-1). The same applies to each individual sample regarding the number of keys present and its packed size.

Legacy index file

Previous versions of datadings used a different index arrangement. An index file with the .msgpack.index extension contained a map of key-offset pairs. In Python notation:

{"sample 1": 0, "sample 2": 1234, ... }

For every key in the dataset it gives the offset in bytes of the respective sample from the beginning of the file. Index entries are stored with offsets in ascending order.
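The relationship between the legacy map and the new-style keys and offsets data can be sketched as below. This is an illustration only, not the conversion routine datadings ships; `split_legacy_index` is a hypothetical name.

```python
import struct


def split_legacy_index(index):
    """Split a legacy key-to-offset map into a key list and packed offsets."""
    # Sort by offset so the result is ordered even if the map was not.
    pairs = sorted(index.items(), key=lambda item: item[1])
    keys = [key for key, _ in pairs]
    offsets = [offset for _, offset in pairs]
    # Pack offsets as big-endian (network byte order) unsigned 64-bit ints.
    return keys, struct.pack(f">{len(offsets)}Q", *offsets)


keys, packed = split_legacy_index({"sample 1": 0, "sample 2": 1234})
```

Here `keys` corresponds to the content of the keys file before msgpacking, and `packed` to the raw bytes of the offsets file.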