File format
The file format datadings uses is simple. A dataset is made up of six files:
.msgpack: main data file
.msgpack.offsets: sample start offset file
.msgpack.keys: key file
.msgpack.key_hashes: key hash file
.msgpack.filter: Bloom filter file
.msgpack.md5: MD5 hashes for integrity checks
Data file
Each sample is a key-value map with a string key that is unique for the dataset.
In Python notation:
{"key": <unique key>, ... }
Each sample is represented by one msgpack message.
The main dataset file has the extension .msgpack and contains a sequence of these messages:
<sample 1><sample 2><sample 3> ...
Note that no additional data is stored in this file. The msgpack format does not require the length of the message to be known for unpacking, so this single file is sufficient for sequential access.
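A minimal sketch of this layout, assuming the msgpack Python package is installed. It uses an in-memory buffer in place of a real .msgpack file, and the sample contents are made up for illustration. It shows that concatenated messages can be read back one by one without any stored lengths.

```python
import io

import msgpack

# Three toy samples; each one carries the mandatory unique "key".
samples = [
    {"key": "a", "label": 0},
    {"key": "b", "label": 1},
    {"key": "c", "label": 2},
]

# Writing: one msgpack message per sample, with nothing in between.
buffer = io.BytesIO()
for sample in samples:
    buffer.write(msgpack.packb(sample, use_bin_type=True))

# Reading: the Unpacker locates message boundaries on its own,
# so iterating over it yields one sample per message.
buffer.seek(0)
restored = list(msgpack.Unpacker(buffer, raw=False))
```

This is plain msgpack usage rather than the datadings reader itself. It only illustrates why the data file alone is enough for sequential access.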
Arrays & complex numbers
msgpack on its own does not support densely packed arrays or complex numbers. While it may be sufficient to use lists for heterogeneous data types or few values, datadings uses msgpack-numpy to support storing arbitrary numpy arrays efficiently. This introduces a limitation on the keys that can be present in samples.
Reserved keys
The following keys are reserved by datadings for internal use and thus cannot be used in samples:
"key": used to uniquely identify samples in a dataset
"nd": used by msgpack-numpy for array decoding
"complex": used by msgpack-numpy for complex number decoding
Using these keys results in undefined behavior.
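One way to guard against accidental use of reserved keys is to validate samples before writing them. The helper below is a hypothetical sketch and not part of datadings. It assumes samples are plain dicts and inspects only their top-level keys.

```python
# Keys that msgpack-numpy uses as markers during decoding.
RESERVED_BY_MSGPACK_NUMPY = {"nd", "complex"}


def check_sample(sample):
    """Return the sample unchanged, or raise ValueError on key misuse.

    Hypothetical helper: checks top-level keys only.
    """
    key = sample.get("key")
    if not isinstance(key, str):
        raise ValueError('sample needs a string stored under "key"')
    clashes = RESERVED_BY_MSGPACK_NUMPY.intersection(sample)
    if clashes:
        raise ValueError(f"reserved keys used: {sorted(clashes)}")
    return sample
```

For example, `check_sample({"key": "a", "image": b"..."})` passes, while a sample that contains an `"nd"` entry is rejected.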
Index
Datasets are indexed to enable fast sequential and random access. Previous versions of datadings created a monolithic index file that contained both keys and read offsets of samples. New-style indexes are made up of 4 separate files:
.msgpack.offsets: uint64 start offsets for samples in the data file, stored in network byte order, where offset[i] corresponds to the ith sample.
.msgpack.keys: msgpacked list of keys.
.msgpack.key_hashes: 8 byte salt, followed by 8 byte blake2s hashes of all keys. The salt is chosen to avoid hash collisions.
.msgpack.filter: a Bloom filter for all keys in simplebloom format. It is set up to provide very low false-positive probabilities.
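The offsets layout can be sketched with the standard struct module. This is an illustration under assumptions: the helper names are hypothetical, the "samples" are raw byte strings standing in for msgpack messages, and the end of the last sample is taken to be the end of the data (whether the real file stores an extra trailing offset is not stated here).

```python
import struct


def pack_offsets(offsets):
    """Encode offsets as consecutive big-endian (network order) uint64."""
    return struct.pack(f">{len(offsets)}Q", *offsets)


def unpack_offsets(raw):
    """Decode bytes produced by pack_offsets back into a list of ints."""
    count = len(raw) // 8
    return list(struct.unpack(f">{count}Q", raw))


def read_blob(data, offsets, i):
    """Slice the ith sample out of the data using its start offset."""
    end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
    return data[offsets[i]:end]


# Stand-in samples of different lengths, concatenated like a data file.
blobs = [b"first", b"second!", b"third..."]
data = b"".join(blobs)

# Record where each sample starts.
offsets = []
position = 0
for blob in blobs:
    offsets.append(position)
    position += len(blob)

raw = pack_offsets(offsets)
decoded = unpack_offsets(raw)
```

Because every entry has a fixed width of 8 bytes, the offset of sample i sits at byte 8*i of the offsets file, which is what makes random access cheap.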
The advantage of this new style of index is that it allows for fast and lazy loading of elements as they are required. For typical datasets the keys file is several times larger than both the offsets and the key hashes, and both are several times larger than the Bloom filter. To check whether the dataset contains a key, only the filter and key hashes are required. The larger keys file itself is only loaded whenever a method returns the sample keys. Thus, upon initialization the reader only checks for the presence of index files and warns if they are missing.
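The key-hash lookup can be sketched with hashlib alone. Several details here are assumptions rather than documented behavior: that the salt is passed as the blake2s salt parameter, that keys are UTF-8 encoded before hashing, that digests are stored in sample order, and that a salt is found by retrying random values. The function names are hypothetical, and the Bloom filter step is left out.

```python
import hashlib
import os


def hash_key(key, salt):
    """8-byte salted blake2s digest of a string key."""
    return hashlib.blake2s(key.encode("utf-8"), digest_size=8, salt=salt).digest()


def build_key_hashes(keys, max_tries=100):
    """Return salt + digests, retrying the salt until no two keys collide."""
    for _ in range(max_tries):
        salt = os.urandom(8)
        digests = [hash_key(k, salt) for k in keys]
        if len(set(digests)) == len(digests):
            return salt + b"".join(digests)
    raise RuntimeError("no collision-free salt found")


def find_key(raw, key):
    """Return the position of key among the hashes, or -1 if absent."""
    salt, body = raw[:8], raw[8:]
    target = hash_key(key, salt)
    for i in range(0, len(body), 8):
        if body[i:i + 8] == target:
            return i // 8
    return -1
```

A lookup of this kind never touches the keys file, which matches the lazy-loading behavior described above. A real implementation would likely consult the Bloom filter first and use a faster search than this linear scan.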
MD5 file
Finally, every dataset comes with a .msgpack.md5 file with hashes for the data and index files, so their integrity can be verified.
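Verifying a single file against a known digest might look like the sketch below. The layout of the .msgpack.md5 file is not described here, so the example does not parse it. It assumes the expected hex digest has already been obtained, and both function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's MD5 digest matches the expected hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks matters because data files can be far larger than available memory.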
Limitations
Since msgpack is used, datadings inherits its limitations.
Maps and lists cannot have more than 2^32-1 entries.
Strings and binary data cannot be longer than 2^32-1 bytes.
Integers (signed or unsigned) are limited to 64 bits.
Floats are limited to single or double precision.
This means each dataset is limited to fewer than 2^32 samples (since the index uses a map) and around 2^64 bytes total file size (the largest possible byte offset is 2^64-1). The same applies to each individual sample regarding the number of keys present and its packed size.
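To put these bounds in concrete terms, the short sketch below computes them and checks which values fit into a uint64 offset. The helper name is hypothetical.

```python
import struct

MAX_ENTRIES = 2**32 - 1   # largest map or list length in msgpack
MAX_OFFSET = 2**64 - 1    # largest value a uint64 offset can hold


def fits_uint64(value):
    """True if value can be stored as a big-endian uint64."""
    try:
        struct.pack(">Q", value)
        return True
    except (struct.error, OverflowError):
        return False


# 2**64 bytes expressed in exbibytes (1 EiB = 2**60 bytes).
size_limit_eib = 2**64 // 2**60
```

In practice the sample-count limit of roughly 4.29 billion is the one a very large dataset would reach first. The 16 EiB file size limit is well beyond current storage.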
Legacy index file
Previous versions of datadings used a different index arrangement.
An index file with the .msgpack.index extension contained a map of key-offset pairs.
In Python notation:
{"sample 1": 0, "sample 2": 1234, ... }
For every key in the dataset it gives the offset in bytes of the respective sample from the beginning of the file.
Index entries are stored with offsets in ascending order.
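The relation between a legacy index and the separate keys and offsets lists can be sketched as follows. This is not the conversion code datadings ships. The function name is hypothetical, and the third entry in the example map is invented for illustration.

```python
def split_legacy_index(legacy_index):
    """Split a key-to-offset map into parallel keys and offsets lists.

    Entries are sorted by offset so that position i in both lists
    refers to the ith sample in the data file.
    """
    pairs = sorted(legacy_index.items(), key=lambda item: item[1])
    keys = [key for key, _ in pairs]
    offsets = [offset for _, offset in pairs]
    return keys, offsets


# Example map in the legacy layout; the last entry is made up.
legacy_index = {"sample 1": 0, "sample 2": 1234, "sample 3": 2048}
keys, offsets = split_legacy_index(legacy_index)
```

Sorting explicitly keeps the result correct even if the map was loaded in a different order than it was written.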