datadings’ documentation

datadings is a collection of tools to prepare datasets for machine learning, based on two simple principles:

Datasets are collections of individual data samples.

Each sample is a dictionary with descriptive keys.

For supervised training with images, samples are dictionaries like this:

{"key": unique_key, "image": imagedata, "label": label}

Mission statement

Dealing with different datasets can be tedious for machine learning practitioners. Two datasets almost never share the same directory structure, and custom file formats are common. How datadings fits into the picture is best explained by XKCD #927:

[Image: XKCD #927, "Standards"]

Slightly less cynically, datadings aims to make dealing with datasets fast and easy. datadings currently supports over 20 different datasets for image classification, segmentation, saliency prediction, and remote sensing. pip install datadings and use the datadings-write command (datadings-write -h for more info) to download the source files for any of the included datasets and convert them to the datadings format. And since it is based on the excellent msgpack, a JSON-like format that supports binary data, it is space-efficient, blazingly fast, schema-free, and supported in over 50 programming languages and environments. You are also not tied to any specific learning framework; you only need Python if you want to use the additional tools provided by datadings.
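To illustrate why msgpack is a good fit, a sample dict, binary image data included, round-trips through the msgpack-python package without any schema definition (a small standalone sketch, independent of datadings itself; the key and label values are made up):

import msgpack

# pack a sample dict, binary image bytes included, into a compact buffer
sample = {'key': 'val_00000001', 'image': b'\x89PNG...', 'label': 7}
buf = msgpack.packb(sample)

# unpack it again; keys and types are self-describing, no schema needed
restored = msgpack.unpackb(buf)
assert restored['label'] == 7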

Fast, you say?

The ImageNet dataset (the ILSVRC2012 challenge dataset, to be precise) is prominently featured in many scientific publications. Tutorials on how to train models with it usually recommend unpacking the large training and validation set tar files into separate folders. That leaves you with roughly 1.3 million tiny files to load per epoch of training. This is bad for HDDs and doubly bad if you access them over the network. While datadings supports reading from datasets like these with the DirectoryReader, it reads at a leisurely pace of about 500 samples/s, so a single pass over the training set takes about 40 minutes. That is not fast enough for modern GPUs.

Once converted into the datadings format, you can easily saturate 10 Gbit/s Ethernet, reading well over 20000 samples/s with the MsgpackReader.
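You can verify the throughput on your own hardware with nothing but the documented iteration interface (a sketch; dataset.msgpack stands for whatever file datadings-write produced):

import time
from datadings.reader import MsgpackReader

with MsgpackReader('dataset.msgpack') as reader:
    start = time.perf_counter()
    n = 0
    for sample in reader:
        n += 1
    elapsed = time.perf_counter() - start
    # samples per second over one full pass
    print(f'{n} samples in {elapsed:.1f}s ({n / elapsed:.0f} samples/s)')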

It also takes several seconds to start reading from the directory tree, whereas reading from msgpack files starts almost instantly. This makes debugging a breeze. Check out the file format description if you want to know how this is achieved.
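The startup cost can be checked the same way, by timing how long it takes until the first sample arrives (again only using the iteration interface shown in the TL;DR below):

import time
from datadings.reader import MsgpackReader

start = time.perf_counter()
with MsgpackReader('dataset.msgpack') as reader:
    first = next(iter(reader))
# with msgpack files this should be near-instant
print(f'time to first sample: {time.perf_counter() - start:.3f}s')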

TL;DR

pip install datadings and use the datadings-write command to create the dataset files (datadings-write -h for more info). It creates a dataset.msgpack file. In your code, open this file with the MsgpackReader and iterate over the samples:

from datadings.reader import MsgpackReader

with MsgpackReader('dataset.msgpack') as reader:
    for sample in reader:
        # each sample is a dict, e.g. {"key": ..., "image": ..., "label": ...}
        print(sample['key'])
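If your samples follow the image-classification layout shown at the top, the encoded image bytes can be decoded with any image library, for example Pillow (an assumption here: "image" holds encoded bytes such as JPEG or PNG, as in the sample dict above):

import io
from PIL import Image
from datadings.reader import MsgpackReader

with MsgpackReader('dataset.msgpack') as reader:
    for sample in reader:
        # decode the encoded image bytes into a PIL image
        image = Image.open(io.BytesIO(sample['image']))
        print(sample['key'], image.size, sample['label'])
        break  # just demonstrate the first sample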
