Creating a custom dataset

Follow this guide if the dataset you want to use is not yet part of datadings. If the data is publicly available, please consider contributing it to datadings.

A basic example

Converting a dataset into the datadings format typically splits into two steps: locating, loading, and pre-processing the individual samples, and writing them to the dataset file. Here’s a ready-to-run example that illustrates this:

import random
from datadings.writer import FileWriter


def generate_samples():
    # generate 1000 dummy samples with 10 kB of payload each
    for i in range(1000):
        data = i.to_bytes(10000, 'big')
        label = random.randrange(10)
        yield {'key': str(i), 'data': data, 'label': label}


def main():
    # use the writer as a context manager so the file is
    # finalized properly when writing ends
    with FileWriter('dummy.msgpack') as writer:
        for sample in generate_samples():
            writer.write(sample)


if __name__ == '__main__':
    main()

The FileWriter should be used as a context manager to ensure that the file is closed properly. Its write method accepts samples as dictionaries; each must contain a unique string under the 'key' entry.
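To verify the result you can read the file back with the MsgpackReader. A minimal sketch, assuming the reader is iterable and reports its length via len():

from datadings.reader import MsgpackReader


def inspect():
    reader = MsgpackReader('dummy.msgpack')
    # each sample comes back as the dict that was written
    print(len(reader), 'samples')
    for sample in reader:
        print(sample['key'], sample['label'])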

Converting directory trees

Apart from the featured MsgpackReader, datadings also provides the DirectoryReader class to read samples from directory trees. Let’s assume your dataset is currently stored in a directory tree like this:

yourdataset/
    a/
        a1
        a2
    b/
        b3
        b4

You can now simply replace the generate_samples function above with a DirectoryReader:

from datadings.reader import DirectoryReader


def main():
    with FileWriter('yourdataset.msgpack') as writer:
        for sample in DirectoryReader('yourdataset/{LABEL}/**'):
            writer.write(sample)

The name of the directory at the level marked by {LABEL} is used as the label, the path to the file from the label onwards becomes the key, and the file contents are loaded into data:

{'key': 'a/a1', 'label': 0, 'path': 'yourdataset/a/a1',
 '_additional_info': [], '_label': 'a', 'data': b'content of a1'}
{'key': 'a/a2', 'label': 0, 'path': 'yourdataset/a/a2',
 '_additional_info': [], '_label': 'a', 'data': b'content of a2'}
{'key': 'b/b3', 'label': 1, 'path': 'yourdataset/b/b3',
 '_additional_info': [], '_label': 'b', 'data': b'content of b3'}
{'key': 'b/b4', 'label': 1, 'path': 'yourdataset/b/b4',
 '_additional_info': [], '_label': 'b', 'data': b'content of b4'}

You can now make any additional changes to the samples before handing them off to the writer, as sketched below. Check the reference for more details on how you can influence the DirectoryReader’s behavior.
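For example, the following sketch drops the helper entries added by the DirectoryReader and writes only the key, data, and label (the entry names match the sample dicts shown above):

def main():
    keep = ('key', 'data', 'label')
    with FileWriter('yourdataset.msgpack') as writer:
        for sample in DirectoryReader('yourdataset/{LABEL}/**'):
            # drop 'path', '_label', and other helper entries
            writer.write({k: sample[k] for k in keep})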

If your dataset is not a directory tree but is stored in a ZIP file, you can use the ZipFileReader instead.
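A sketch of the equivalent conversion; the exact ZipFileReader arguments shown here (archive path plus a {LABEL} pattern) are an assumption, so check the reference for the actual signature:

from datadings.reader import ZipFileReader


def main():
    with FileWriter('yourdataset.msgpack') as writer:
        # assumed: ZipFileReader takes the archive path and a
        # {LABEL} pattern like DirectoryReader; see the reference
        for sample in ZipFileReader('yourdataset.zip', '{LABEL}/**'):
            writer.write(sample)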

More complex datasets

If your dataset consists of multiple files per sample, needs additional metadata, or is stored in an unusual way (like a single large TAR file that you don’t want to extract), you will need to write additional code to provide the samples. You can take a look at the source code of the included datasets like MIT1003 for pointers.
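For instance, samples stored in a single TAR archive can be streamed straight to the writer with Python’s standard tarfile module, without extracting anything to disk. A minimal sketch, assuming one file per sample with the top-level directory name as the label (the file name 'yourdataset.tar' is a placeholder):

import tarfile

from datadings.writer import FileWriter


def generate_samples(path):
    labels = {}
    with tarfile.open(path) as tar:
        # iterate over archive members without extracting them
        for member in tar:
            if not member.isfile():
                continue
            # e.g. 'a/a1' -> label directory 'a'; numeric labels
            # are assigned in order of first appearance
            name = member.name.split('/')[0]
            label = labels.setdefault(name, len(labels))
            data = tar.extractfile(member).read()
            yield {'key': member.name, 'data': data, 'label': label}


def main():
    with FileWriter('yourdataset.msgpack') as writer:
        for sample in generate_samples('yourdataset.tar'):
            writer.write(sample)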