Contributing a dataset

To contribute a new dataset to datadings, please follow the steps below and create a merge request in our Gitlab repository.

Each dataset defines modules to read and write in the datadings.sets package. Typically the read module contains additional meta-data that is common for all samples, like class labels or distributions. The convention is that for a dataset called FOO, these modules are called FOO and FOO_write.

Metadata

Small amounts of data or data that is not available for download should be added to datadings directly. Examples of this are class labels/distributions/weights/colors, file lists, etc. Anything larger than 1 kiB should be included as zopfli- or xz-compressed text, JSON, or msgpack files to reduce the size of the repository and distributed wheels. zopfli may give slightly better compression for very small files, while xz is vastly superior for larger files. Keep in mind that higher xz levels can require considerable amounts of memory for decompression. This shell script will try gzip -9, zopfli (if available), and all xz levels 0 to 9e and report the file size in bytes and the memory used for decompression in kiB:

#!/bin/bash
set -e

echo -e "comp\tmemory\tsize"
bytes=$(stat -c %s "$FILE")
echo -e "none\t-\t$bytes"

gzip -9 -k -f "$FILE"
bytes=$(ls -l "$FILE.gz" | cut -d " " -f 5)
mem=$( 2>&1 /usr/bin/time -f "%M" gunzip -f -k "$FILE.gz")
echo -e "gzip -9\t$mem\t$bytes"

if command -v zopfli &> /dev/null; then
    zopfli -k -f "$FILE"
    bytes=$(ls -l "$FILE.gz" | cut -d " " -f 5)
    mem=$( 2>&1 /usr/bin/time -f "%M" gunzip -f -k "$FILE.gz")
    echo -e "zopfli\t$mem\t$bytes"
fi

for LEVEL in 0 0e 1 1e 2 2e 3 3e 4 4e 5 5e 6 6e 7 7e 8 8e 9 9e; do
    xz -$LEVEL -k -f "$FILE"
    bytes=$(ls -l "$FILE.xz" | cut -d " " -f 5)
    mem=$( 2>&1 /usr/bin/time -f "%M" unxz -f -k "$FILE.xz")
    echo -e "xz -$LEVEL\t$mem\t$bytes"
done

For example, here is the sorted output for ILSVRC2012_val.txt:

$ FILE=ILSVRC2012_val.txt ./testcomp.sh | sort -n -t $'\t' -k 3
comp    memory  size
xz -6   5068    123436
xz -7   4616    123436
xz -8   5140    123436
xz -9   3824    123436
xz -0e  2652    123540
xz -1e  3432    123540
xz -2e  4020    123540
xz -4e  4496    123540
xz -6e  5576    123540
xz -7e  5760    123540
xz -8e  6008    123540
xz -9e  5092    123540
xz -3e  4388    123680
xz -5e  4012    123680
xz -5   4320    125460
xz -2   3952    166608
xz -3   4016    167436
xz -1   3224    167828
xz -0   2720    168588
xz -4   3796    168924
zopfli  3120    201604
gzip -9 3292    229789
none    -       1644500

Surprisingly, xz -0e is the clear winner here, giving an excellent compression ratio with very low memory requirements.

Read module

Add a module called FOO to the datadings.sets package and add/load available meta-data, if any. For less complex datasets this module may be empty, but it should be added anyway.
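
For illustration, assuming the class labels were stored as an xz-compressed JSON file named FOO_labels.json.xz next to the module (the file name and layout are made up for this sketch), the read module could load them with nothing but the standard library:

import json
import lzma
import os.path as pt

# hypothetical compressed metadata file shipped with the package;
# the actual file name and structure depend on the dataset
_LABELS_FILE = pt.join(pt.dirname(__file__), 'FOO_labels.json.xz')

with lzma.open(_LABELS_FILE, 'rt', encoding='utf-8') as f:
    LABELS = json.load(f)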

More complex datasets

For some datasets it simply does not make sense to convert them to the datadings file format. Typically, conversion requires (at least temporarily) roughly twice the space of the original data. Another example is large video files that should really be streamed while decoding instead of being loaded all at once, which datadings does not yet support. For these and similar cases it (at least currently) does not make sense to use the datadings msgpack format with the MsgpackReader. Instead, we recommend that the FOO module provide a FOOReader class that extends datadings.reader.Reader or one of its subclasses. An effort should be made to keep processing times low: the FOOReader should read directly from the source files of the dataset and perform only limited pre-processing. Any slow analysis, such as inspecting every image, should be done once offline so it does not have to be repeated on every iteration over the dataset.

Write module

Now add another module called FOO_write. This will be an executable that writes dataset files. There are generally four steps to the writing process:

  • Argument parsing.

  • Download and verify source files.

  • Locate and load sample data.

  • Convert and write samples to dataset.

If you prefer to learn from code, the CAT2000_write module is a relatively simple, yet full-featured example.

Argument parsing

Scripts typically lean heavily on the datadings.argparse module to parse command line arguments. It provides utility functions like make_parser to create argument parsers with sensible default settings and a lot of commonly used arguments already added. For example, most datasets need an indir argument, where source files are located, as well as an optional outdir argument, which is where the dataset files will be written. The datadings.argparse module also provides functions to add lesser-used arguments in a single line, including descriptive help text. By convention, a function called argument_indir adds the indir argument to the given parser, including additional configuration and help text.

For a simple dataset with no additional arguments, a main function might begin like this:

def main():
    from ..argparse import make_parser
    from ..argparse import argument_threads
    from ..tools import prepare_indir

    parser = make_parser(__doc__)
    argument_threads(parser)
    args = parser.parse_args()
    outdir = args.outdir or args.indir

Download and verify

If possible, datadings should download source files. This is not possible for all datasets, because data might only be available on request or after registration. If that is the case, add a description to the docstring on how to download the data. Manual pre-processing steps, like unpacking archives, should be avoided if at all possible. The only exception to this rule is if Python is ill-equipped to handle the source file format, e.g., some unusual compression scheme like 7zip.
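
For instance, a hypothetical module docstring (which is passed to make_parser, as seen below, and thus becomes the command line help) could describe the manual steps like this; the dataset name, URL, and file names are placeholders:

"""Create the FOO dataset.

FOO cannot be downloaded automatically. Please register at
https://example.com/foo, download FOO_train.zip and FOO_test.zip,
and place both archives in the input directory.
"""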

If downloading is possible, datadings provides some convenient tools to do so. First, define which files are required in a global variable called FILES, for example, for the CAT2000 dataset:

BASE_URL = 'http://saliency.mit.edu/'
FILES = {
    'train': {
        'path': 'trainSet.zip',
        'url': BASE_URL+'trainSet.zip',
        'md5': '56ad5c77e6c8f72ed9ef2901628d6e48',
    },
    'test': {
        'path': 'testSet.zip',
        'url': BASE_URL+'testSet.zip',
        'md5': '903ec668df2e5a8470aef9d8654e7985',
    }
}

Our example defines "train" and "test" files, each with a relative path, a URL to download it from, and an MD5 hash to verify its integrity. The verification step is especially important, since we want to support reusing previously downloaded files, so we need to make sure that the file we are using is actually what we expect to find.

This dictionary of file definitions can now be given to helper functions from the datadings.tools module. Most convenient is prepare_indir, which first attempts to download (if a URL is given) and verify each file. If successful, it then returns a dict where all paths are replaced with the true location of each file.
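
For illustration (the exact contents of the returned dict beyond the paths are not guaranteed here), the result mirrors the structure of FILES, so the local path of each file is available like this:

files = prepare_indir(FILES, args)
# the verified local copy of trainSet.zip, somewhere inside args.indir
train_zip = files['train']['path']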

Our main function now looks like this:

def main():
    from ..argparse import make_parser
    from ..argparse import argument_threads
    from ..tools import prepare_indir

    parser = make_parser(__doc__)
    argument_threads(parser)
    args = parser.parse_args()
    outdir = args.outdir or args.indir

    files = prepare_indir(FILES, args)

Locate, load, and write data

These steps heavily depend on the structure of the dataset, so this guide can only provide general guidelines. We recommend first defining a generator that loads and yields one sample at a time:

def yield_samples(stuff):
    samples = []  # find samples in stuff
    for sample in samples:
        # load data from source file
        yield SampleType(data, metadata, etc)

Instead of returning individual values, it is recommended to use one of the provided type functions from datadings.sets.types. New types can be added if none of them fits your dataset. Type functions are generated by the generate_types.py script from the definitions in generate_types.json.
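
In essence, a type function just builds a dict with a fixed set of keys. As a purely illustrative sketch (the real functions and field names are those generated into datadings.sets.types, not necessarily these), an image sample type could look like this:

def ImageData(key, image):
    # illustrative only; use the generated functions from datadings.sets.types
    return {
        'key': key,      # unique key of the sample (see the note below)
        'image': image,  # encoded image data
    }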

The generator is used by a write_set function, which is called once per split of the dataset. Here, create a FileWriter with the desired output path and pass samples to it:

def write_set(split, stuff, outdir, args):
    gen = yield_samples(stuff)
    outfile = pt.join(outdir, split + '.msgpack')
    writer = FileWriter(outfile, total=num_samples, overwrite=args.no_confirm)
    with writer:
        for sample in gen:
            writer.write(sample)

Important

Samples must have a unique "key". An exception will be raised if keys are repeated.

Note

If the overwrite parameter of the writer is False, the user will be prompted to overwrite an existing file. The user can now:

  • Accept to overwrite the file.

  • Decline, which raises a FileExistsError. The program should continue as if writing had finished.

  • Abort, which raises a KeyboardInterrupt. The program should abort immediately.

The default argument parser accepts a no_confirm argument, which is passed to the overwrite parameter.

The final function write_sets will call write_set once per split of the dataset:

def write_sets(files, outdir, args):
    for split in ('train', 'test'):
        try:
            write_set(split, files[split]['path'], outdir, args)
        except FileExistsError:
            continue
        except KeyboardInterrupt:
            break

Note

We catch the FileExistsError and KeyboardInterrupt, which may be raised by the writer.

The final main function now looks like this. We call write_sets and wrap the main function itself to catch keyboard interrupts by the user:

def main():
    from ..argparse import make_parser
    from ..tools import prepare_indir

    parser = make_parser(__doc__)
    args = parser.parse_args()
    outdir = args.outdir or args.indir

    files = prepare_indir(FILES, args)

    write_sets(files, outdir, args)


if __name__ == '__main__':
    try:
        main()
    except KeyboardInterrupt:
        pass
    finally:
        print()

Writing faster

Since datadings is all about speed and convenience, which are highly related when it comes to writing datasets, you may want to optimize your program to increase the write speed. Two relatively simple optimizations are recommended.

First, the generator can be wrapped with the datadings.tools.yield_threaded() function, which runs the generator in a background thread. This effectively decouples filesystem read and write operations, but does not help if the CPU is the bottleneck.

In those cases where the bottleneck is neither reading nor writing, but a costly conversion (e.g., transcoding images), a thread or process pool can be used to parallelize this step:

def write_set(split, stuff, outdir, args):
    gen = yield_threaded(yield_samples(stuff))

    def costly_conversion(sample):
        # do something that you want parallelized
        return sample

    outfile = pt.join(outdir, split + '.msgpack')
    writer = FileWriter(outfile, total=num_samples, overwrite=args.no_confirm)
    pool = ThreadPool(args.threads)
    with writer:
        for sample in pool.imap_unordered(costly_conversion, gen):
            writer.write(sample)

Note

Add datadings.argparse.argument_threads() to the parser to allow users to control the number of threads.

Note

imap_unordered makes no guarantees about the order of the returned samples. If the order is important, consider using a different method like imap. Beware though that this may use substantially more memory, as samples are stored in memory until they can be returned in the correct order.
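
If the order is important and the additional memory use is acceptable, only the pool call changes; a minimal variation of the loop above:

    with writer:
        for sample in pool.imap(costly_conversion, gen):
            writer.write(sample)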