datadings.sets.YFCC100m module

The Yahoo Flickr Creative Commons 100 Million (YFCC100m) dataset.

Important

Only images are included. No videos or metadata.

Warning

This code is intended to load a pre-release version of the YFCC100m dataset. Please complain if you want to use the release version available from Amazon: https://multimediacommons.wordpress.com/yfcc100m-core-dataset/

Important

Samples have the following keys:

  • "key"

  • "image"

class datadings.sets.YFCC100m.DevNull[source]

Bases: object

close()[source]
read(*_)[source]
write(*_)[source]
class datadings.sets.YFCC100m.YFCC100mReader(image_packs_dir, validator=<function noop>, reject_file_paths=('/home/docs/checkouts/readthedocs.org/user_builds/datadings/checkouts/latest/datadings/sets/YFCC100m_rejected_images.msgpack.xz', ), error_file=None, error_file_mode='a')[source]

Bases: Reader

Special reader for the YFCC100m dataset only. It reads images from 10000 ZIP files of roughly 10000 images each.

One pass over the whole dataset was made to filter out irrelevant images. An image is rejected if any of the following conditions is met:

  • Image is damaged/incomplete.

  • Less than 2600 bytes.

  • Exactly 9218 bytes - a placeholder image from Flickr.

  • Less than 20000 bytes and less than 5% of lines in the image have a variance less than 50.

Which images are rejected is controlled by the files given as reject_file_paths. Set this to None or an empty list to iterate over the whole dataset.

Parameters:
  • image_packs_dir – Path to directory with image ZIP files.

  • validator – Callable validator(data: bytes) -> Union[bytes, None]. Validates images before they are returned. Receives image data and returns data or None.
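The validator contract (bytes in, bytes or None out) can be illustrated with a minimal sketch. The function name size_validator and the two constants are hypothetical. The sketch covers only the two byte-size conditions listed above, not the damage or variance checks, and it is not the module's own validate_image.

```python
from typing import Optional

# Thresholds taken from the rejection conditions listed above.
MIN_BYTES = 2600          # images smaller than this are rejected
PLACEHOLDER_BYTES = 9218  # exact size of the Flickr placeholder image


def size_validator(data: bytes) -> Optional[bytes]:
    """Return data unchanged if it passes the size checks, else None."""
    if len(data) < MIN_BYTES:
        return None
    if len(data) == PLACEHOLDER_BYTES:
        return None
    return data
```

Going by the constructor signature, such a callable would be passed as validator=size_validator.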

Warning

A validating reader cannot be copied, and copying readers with error_file paths is strongly discouraged.

Warning

The methods get, slice, find_index, find_key, seek_index, and seek_key are considerably slower for this reader than for others. Use iterators and large slice ranges instead.
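One way to follow this advice is to request a few large slices instead of making many single-sample get calls. The sketch below is illustrative only: iter_in_chunks and FakeReader are hypothetical names. The helper assumes only that the reader has a slice(start, stop) method returning an iterable of samples, as documented for this class, and that the caller supplies the total number of samples.

```python
from typing import Any, Iterator


def iter_in_chunks(reader: Any, total: int, chunk_size: int = 10000) -> Iterator[Any]:
    """Yield samples by requesting large consecutive slices.

    reader: any object with a slice(start, stop) method.
    total: number of samples to visit, supplied by the caller.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total, chunk_size):
        stop = min(start + chunk_size, total)
        # One slice call per chunk instead of one get call per sample.
        yield from reader.slice(start, stop)


class FakeReader:
    """Hypothetical stand-in used only to exercise the helper."""

    def __init__(self, n: int) -> None:
        self.samples = [{"key": str(i), "image": b""} for i in range(n)]
        self.calls = 0  # counts how many slice calls were made

    def slice(self, start: int, stop: int = None) -> Iterator[dict]:
        self.calls += 1
        return iter(self.samples[start:stop])
```

With a real reader, the FakeReader instance would be replaced by a YFCC100mReader. The default chunk size of 10000 is an arbitrary choice for this sketch.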

find_index(key)[source]

Returns the index of the sample with the given key.

find_key(index)[source]

Returns the key of the sample with the given index.

get(index, yield_key=False, raw=False, copy=True)[source]

Returns sample at given index.

copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.
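The memory trade-off can be seen with plain Python objects, independent of this reader. In the sketch below, buffer is a hypothetical stand-in for a larger block of packed data. It shows that a memoryview over a small range keeps a reference to the entire underlying buffer, whereas a bytes copy does not.

```python
# A large buffer standing in for a block of packed samples.
buffer = bytes(1_000_000)

# Zero-copy view of a small "sample" inside the buffer.
view = memoryview(buffer)[100:200]

# An explicit copy of the same 100 bytes, independent of the buffer.
copied = bytes(view)

# The view is only 100 bytes long, but view.obj is the original
# buffer, so the full megabyte stays alive as long as the view does.
keeps_whole_buffer = view.obj is buffer
```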

Parameters:
  • index – Index of the sample

  • yield_key – If True, returns (key, sample)

  • raw – If True, returns sample as msgpacked message

  • copy – If False, allow the reader to return data as memoryview objects instead of bytes

Returns:

Sample at index.

open_error_file_()[source]
slice(start, stop=None, yield_key=False, raw=False, copy=True)[source]

Returns a generator of samples selected by the given slice.

copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.

Parameters:
  • start – start index of slice

  • stop – stop index of slice

  • yield_key – if True, yield (key, sample)

  • raw – if True, returns sample as msgpacked message

  • copy – if False, allow the reader to return data as memoryview objects instead of bytes

Returns:

Iterator of selected samples

datadings.sets.YFCC100m.decode_fast(data)[source]
datadings.sets.YFCC100m.main()[source]
datadings.sets.YFCC100m.noop(data)[source]
datadings.sets.YFCC100m.validate_image(data)[source]