datadings.sets.YFCC100m module

The Yahoo Flickr Creative Commons 100 Million (YFCC100m) dataset.

Important

Only images are included. No videos or metadata.

Warning

This code is intended to load a pre-release version of the YFCC100m dataset. Please complain if you want to use the release version available from Amazon: https://multimediacommons.wordpress.com/yfcc100m-core-dataset/

Important

Samples have the following keys:

  • "key"

  • "image"

class datadings.sets.YFCC100m.DevNull[source]

Bases: object

close()[source]
read(*_)[source]
write(*_)[source]
class datadings.sets.YFCC100m.YFCC100mReader(image_packs_dir, validator=<function noop>, reject_file_paths=('/home/docs/checkouts/readthedocs.org/user_builds/datadings/checkouts/stable/datadings/sets/YFCC100m_rejected_images.msgpack.xz', ), error_file=None, error_file_mode='a')[source]

Bases: datadings.reader.reader.Reader

Special reader for the YFCC100m dataset only. It reads images from 10000 ZIP files of roughly 10000 images each.

One pass over the whole dataset was made to filter out irrelevant images. An image is rejected if any of the following conditions is met:

  • Image is damaged/incomplete.

  • Less than 2600 bytes.

  • Exactly 9218 bytes - a placeholder image from Flickr.

  • Less than 20000 bytes and less than 5% of lines in the image have a variance less than 50.

Which images are rejected is controlled by the files given as reject_file_paths. Set this to None or an empty list to iterate over the whole dataset.

Parameters
  • image_packs_dir – Path to directory with image ZIP files.

  • validator – Callable validator(data: bytes) -> Union[bytes, None]. Validates images before they are returned. Receives image data and returns data or None.
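As an illustration of the validator signature described above, here is a minimal sketch of a callable that applies two of the size-based rejection rules listed earlier. The name size_validator and the two constants are hypothetical and not part of datadings. Treating a None return value as "reject this image" is an assumption; the source only states that the validator returns data or None.

```python
from typing import Optional

MIN_BYTES = 2600          # smaller files are rejected per the conditions above
PLACEHOLDER_BYTES = 9218  # exact size of the Flickr placeholder image


def size_validator(data: bytes) -> Optional[bytes]:
    """Hypothetical validator: return data unchanged, or None to reject."""
    n = len(data)
    if n < MIN_BYTES or n == PLACEHOLDER_BYTES:
        return None
    return data
```

A callable of this shape could be passed as validator=size_validator when constructing the reader.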

Warning

A validating reader cannot be copied, and copying readers with error_file paths is strongly discouraged.

Warning

The methods get, slice, find_index, find_key, seek_index, and seek_key are considerably slower for this reader than for others. Use iterators and large slice ranges instead.

find_index(key)[source]

Returns the index of the sample with the given key.

find_key(index)[source]

Returns the key of the sample with the given index.

get(index, yield_key=False, raw=False, copy=True)[source]

Returns sample at given index.

copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.

Parameters
  • index – Index of the sample

  • yield_key – If True, returns (key, sample)

  • raw – If True, returns sample as msgpacked message

  • copy – if False, allow the reader to return data as memoryview objects instead of bytes

Returns

Sample at index.
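The memory caveat for copy=False can be demonstrated with plain Python, independent of datadings. The following sketch uses a stand-in buffer (not actual reader output) to show that a small memoryview keeps its entire underlying buffer alive until the view is copied and released.

```python
# Stand-in for a large chunk of data read in one go (not datadings code).
buffer = bytes(10_000_000)

# A small zero-copy view: no bytes are copied, but the view holds a
# reference to the entire 10 MB buffer.
view = memoryview(buffer)[:100]
assert view.obj is buffer
assert view.nbytes == 100

# Copying detaches the small sample from the large buffer.
sample = bytes(view)
view.release()
del buffer  # the 10 MB object can now be freed; sample remains valid
```

If samples are held for a long time, converting them with bytes() as shown, or simply leaving copy=True, avoids pinning the larger buffer.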

next()[source]

Returns the next sample.

This can be slow for file-based readers if a lot of samples are to be read. Consider using iter instead:

it = iter(reader)
while 1:
    next(it)
    ...

Or simply loop over the reader:

for sample in reader:
    ...

open_error_file_()[source]
rawnext()[source]

Return the next sample msgpacked as raw bytes.

This can be slow for file-based readers if a lot of samples are to be read. Consider using iter instead:

it = iter(reader)
while 1:
    next(it)
    ...

Or simply loop over the reader:

for sample in reader:
    ...

Included for backwards compatibility and may be deprecated and subsequently removed in the future.

seek_index(index)[source]

Seek to the given index.

seek_key(key)[source]

Seek to the sample with the given key.

slice(start, stop=None, yield_key=False, raw=False, copy=True)[source]

Returns a generator of samples selected by the given slice.

copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.

Parameters
  • start – start index of slice

  • stop – stop index of slice

  • yield_key – if True, yield (key, sample)

  • raw – if True, returns sample as msgpacked message

  • copy – if False, allow the reader to return data as memoryview objects instead of bytes

Returns

Iterator of selected samples
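Since random access is slow for this reader, one way to follow the advice about large slice ranges is to walk the dataset in a few big slices. The sketch below is an assumption-laden illustration: iter_in_chunks and FakeReader are hypothetical names, and FakeReader only mimics the slice(start, stop) method documented here so that the example runs without the dataset.

```python
def iter_in_chunks(reader, total, chunk_size=10_000):
    """Yield samples from reader using few, large slice() calls.

    total is the number of samples to read; it is passed in explicitly
    because this sketch makes no assumption about len(reader).
    """
    for start in range(0, total, chunk_size):
        stop = min(start + chunk_size, total)
        yield from reader.slice(start, stop)


class FakeReader:
    """Hypothetical stand-in exposing only slice(start, stop)."""

    def __init__(self, n):
        self._samples = [{"key": str(i), "image": b""} for i in range(n)]

    def slice(self, start, stop=None):
        return iter(self._samples[start:stop])


keys = [s["key"] for s in iter_in_chunks(FakeReader(25), total=25, chunk_size=10)]
```

With a real reader, the FakeReader instance would be replaced by the opened YFCC100mReader and chunk_size chosen as large as memory allows.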

datadings.sets.YFCC100m.decode_fast(data)[source]
datadings.sets.YFCC100m.main()[source]
datadings.sets.YFCC100m.noop(data)[source]
datadings.sets.YFCC100m.validate_image(data)[source]