datadings.sets.YFCC100m module
The Yahoo Flickr Creative Commons 100 Million (YFCC100m) dataset.
Important
Only images are included. No videos or metadata.
Warning
This code is intended to load a pre-release version of the YFCC100m dataset. Please complain if you want to use the release version available from Amazon: https://multimediacommons.wordpress.com/yfcc100m-core-dataset/
Important
Samples have the following keys:
"key"
"image"
- class datadings.sets.YFCC100m.YFCC100mReader(image_packs_dir, validator=<function noop>, reject_file_paths=('/home/docs/checkouts/readthedocs.org/user_builds/datadings/checkouts/latest/datadings/sets/YFCC100m_rejected_images.msgpack.xz', ), error_file=None, error_file_mode='a')
Bases:
Reader
Special reader for the YFCC100m dataset only. It reads images from 10000 ZIP files of roughly 10000 images each.
One pass over the whole dataset was made to identify irrelevant images. An image is rejected if it meets any of the following conditions:
Image is damaged/incomplete.
Less than 2600 bytes.
Exactly 9218 bytes - a placeholder image from Flickr.
Less than 20000 bytes and less than 5% of lines in the image have a variance less than 50.
Which images are rejected is controlled by the files given as reject_file_paths. Set this to None or an empty list to iterate over the whole dataset.
- Parameters:
image_packs_dir – Path to directory with image ZIP files.
validator – Callable validator(data: bytes) -> Union[bytes, None]. Validates images before they are returned. Receives image data and returns data or None.
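As a rough illustration of the validator contract described above, the following sketch returns the data unchanged or None. The name size_validator and the two constants are hypothetical; the thresholds are borrowed from the rejection list above, and the variance-based check is left out because it would require decoding the image.

```python
from typing import Optional

# Thresholds taken from the rejection conditions listed above.
MIN_BYTES = 2600          # anything smaller is rejected
PLACEHOLDER_BYTES = 9218  # exact size of the Flickr placeholder image


def size_validator(data: bytes) -> Optional[bytes]:
    """Hypothetical validator: return data if it passes the size checks, else None."""
    size = len(data)
    if size < MIN_BYTES or size == PLACEHOLDER_BYTES:
        return None
    return data
```

Going by the signature shown above, such a function would presumably be passed as YFCC100mReader(image_packs_dir, validator=size_validator).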
Warning
A validating reader cannot be copied, and copying readers with error_file paths is strongly discouraged.
Warning
Methods get, slice, find_index, find_key, seek_index, and seek_key are considerably slower for this reader compared to others. Use iterators and large slice ranges instead.
- get(index, yield_key=False, raw=False, copy=True)
Returns sample at given index.
copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.
- Parameters:
index – Index of the sample
yield_key – If True, returns (key, sample)
raw – If True, returns sample as msgpacked message
copy – If False, allow the reader to return data as memoryview objects instead of bytes
- Returns:
Sample at index.
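The memory trade-off of copy=False can be seen with plain Python objects. This is a minimal sketch that does not use the reader at all; the bytearray is only a stand-in for one large slice of packed data, and all variable names are illustrative.

```python
# Stand-in for one large slice of packed data held by a reader.
buffer = bytearray(1_000_000)

# Zero-copy: the 16-byte view references the large buffer instead of copying,
# so the whole buffer stays in memory for as long as the view exists.
view = memoryview(buffer)[:16]
keeps_buffer_alive = view.obj is buffer

# Copying: an independent 16-byte bytes object with no reference to the buffer.
copied = bytes(view)

# Releasing the view drops its hold on the buffer.
view.release()
```

Converting to bytes as soon as a sample is needed for longer than the current iteration is one way to avoid holding a whole slice in memory.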
- slice(start, stop=None, yield_key=False, raw=False, copy=True)
Returns a generator of samples selected by the given slice.
copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.
- Parameters:
start – start index of slice
stop – stop index of slice
yield_key – if True, yield (key, sample)
raw – if True, returns sample as msgpacked message
copy – if False, allow the reader to return data as memoryview objects instead of bytes
- Returns:
Iterator of selected samples
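Since large slice ranges are recommended over repeated get calls for this reader, one possible pattern is to walk the dataset in big chunks. The helper iter_in_chunks below is hypothetical and relies only on the slice(start, stop) method documented above; the total number of samples is passed in explicitly rather than assumed to be available from the reader.

```python
from typing import Any, Iterator


def iter_in_chunks(reader: Any, total: int, chunk_size: int = 10_000) -> Iterator[Any]:
    """Hypothetical helper: yield all samples using a few large slice() calls."""
    for start in range(0, total, chunk_size):
        # Clamp the last chunk so it does not run past the end.
        stop = min(start + chunk_size, total)
        yield from reader.slice(start, stop)
```

The default chunk size of 10000 is an arbitrary choice that happens to match the approximate number of images per ZIP file; whether chunks align with ZIP boundaries is not guaranteed.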