datadings.sets.YFCC100m module
The Yahoo Flickr Creative Commons 100 Million (YFCC100m) dataset.
Important
Only images are included. No videos or metadata.
Warning
This code is intended to load a pre-release version of the YFCC100m dataset. Please complain if you want to use the release version available from Amazon: https://multimediacommons.wordpress.com/yfcc100m-core-dataset/
Important
Samples have the following keys:
"key"
"image"
- class datadings.sets.YFCC100m.YFCC100mReader(image_packs_dir, validator=<function noop>, reject_file_paths=('/home/docs/checkouts/readthedocs.org/user_builds/datadings/checkouts/stable/datadings/sets/YFCC100m_rejected_images.msgpack.xz', ), error_file=None, error_file_mode='a')
Bases:
datadings.reader.reader.Reader
Special reader for the YFCC100m dataset only. It reads images from 10000 ZIP files of roughly 10000 images each.
One pass over the whole dataset was made to filter out irrelevant images. An image was rejected if any of the following conditions was met:
Image is damaged/incomplete.
Less than 2600 bytes.
Exactly 9218 bytes - a placeholder image from Flickr.
Less than 20000 bytes and less than 5% of lines in the image have a variance less than 50.
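The size-based conditions above can be illustrated with a minimal sketch. The function and constant names are hypothetical and not part of datadings. The damage check and the line-variance check are deliberately left out, since they require decoding the image.

```python
# Thresholds taken from the list above; the names are hypothetical.
MIN_BYTES = 2600
PLACEHOLDER_BYTES = 9218  # size of the Flickr placeholder image


def rejected_by_size(data: bytes) -> bool:
    """Return True if the image would be rejected based on its size alone."""
    size = len(data)
    return size < MIN_BYTES or size == PLACEHOLDER_BYTES
```

This only mirrors two of the four conditions. The actual rejection list shipped with the reader was computed once, ahead of time, and is not recomputed while reading.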
Which images are rejected is controlled by the files given as reject_file_paths. Set this to None or an empty list to iterate over the whole dataset.
- Parameters
image_packs_dir – Path to directory with image ZIP files.
validator – Callable validator(data: bytes) -> Union[bytes, None]. Validates images before they are returned. Receives image data and returns data or None.
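As an illustration of the validator signature, here is a minimal sketch of a callable that keeps only data starting with the JPEG magic bytes. The name jpeg_only is hypothetical, and checking for JPEG is an assumption made for the example, not something the reader requires.

```python
from typing import Optional

# First three bytes of every JPEG file (start-of-image marker).
JPEG_MAGIC = b"\xff\xd8\xff"


def jpeg_only(data: bytes) -> Optional[bytes]:
    """Return the data unchanged if it looks like a JPEG, otherwise None."""
    if bytes(data[:3]) == JPEG_MAGIC:
        return data
    return None


# Intended use (path is a placeholder, not run here):
# reader = YFCC100mReader("/path/to/image_packs", validator=jpeg_only)
```

A validator could also return modified data, since the signature allows any bytes as the return value.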
Warning
A validating reader cannot be copied, and it is strongly discouraged to copy readers with error_file paths.
Warning
The methods get, slice, find_index, find_key, seek_index, and seek_key are considerably slower for this reader compared to others. Use iterators and large slice ranges instead.
- get(index, yield_key=False, raw=False, copy=True)
Returns sample at given index.
copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.
- Parameters
index – Index of the sample
yield_key – If True, returns (key, sample)
raw – If True, returns sample as msgpacked message
copy – If False, allow the reader to return data as memoryview objects instead of bytes
- Returns
Sample at index.
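The memory caveat for copy=False follows from how memoryview works in Python in general: a view, however small, holds a reference to the entire underlying buffer. The sketch below shows this with plain bytes and does not involve the reader itself.

```python
# A large buffer standing in for a slice of data held by a reader.
buffer = bytes(1_000_000)

# A small view into the buffer; no bytes are copied.
view = memoryview(buffer)[:16]

# The view references the whole buffer, so the buffer cannot be freed
# while the view is alive.
keeps_buffer_alive = view.obj is buffer

# Converting to bytes makes an independent copy of just the 16 bytes.
independent = bytes(view)
```

If samples obtained with copy=False need to be kept for a long time, copying them with bytes() releases the reference to the larger buffer.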
- next()
Returns the next sample.
This can be slow for file-based readers if a lot of samples are to be read. Consider using iter instead:
    it = iter(reader)
    while 1:
        next(it)
        ...
Or simply loop over the reader:
    for sample in reader:
        ...
- rawnext()
Return the next sample msgpacked as raw bytes.
This can be slow for file-based readers if a lot of samples are to be read. Consider using iter instead:
    it = iter(reader)
    while 1:
        next(it)
        ...
Or simply loop over the reader:
    for sample in reader:
        ...
Included for backwards compatibility and may be deprecated and subsequently removed in the future.
- slice(start, stop=None, yield_key=False, raw=False, copy=True)
Returns a generator of samples selected by the given slice.
copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.
- Parameters
start – start index of slice
stop – stop index of slice
yield_key – if True, yield (key, sample)
raw – if True, returns sample as msgpacked message
copy – If False, allow the reader to return data as memoryview objects instead of bytes
- Returns
Iterator of selected samples
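Since this reader is documented as slow for individual get and seek calls, one way to follow the advice to use large slice ranges is to walk the dataset in big consecutive chunks. The sketch below assumes the reader supports len(). iter_in_chunks and FakeReader are hypothetical names, and FakeReader exists only so the example runs without the dataset.

```python
from typing import Any, Iterator


def iter_in_chunks(reader: Any, chunk_size: int = 10000) -> Iterator[Any]:
    """Yield every sample by requesting large, consecutive slices."""
    total = len(reader)  # assumption: the reader supports len()
    for start in range(0, total, chunk_size):
        stop = min(start + chunk_size, total)
        yield from reader.slice(start, stop)


class FakeReader:
    """Hypothetical stand-in that mimics len() and slice() of a reader."""

    def __init__(self, n: int) -> None:
        self._samples = [{"key": str(i), "image": b""} for i in range(n)]

    def __len__(self) -> int:
        return len(self._samples)

    def slice(self, start, stop=None, yield_key=False, raw=False, copy=True):
        yield from self._samples[start:stop]


samples = list(iter_in_chunks(FakeReader(25), chunk_size=10))
```

With a real reader, a chunk size on the order of one ZIP file's worth of images (roughly 10000) would be a natural starting point, though the best value is not stated in the documentation.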