datadings.reader.augment module

An Augment wraps a Reader (datadings.reader.reader.Reader) and changes how samples are iterated over. How readers are used is largely unaffected.

class datadings.reader.augment.Cycler(reader)[source]

Bases: Augment

Infinitely cycle a Reader (datadings.reader.reader.Reader). Iterators can be requested with any start/stop index. Large indexes simply wrap around.
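The wrap-around behaviour can be pictured with a small pure-Python sketch. It does not use datadings; `cycle_samples` is a hypothetical helper that stands in for a cycled reader and assumes that indexes wrap modulo the dataset length.

```python
from itertools import islice


def cycle_samples(samples, start=0, stop=None):
    """Yield samples[i % n] for i = start, start + 1, ...; endless if stop is None."""
    n = len(samples)
    i = start
    while stop is None or i < stop:
        yield samples[i % n]  # large indexes wrap around
        i += 1


data = ["a", "b", "c"]
# A bounded iterator whose stop index lies beyond the end of the data.
wrapped = list(cycle_samples(data, start=2, stop=7))
# An unbounded iterator must be cut off by the caller, here with islice.
endless = list(islice(cycle_samples(data, start=4), 4))
```

Because an iterator without a stop index never ends, the consumer decides how many samples to draw.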

iter(start=None, stop=None, yield_key=False, raw=False, copy=True, chunk_size=16)[source]

Iterate over the dataset.

start and stop behave like the parameters of the range function.

copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.
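The memory effect can be reproduced with plain Python objects. In this sketch `chunk` is only a stand-in for a block of data read at once; it is not how datadings lays out its internal buffers.

```python
# Stand-in for a large block of bytes that holds many samples.
chunk = bytes(1_000_000)

# Zero-copy: slicing a memoryview creates no new bytes object.
sample_view = memoryview(chunk)[10:20]

# Copying: bytes() materialises just the 10 bytes of the sample.
sample_copy = bytes(sample_view)

# The 10-byte view still references the whole 1 MB buffer, so the buffer
# cannot be freed while the view is alive.
keeps_whole_chunk = sample_view.obj is chunk
```

Holding on to `sample_copy` costs 10 bytes of payload, while holding on to `sample_view` keeps the full buffer alive.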

Parameters:

start – start of range; if None, current index is used
stop – stop of range
yield_key – if True, yields (key, sample) pairs.
raw – if True, yields samples as msgpacked messages.
copy – if False, allow the reader to return data as memoryview objects instead of bytes
chunk_size – number of samples read at once; bigger values can increase throughput, but require more memory

Returns:

Iterator

class datadings.reader.augment.QuasiShuffler(reader, buf_size=0.01, seed=None)[source]

Bases: Augment

A slightly less random alternative to a true Shuffler (datadings.reader.augment.Shuffler), but much faster.

The dataset is divided into equal-size chunks that are read in random order. Shuffling follows these steps:

1. Fill the buffer with random chunks.
2. Read the next random chunk.
3. Select a random sample from the buffer and yield it.
4. Replace the sample with the next sample from the current chunk.
5. If there are chunks left, go to step 2.
6. Shuffle the buffer and yield its contents.

This means there are typically more samples from the current chunk in the buffer than there would be if a true shuffle were used. This effect is more pronounced for smaller fractions \(\frac{B}{C}\), where \(C\) is the chunk size and \(B\) the buffer size. As a rule of thumb, it is sufficient to keep \(\frac{B}{C}\) roughly equal to the number of classes in the dataset.
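The shuffling steps can be sketched in pure Python as follows. This is an illustration of the described procedure, not the library's implementation: `quasi_shuffle` and its parameters are hypothetical, the dataset is a plain list, and the last chunk is allowed to be shorter than the others.

```python
import random


def quasi_shuffle(samples, buf_size, chunk_size, seed=0):
    """Yield every sample once in quasi-random order (illustrative sketch)."""
    rng = random.Random(seed)
    # Divide the dataset into chunks and put the chunks in random order.
    chunks = [samples[i:i + chunk_size]
              for i in range(0, len(samples), chunk_size)]
    rng.shuffle(chunks)
    # Step 1: fill the buffer with random chunks (at least one chunk).
    buffer = []
    while chunks and len(buffer) < max(buf_size, 1):
        buffer.extend(chunks.pop())
    # Steps 2 to 5: stream the remaining chunks through the buffer.
    while chunks:
        chunk = chunks.pop()                 # step 2: next random chunk
        for sample in chunk:
            i = rng.randrange(len(buffer))   # step 3: pick a random slot
            yield buffer[i]
            buffer[i] = sample               # step 4: refill the slot
    # Step 6: no chunks left, so shuffle the buffer and drain it.
    rng.shuffle(buffer)
    yield from buffer


dataset = list(range(100))
order = list(quasi_shuffle(dataset, buf_size=20, chunk_size=5, seed=1))
```

Every sample is yielded exactly once, but samples from one chunk enter the buffer back to back, which is why the buffer should be large relative to the chunk size.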

Note

Creating a new iterator, especially from a specific start position, is a costly operation. If possible, create one iterator and use it until it is exhausted.

Parameters:

reader – the reader to wrap
buf_size – size of the buffer; values less than 1 are interpreted as fractions of the dataset length; bigger values improve randomness, but use more memory
seed – random seed to use; defaults to len(reader) * buf_size * chunk_size
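The two interpretations of buf_size can be illustrated with a hypothetical helper. `resolve_buf_size` is not part of datadings, and both the rounding and the minimum of one sample are assumptions made for this sketch.

```python
def resolve_buf_size(buf_size, dataset_len):
    """Return the buffer size in samples (hypothetical helper)."""
    if buf_size < 1:
        # Fraction of the dataset length; rounding is an assumption.
        return max(1, round(buf_size * dataset_len))
    # Values of 1 or more are taken as an absolute number of samples.
    return int(buf_size)


default_size = resolve_buf_size(0.01, 50_000)   # 1% of 50,000 samples
absolute_size = resolve_buf_size(2048, 50_000)  # used as given
```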

find_index(key)[source]: Returns the index of the sample with the given key.

find_key(index)[source]: Returns the key of the sample with the given index.

get(index, yield_key=False, raw=False, copy=True)[source]

Returns sample at given index.

copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.

Parameters:

index – Index of the sample
yield_key – If True, returns (key, sample)
raw – If True, returns sample as msgpacked message
copy – if False, allow the reader to return data as memoryview objects instead of bytes

Returns:

Sample at index.

seed(seed)[source]

slice(start, stop=None, yield_key=False, raw=False, copy=True)[source]

Returns a generator of samples selected by the given slice.

copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.

Parameters:

start – start index of slice
stop – stop index of slice
yield_key – if True, yield (key, sample)
raw – if True, returns sample as msgpacked message
copy – if False, allow the reader to return data as memoryview objects instead of bytes

Returns:

Iterator of selected samples

class datadings.reader.augment.Range(reader, start=0, stop=None)[source]

Bases: Augment

Extract a range of samples from a given reader.

start and stop behave like the parameters of the range function.

Parameters:

reader – reader to sample from
start – start of range
stop – stop of range
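As a rough picture of these semantics, the following sketch applies range-style start and stop values to a plain list. `extract_range` is a hypothetical stand-in; treating stop=None as "up to the end" is an assumption based on the default value.

```python
def extract_range(samples, start=0, stop=None):
    """Return samples[i] for i in range(start, stop) (illustrative sketch)."""
    if stop is None:
        stop = len(samples)  # assumed meaning of the default
    return [samples[i] for i in range(start, stop)]


samples = [f"sample-{i}" for i in range(10)]
subset = extract_range(samples, 2, 5)  # stop is exclusive, as with range
tail = extract_range(samples, 8)       # no stop: run to the end
```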

find_index(key)[source]: Returns the index of the sample with the given key.

find_key(index)[source]: Returns the key of the sample with the given index.

get(index, yield_key=False, raw=False, copy=True)[source]

Returns sample at given index.

copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.

Parameters:

index – Index of the sample
yield_key – If True, returns (key, sample)
raw – If True, returns sample as msgpacked message
copy – if False, allow the reader to return data as memoryview objects instead of bytes

Returns:

Sample at index.

slice(start, stop=None, yield_key=False, raw=False, copy=True)[source]

Returns a generator of samples selected by the given slice.

copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.

Parameters:

start – start index of slice
stop – stop index of slice
yield_key – if True, yield (key, sample)
raw – if True, returns sample as msgpacked message
copy – if False, allow the reader to return data as memoryview objects instead of bytes

Returns:

Iterator of selected samples

class datadings.reader.augment.Repeater(reader, times)[source]

Bases: Augment

Repeat a Reader (datadings.reader.reader.Reader) a fixed number of times.

Note

find_index returns the first occurrence.
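A minimal sketch of why only the first occurrence can be reported: each key appears `times` times in the repeated index space, so a lookup has to pick one position. `RepeatedView` is a hypothetical stand-in built on a list of (key, sample) pairs and assumes that indexes map back to the wrapped data modulo its length.

```python
class RepeatedView:
    """Repeat a list of (key, sample) pairs a fixed number of times (sketch)."""

    def __init__(self, items, times):
        self.items = list(items)
        self.times = times

    def __len__(self):
        return len(self.items) * self.times

    def get(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Indexes past the first repetition map back modulo the length.
        return self.items[index % len(self.items)][1]

    def find_index(self, key):
        # Search the first repetition only, so the first occurrence wins.
        for i, (k, _) in enumerate(self.items):
            if k == key:
                return i
        raise KeyError(key)


view = RepeatedView([("k0", "a"), ("k1", "b")], times=3)
last_sample = view.get(5)          # same sample as index 1
first_hit = view.find_index("k1")  # indexes 3 and 5 hold it as well
```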

find_index(key)[source]: Returns the index of the sample with the given key.

find_key(index)[source]: Returns the key of the sample with the given index.

get(index, yield_key=False, raw=False, copy=True)[source]

Returns sample at given index.

copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.

Parameters:

index – Index of the sample
yield_key – If True, returns (key, sample)
raw – If True, returns sample as msgpacked message
copy – if False, allow the reader to return data as memoryview objects instead of bytes

Returns:

Sample at index.

slice(start, stop=None, yield_key=False, raw=False, copy=True)[source]

Returns a generator of samples selected by the given slice.

copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.

Parameters:

start – start index of slice
stop – stop index of slice
yield_key – if True, yield (key, sample)
raw – if True, returns sample as msgpacked message
copy – if False, allow the reader to return data as memoryview objects instead of bytes

Returns:

Iterator of selected samples

class datadings.reader.augment.Shuffler(reader, seed=None)[source]

Bases: Augment

Iterate over a Reader (datadings.reader.reader.Reader) in random order. If no seed is given, the length of the reader is used for reproducibility. Creating an iterator increments the seed by 1. Use Shuffler.seed() to set the desired seed instead.
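The seeding rules can be sketched with the standard random module. `ShuffledOrder` is a hypothetical stand-in that shuffles indexes rather than samples; whether the real seed is incremented before or after it is used is an assumption here.

```python
import random


class ShuffledOrder:
    """Yield the indexes 0..n-1 in seeded random order (illustrative sketch)."""

    def __init__(self, n, seed=None):
        self.n = n
        # Without an explicit seed, fall back to the dataset length.
        self._seed = n if seed is None else seed

    def seed(self, seed):
        self._seed = seed

    def __iter__(self):
        rng = random.Random(self._seed)
        self._seed += 1  # the next iterator uses a different seed
        order = list(range(self.n))
        rng.shuffle(order)
        return iter(order)


shuffled = ShuffledOrder(8)
first = list(shuffled)   # uses seed 8, the dataset length
second = list(shuffled)  # uses seed 9
shuffled.seed(8)
replay = list(shuffled)  # same order as the first pass
```

Resetting the seed before creating an iterator is what makes a particular pass reproducible.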

Warning

Shuffler only implements iteration. Random access methods find_index, find_key, get, and slice raise NotImplementedError.

Parameters:

reader – The reader to augment.
seed – optional random seed; defaults to len(reader)

Warning

Augments are not thread safe!

find_index(key)[source]: Returns the index of the sample with the given key.

find_key(index)[source]: Returns the key of the sample with the given index.

get(index, yield_key=False, raw=False, copy=True)[source]

Returns sample at given index.

copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.

Parameters:

index – Index of the sample
yield_key – If True, returns (key, sample)
raw – If True, returns sample as msgpacked message
copy – if False, allow the reader to return data as memoryview objects instead of bytes

Returns:

Sample at index.

seed(seed)[source]

slice(start, stop=None, yield_key=False, raw=False, copy=True)[source]

Returns a generator of samples selected by the given slice.

copy=False allows the reader to use zero-copy mechanisms. Data may be returned as memoryview objects rather than bytes. This can improve performance, but also drastically increase memory consumption, since one sample can keep the whole slice in memory.

Parameters:

start – start index of slice
stop – stop index of slice
yield_key – if True, yield (key, sample)
raw – if True, returns sample as msgpacked message
copy – if False, allow the reader to return data as memoryview objects instead of bytes

Returns:

Iterator of selected samples