Data Generators

Rambutan uses two data generators: the training generator and the validation generator. Both take in regions of the genome, represented as one-hot encoded nucleotide sequence and bit-encoded DNaseI sequence, and output a random sample of pairs of regions for the Rambutan model. Minibatches are created on the fly from 1D genome data because the nucleotide-level input for all pairs of regions in the genome cannot fit in memory. The major difference between the two is that the training generator randomly produces minibatches over all chromosomes it is fed, whereas the validation generator systematically yields every positive sample exactly once, along with an equal number of negative samples. This allows an entire chromosome to be used as a validation set without double counting regions.
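The contrast between the two sampling strategies can be sketched in a few lines of plain Python. This is only an illustration, not Rambutan's implementation: the function names `training_batches` and `validation_batches`, the tuple representation of a pair of regions, and the even chance of drawing a positive or a negative during training are all assumptions made for the example.

```python
import random


def training_batches(positives, negatives, batch_size, n_batches, seed=0):
    """Yield random minibatches of (pair, label); pairs may repeat across batches.

    Hypothetical helper: the 50/50 label draw is an assumption for illustration.
    """
    rng = random.Random(seed)
    for _ in range(n_batches):
        batch = []
        for _ in range(batch_size):
            if rng.random() < 0.5:
                batch.append((rng.choice(positives), 1))
            else:
                batch.append((rng.choice(negatives), 0))
        yield batch


def validation_batches(positives, negatives, batch_size, seed=0):
    """Yield every positive pair exactly once plus an equal number of negatives.

    Hypothetical helper: requires len(negatives) >= len(positives), otherwise
    random.sample raises ValueError.
    """
    rng = random.Random(seed)
    sampled_negatives = rng.sample(negatives, len(positives))
    examples = [(pair, 1) for pair in positives]
    examples += [(pair, 0) for pair in sampled_negatives]
    # Walk through the fixed example list once, so nothing is counted twice.
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


# Toy pairs of regions: (chromosome, index of first region, index of second region).
positives = [("chr21", i, i + 5) for i in range(6)]
negatives = [("chr21", i, i + 9) for i in range(20)]

train = list(training_batches(positives, negatives, batch_size=4, n_batches=3))
valid = list(validation_batches(positives, negatives, batch_size=4))
```

In this sketch the training generator can run for any number of batches, while the validation generator stops after a single pass over its positives and their matched negatives.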

API Reference

The data generators are defined in the rambutan.io module. They produce the examples used for training a Rambutan model.

class rambutan.io.TrainingGenerator

Generator iterator that collects batches from a generator.

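Parameters:

data : generator

The generator from which batches are collected

batch_size : int

The number of examples in each batch

last_batch_handle : ‘pad’, ‘discard’ or ‘roll_over’

How to handle the last batch when the examples do not divide evenly into batches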
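The documentation above does not spell out what each `last_batch_handle` option does. The same option names appear in MXNet's data iterators, so the sketch below assumes similar semantics: ‘pad’ fills the short final batch with examples from the start, ‘discard’ drops it, and ‘roll_over’ carries it into the next epoch. The function `one_epoch` is a hypothetical helper written for illustration, not part of Rambutan.

```python
def one_epoch(examples, batch_size, last_batch_handle="pad", carry=None):
    """Split one epoch into batches; return (batches, leftover for next epoch).

    Hypothetical helper illustrating the assumed meaning of each option.
    """
    # Any examples rolled over from the previous epoch go first.
    pool = list(carry or []) + list(examples)
    n_full = len(pool) // batch_size
    batches = [pool[i * batch_size:(i + 1) * batch_size] for i in range(n_full)]
    tail = pool[n_full * batch_size:]

    if not tail:
        return batches, []
    if last_batch_handle == "pad":
        # Complete the short batch by reusing examples from the start of the pool.
        missing = batch_size - len(tail)
        padding = [pool[i % len(pool)] for i in range(missing)]
        batches.append(tail + padding)
        return batches, []
    if last_batch_handle == "discard":
        # Drop the incomplete batch entirely.
        return batches, []
    if last_batch_handle == "roll_over":
        # Keep the incomplete batch back and hand it to the next epoch.
        return batches, tail
    raise ValueError("unknown last_batch_handle: %r" % (last_batch_handle,))


examples = list(range(10))
padded, _ = one_epoch(examples, 4, "pad")
discarded, _ = one_epoch(examples, 4, "discard")
rolled, leftover = one_epoch(examples, 4, "roll_over")
next_epoch, next_leftover = one_epoch(examples, 4, "roll_over", carry=leftover)
```

With ten examples and a batch size of four, ‘pad’ yields three full batches, ‘discard’ yields two, and ‘roll_over’ yields two while saving the last two examples for the following epoch.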

provide_data

The name and shape of data provided by this iterator

provide_label

The name and shape of label provided by this iterator

class rambutan.io.ValidationGenerator

Generator iterator that collects batches from a generator, showing a full subset.

For now, use it on only one chromosome.

provide_data

The name and shape of data provided by this iterator

provide_label

The name and shape of label provided by this iterator