Rambutan

This file contains the Rambutan model. It is composed of two functions, fit and predict.

The predict method will use a pre-trained version of Rambutan and predict all pairs of regions in a given sequence file. It takes in nucleotide sequence and DNaseI sensitivity either as bit encoded numpy arrays, or as FastA and bedgraph files respectively. Note that the DNaseI sensitivity should be the fold change signal, not the log10 p-values or raw values. If FastA or bedgraph files are passed in, they are converted internally to the bit encoding. Here is an example of it’s usage:

from rambutan import Rambutan

model = Rambutan('rambutan', 25, verbose=True)
y_pred = model.predict('chr21.fa', 'E003-DNase.chr21.fc.signal.bedgraph', ctxs=[0, 1, 2, 3])

This assumes that the directory that you are running the code has the Rambutan model and parameter files (for the 25th iterator), the FastA file for chr21, and the appropriate bedgraph file. It will also use gpus 0, 1, 2, and 3 to make the predictions. If you have fewer gpus, pass in fewer numbers.

The fit function will accept training (and optionally validation) data either as FastA/bedgraph/HiC files, or as numpy array/numpy array/HiC files. It will automatically convert the FastA and bedgraph files to the bit encoding internally. However, this main require a great deal of memory, so it is recommended that you preprocess your data to numpy arrays, store them to disk, and then feed them in as memory maps so that the entire genome does not need to be stored in RAM for Rambutan to be trained.

Here is an example of how to pre-encode your data:

from rambutan.utils import fasta_to_dense
from rambutan.utils import bedgraph_to_dense
from rambutan.utils import encode_dnase

# Process the sequence files first
sequence_files = ['chr{}.fa'.format(i) for i in range(1, 23)]
for i, sequence_file in enumerate(sequence_files):
        sequence = fasta_to_dense(sequence_file, verbose=True)
        numpy.save('chr{}.ohe.npy'.format(i), sequence)

# Process the DNaseI files next
dnase_files = ['E003-DNase.chr{}.fc.signal.bedgraph'.format(i) for i in range(1, 23)]
for i, dnase_file in enumerate(dnase_files):
        dnase = bedgraph_to_dense(dnase_file, verbose=True)
        dnase_encoded = encode_dnase(dnase, verbose=True)
        numpy.save('chr{}.dnase.npy'.format(i), dnase_encoded)

After running this you will now have the nucleotide sequence and DNaseI inputs bit encoded as used in the Rambutan model. You can now load them up as memory maps, which essentially are pointers to where the data lives on disk. This means that the entire array doesn’t get loaded into memory, but only subsections are loaded as needed. This will slow down training time, but can be mitigated if a solid state drive is used. Here is an example of fitting using memory maps:

train_contacts = 'GM12878_combined.res1000.contacts.txt.gz'
train_sequence = numpy.array([ numpy.load('chr{}.ohe.npy'.format(i), mmap_mode='r') for i in chromosomes ])
train_dnases   = numpy.array([ numpy.load('chr{}.dnase.npy'.format(i), mmap_mode='r') for i in chromosomes ])

model = Rambutan()
model.fit(train_sequence, train_dnase, train_contacts, ctxs=[0, 1, 2, 3])

This will fit the Rambutan model to the given data using four GPUs. Fewer or more GPUs can be used as long as they are in the same machine. Distributed learning is not yet available for training Rambutan models.

API Reference

class rambutan.rambutan.Rambutan(name='rambutan', iteration=None, model=None, learning_rate=0.01, num_epoch=25, epoch_size=500, wd=0.0, optimizer='adam', batch_size=1024, min_dist=50000, max_dist=1000000, use_seq=True, use_dnase=True, use_dist=True, verbose=True)[source]

Rambutan: a predictor of mid-range DNA-DNA contacts.

This serves as a wrapper for all functionality involving the use of Rambutan. There are two main functions to use, fit and predict. Fit involves taking in nucleotide sequence, DNaseI sensitivity, and a contact map, and training the model. Predict involves taking in nucleotide sequence and DNaseI sensitivity and predicting significant contacts.

Note: Due to a limitation of mxnets part, you cannot fit and predict in the same program. You must fit the model and save the parameters during training, and then load the pre-fit model and make predictions.

Parameters:
name : str, optional

The name of the model, necessary for saving or loading parameters. Default is ‘rambutan’.

iteration : int or None, optional

The iteration of training to load model parameters from, if using Rambutan in predict mode. Default is None.

model : mxnet.symbol or None

An alternate neural network can be passed in if one wishes to train that using the same framework instead of the original Rambutan model.

learning_rate : float, optional

The learning rate for the optimizer. Default is 0.01.

num_epoch : int, optional

The number of epochs to train the model for. Default is 25.

epoch_size : int, optional

The number of batches which comprise an ‘epoch’. Default is 500.

wd : float, optional

The weight decay. This is equivalent to L2 regularization. Default is 0.0.

optimizer : str, optional

The optimizer to use for training. Default is ‘adam’.

batch_size : int, optional

The number of samples to use in each batch. Default is 1024.

min_dist : int, optional

The minimum distance to consider contacts for. Default is 50kb.

max_dist : int, optional

The maximum distance to consider contacts for. Default is 1mb.

use_seq : bool, optional

Whether to use nucleotide sequence as an input to the model in the training step. Default is True.

use_dnase : bool, optional

Whether to use DNaseI sensitivity as an input to the model in the training step. Default is True.

use_dist : bool, optional

Whether to use genomic distance as an input to the model in the training step. Default is True.

verbose : bool, optional

Whether to output information during training and prediction. Default is True.

fit(sequence, dnase, contacts, regions=None, validation_contacts=None, training_chromosome=None, validation_chromosome=None, ctxs=[0], eval_metric='acc', symbol=None, n_jobs=1)[source]

Fit the model to sequence, DNaseI, and Hi-C data.

You can fit the Rambutan model to new data. One must pass in sequence data, DNaseI data, and Hi-C contact maps. The sequence data can come either in the form of FastA files or one-hot encoded numpy arrays. The DNaseI data can likewise come as either bedgraph files or numpy arrays. The Hi-C data must come in the traditional 7 column format. Validation data can optionally be passed in to report a validation set error during the training process. NOTE: Regardless of if they are used or not, all chromosomes should be passed in to the sequence and dnase parameters. The contacts specified in contacts will dictate which are used. This is to make the internals easier.

Parameters for training such as the number of epochs and batches are set in the initial constructor, following with the sklearn format for estimators.

Parameters:
sequence : numpy.ndarray, shape (n, 4) or str

The nucleotide sequence. Either a one hot encoded matrix of nucleotides with n being the size of the chromosome, or a file name for a fasta file.

dnase : numpy.ndarray, shape (n, 8) or str

The DNaseI fold change sensitivity. Either an encoded matrix in the manner described in the manuscript or the file name of a bedgraph file.

regions : numpy.ndarray or None, optional

The regions of interest to look at. All other regions will be ignored. If set to none, the regions of interest are defined to be 1kb bins for which all nucleotides are mappable, i.e. where there are no n or N symbols in the fasta file. Default is None.

ctxs: list, optional

The contexts of the gpus to use for prediction. Currently prediction is only supported on gpus and not cpus due to the time it would take for prediction. For example, if you wanted to use three gpus of index 0 1 and 3 (because 2 is busy doing something else) you would just pass in ctxs=[0, 1, 3] and the prediction task will be naturally parallelized across your 3 gpus with a linear speedup.

Returns:
y : numpy.ndarray, shape=(m, m)

A matrix of predictions of shape (m, m) where m is the number of 1kb loci in the chromosome. The predictions will reside in the upper triangle of the matrix since predictions are symmetric.

predict(sequence, dnase, regions=None, ctxs=[0], sparse=False)[source]

Make predictions and return the matrix of probabilities.

Rambutan will make a prediction for each pair of genomic loci defined in `regions’ which fall between `min_dist’ and `max_dist’. Inputs can either be appropriately encoded sequence and dnase files, or fasta files and bedgraph files for the nucleotide sequence and DNaseI sensitivity respectively. Note: fasta files and bedgraph files must be made up of a single chromosome, not one entry per chromosome.

Parameters:
sequence : numpy.ndarray, shape (n, 4) or str

The nucleotide sequence. Either a one hot encoded matrix of nucleotides with n being the size of the chromosome, or a file name for a fasta file.

dnase : numpy.ndarray, shape (n, 8) or str

The DNaseI fold change sensitivity. Either an encoded matrix in the manner described in the manuscript or the file name of a bedgraph file.

regions : numpy.ndarray or None, optional

The regions of interest to look at. All other regions will be ignored. If set to none, the regions of interest are defined to be 1kb bins for which all nucleotides are mappable, i.e. where there are no n or N symbols in the fasta file. Default is None.

ctxs : list, optional

The contexts of the gpus to use for prediction. Currently prediction is only supported on gpus and not cpus due to the time it would take for prediction. For example, if you wanted to use three gpus of index 0 1 and 3 (because 2 is busy doing something else) you would just pass in ctxs=[0, 1, 3] and the prediction task will be naturally parallelized across your 3 gpus with a linear speedup.

sparse : bool, optional

Whether to return three arrays, the rows, columns, and values, or the full dense matrix. Sparse is useful for large matrices.

Returns:
y : numpy.ndarray, shape=(m, m)

A matrix of predictions of shape (m, m) where m is the number of 1kb loci in the chromosome. The predictions will reside in the upper triangle of the matrix since predictions are symmetric.

rambutan.rambutan.extract_dnase(filename, verbose=False)[source]

Extract a DNaseI file and encode it.

This function will read in a bedgraph format file and convert it to the one-hot encoded numpy array used internally. If a one-hot encoded file is passed in, it is simple returned. This function is a convenient wrapper for joblib to parallelize the unzipping portion.

Parameters:
filename : str or numpy.ndarray

The name of the bedgraph file to open or the one-hot encoded sequence.

verbose: bool, optional

Whether to report the status while extracting sequence. This does not look good when done in parallel, so it is suggested it is set to false in that case.

Returns:
sequence : numpy.ndarray, shape=(n, 8)

The one-hot encoded DNaseI sequence.

rambutan.rambutan.extract_sequence(filename, verbose=False)[source]

Extract a nucleotide sequence from a file and encode it.

This function will read in a FastA formatted DNA file and convert it to be a one-hot encoded numpy array for internal use. If a one-hot encoded file is passed in, it is simply returned. This function is a convenient wrapper for joblib to parallelize the unzipping portion.

Parameters:
filename : str or numpy.ndarray

The name of the fasta file to open or the one-hot encoded sequence.

verbose: bool, optional

Whether to report the status while extracting sequence. This does not look good when done in parallel, so it is suggested it is set to false in that case.

Returns:
sequence : numpy.ndarray, shape=(n, 4)

The one-hot encoded DNA sequence.