Utilities

These functions are provided to help users work with the core Rambutan model. They consist mostly of data converters. For example, one can convert a FASTA file to a one-hot encoded numpy array using the fasta_to_dense function, or calculate the insulation score of a predicted or real contact map using the insulation_score function.

API Reference

This file defines useful utility functions.

rambutan.utils.bedgraph_to_dense()

Read a bedgraph file and return a dense numpy array.

This will read in an arbitrary bedgraph file and turn it into a dense array for faster indexing in the future.

Parameters:
filename : str

The name of the bedgraph file to use.

Returns:
array : numpy.ndarray, shape=(n,)

A dense array of the unpacked values.
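
The unpacking step can be illustrated with a minimal sketch. This is not the Rambutan implementation: the helper name bedgraph_lines_to_dense is hypothetical, it takes lines of text rather than a filename, and it assumes a single chromosome in the standard four-column bedgraph layout (chrom, start, end, value) with zero-based, half-open intervals.

```python
import numpy as np

def bedgraph_lines_to_dense(lines):
    """Unpack bedgraph intervals into a dense vector (hypothetical helper)."""
    intervals = []
    for line in lines:
        fields = line.split()
        # Skip blank lines and optional "track"/"browser"/comment header lines.
        if not fields or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        intervals.append((int(fields[1]), int(fields[2]), float(fields[3])))

    # The dense array must reach the largest end coordinate seen.
    n = max((end for _, end, _ in intervals), default=0)
    dense = np.zeros(n, dtype=np.float64)
    for start, end, value in intervals:
        # Intervals are half-open: positions start .. end - 1 get the value.
        dense[start:end] = value
    return dense

example = ["chr1\t0\t3\t1.5", "chr1\t5\t7\t2.0"]
dense = bedgraph_lines_to_dense(example)
```

Positions not covered by any interval stay at zero in this sketch; whether the library uses zero or another fill value is not stated here.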

rambutan.utils.benjamini_hochberg()

Run the Benjamini-Hochberg procedure on a vector of already sorted p-values.

Runs the procedure on a vector of p-values, and returns the q-values for each point.

Parameters:
p_values : numpy.ndarray

A vector of sorted p-values.

n : int

The number of tests which have been run.

Returns:
q_values : numpy.ndarray

The q-values for each point.
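
For readers unfamiliar with the procedure, the following is a minimal sketch of a standard Benjamini-Hochberg computation, not necessarily the code used in Rambutan. It assumes the p-values are sorted in ascending order, and the name bh_q_values is hypothetical. Passing n separately allows the number of tests to exceed the number of p-values supplied, for example when only the significant contacts were kept.

```python
import numpy as np

def bh_q_values(p_values, n):
    """Benjamini-Hochberg q-values for ascending p-values (hypothetical helper)."""
    p = np.asarray(p_values, dtype=np.float64)
    ranks = np.arange(1, len(p) + 1)
    # Scale each p-value by (number of tests) / (its rank).
    scaled = p * n / ranks
    # Enforce monotonicity: each q-value is the minimum of the scaled
    # values at its own rank and at every larger rank.
    q = np.minimum.accumulate(scaled[::-1])[::-1]
    # q-values are capped at 1.
    return np.minimum(q, 1.0)

p = np.array([0.01, 0.02, 0.03, 0.5])
q = bh_q_values(p, n=4)
```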

rambutan.utils.count_band_regions()

Calculate the number of regions in the band.

This will iterate over all region pairs and identify the number of region pairs within the given band. This corresponds to the number of tests.

Parameters:
regions : numpy.ndarray

The mappable regions in a chromosome.

min_distance : int, optional

The minimum distance between the two regions of a pair for the pair to be counted.

max_distance : int, optional

The maximum distance between the two regions of a pair for the pair to be counted.

Returns:
n : int

The number of region pairs in the band.
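
The counting described above can be sketched without an explicit double loop. This is an illustration under assumptions, not the library's code: the name count_pairs_in_band is hypothetical, regions are taken to be midpoint coordinates, and both distance bounds are treated as inclusive.

```python
import numpy as np

def count_pairs_in_band(regions, min_distance, max_distance):
    """Count region pairs whose separation lies in the band (hypothetical helper)."""
    regions = np.sort(np.asarray(regions))
    # For each region r, find the index range of partners in
    # [r + min_distance, r + max_distance] among the sorted coordinates.
    lo = np.searchsorted(regions, regions + min_distance, side="left")
    hi = np.searchsorted(regions, regions + max_distance, side="right")
    # Never pair a region with itself or with an earlier region.
    lo = np.maximum(lo, np.arange(len(regions)) + 1)
    return int(np.maximum(hi - lo, 0).sum())

regions = np.array([0, 1000, 2000, 3000, 10000])
n_pairs = count_pairs_in_band(regions, min_distance=1000, max_distance=2000)
```

Each unordered pair is counted once, which matches the interpretation of the result as a number of tests.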

rambutan.utils.downsample()

Downsample a 1kb resolution matrix to a 5kb resolution matrix.

For each cell in the 5kb resolution matrix, take the maximum probability over the cells in the 5x5 grid at the 1kb resolution centered at this point. For example, the cell in the 5kb resolution matrix at 2500,2500 will take the maximum probability of the cells at 500,500, 500,1500, 500,2500… 4500,500, 4500,1500… etc. This is equivalent to treating the cells as being strongly correlated instead of independent of each other.

Parameters:
x : numpy.ndarray, shape=(n, n)

The 1kb resolution matrix to downsample.

regions : numpy.ndarray, shape=(m,)

The relevant regions to look at.

min_dist : int, optional

The minimum distance two regions have to be from each other to be considered. Default is 50kb.

max_dist : int, optional

The maximum distance two regions can be from each other to be considered. Default is 1Mb.

Returns:
y : numpy.ndarray, shape=(n/5, n/5)

The 5kb resolution matrix produced from the 1kb matrix.
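
The 5x5 maximum described above amounts to max pooling. The sketch below shows only that pooling step using a reshape; it deliberately ignores the regions, min_dist, and max_dist arguments, and the name max_pool_blocks is hypothetical rather than part of Rambutan.

```python
import numpy as np

def max_pool_blocks(x, factor=5):
    """Take the maximum over non-overlapping factor x factor blocks (hypothetical helper)."""
    # Trim so that the side length is a multiple of the block size.
    n = x.shape[0] // factor * factor
    # Axes become (block row, row within block, block column, column within block).
    blocks = x[:n, :n].reshape(n // factor, factor, n // factor, factor)
    # Reduce over the two within-block axes.
    return blocks.max(axis=(1, 3))

x = np.arange(100, dtype=np.float64).reshape(10, 10)
y = max_pool_blocks(x)
```

Here a 10x10 input yields a 2x2 output, consistent with the (n/5, n/5) return shape.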

rambutan.utils.encode_dnase()

Take in an array of real DNase values and binary encode the log.

This transforms the fold change value to the log fold value and then encodes this value as a binarization of the rounded log value. This is done to balance variance between enrichments and depletions.

For example, a log fold enrichment of 2.9 would have bits 0, 1, 2, and 3 active, whereas a value of -2.2 would have bits 0, -1, and -2 active. This encodes DNase values between -2 and 5, so 8 total bits for each position.

Parameters:
dnase : numpy.ndarray, shape=(n,)

The DNase fold change values read from a bedgraph or bigwig file, ranging from near 0 upward.

Returns:
encoded_dnase : numpy.ndarray, shape=(n, 8)

The encoded log fold change values.
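
The bit pattern in the example above can be reproduced with a short sketch. Several details are assumptions: the helper encode_log_fold is hypothetical, it starts from values that are already log-transformed (the base of the logarithm is not stated here), the columns are ordered from bit -2 to bit 5, and values outside that range are clipped.

```python
import numpy as np

def encode_log_fold(log_values, low=-2, high=5):
    """Binary-encode rounded log fold values (hypothetical helper)."""
    # Round to the nearest integer and clip to the representable range.
    rounded = np.clip(np.rint(np.asarray(log_values)), low, high).astype(int)
    bits = np.arange(low, high + 1)   # column k stands for bit value bits[k]
    r = rounded[:, None]
    # Enrichment: bits 0 .. r are active. Depletion: bits r .. 0 are active.
    positive = (bits >= 0) & (bits <= r)
    negative = (bits <= 0) & (bits >= r)
    return (positive | negative).astype(np.int8)

encoded = encode_log_fold([2.9, -2.2])
```

With this column order, the first row activates bits 0 through 3 and the second row activates bits -2 through 0, matching the two cases in the description.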

rambutan.utils.extract_contacts()

Extract the statistically significant contacts.

Extract all contacts that have a p-value <= alpha and are within a certain band in the chromosome. This is useful for drastically reducing the size of the dataset. The columns must be named as follows:

chr1 fragmentMid1 chr2 fragmentMid2 p-value q-value

Parameters:
contacts : pandas.DataFrame

The data formatted properly in a dataframe.

alpha : double, optional

The p-value threshold to filter by. Default is 0.01.

min_distance : int, optional

The minimum distance between the two fragments of a contact for the contact to be kept.

max_distance : int, optional

The maximum distance between the two fragments of a contact for the contact to be kept.

Returns:
contacts : pandas.DataFrame

The filtered contacts within a certain threshold.

n_region_pairs : int

The number of region pairs in the band.
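
The filtering step can be sketched with ordinary pandas boolean indexing on the column names listed above. This is an illustration only: the name filter_contacts is hypothetical, the distance bounds are required arguments because no defaults are stated for this function, restricting to same-chromosome pairs is an assumption, and the sketch does not compute the number of region pairs.

```python
import pandas as pd

def filter_contacts(contacts, alpha, min_distance, max_distance):
    """Keep significant contacts whose separation lies in the band (hypothetical helper)."""
    distance = (contacts["fragmentMid2"] - contacts["fragmentMid1"]).abs()
    keep = (
        (contacts["chr1"] == contacts["chr2"])   # assumption: intrachromosomal only
        & (contacts["p-value"] <= alpha)
        & (distance >= min_distance)
        & (distance <= max_distance)
    )
    return contacts[keep]

contacts = pd.DataFrame({
    "chr1": ["chr1", "chr1", "chr1", "chr1"],
    "fragmentMid1": [100000, 100000, 100000, 100000],
    "chr2": ["chr1", "chr1", "chr1", "chr2"],
    "fragmentMid2": [300000, 300000, 110000, 300000],
    "p-value": [0.001, 0.5, 0.001, 0.001],
    "q-value": [0.01, 0.9, 0.01, 0.01],
})
filtered = filter_contacts(contacts, alpha=0.01, min_distance=50000, max_distance=1000000)
```

Only the first row survives: the second fails the p-value threshold, the third is too close, and the fourth spans two chromosomes.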

rambutan.utils.extract_regions()

Extract the mappable regions for predictions.

The mappable regions in this case are defined by those regions which have no unmappable (‘N’) nucleotides in the FASTA file.

Parameters:
sequence : numpy.ndarray, shape=(n, 4)

The one hot encoded sequence numpy array.

Returns:
regions : numpy.ndarray, shape=(m,)

The set of mappable regions (midpoints) from this file.
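
One way to picture the mappability test is sketched below. It is a guess at the general idea, not the library's method: the name mappable_midpoints and the window parameter are hypothetical, windows are assumed to be non-overlapping, and an 'N' is assumed to appear as a row with no active column in the one-hot array.

```python
import numpy as np

def mappable_midpoints(sequence, window=1000):
    """Return midpoints of windows that contain no 'N' (hypothetical helper)."""
    n_windows = sequence.shape[0] // window
    trimmed = sequence[: n_windows * window]
    # A row that sums to zero has no active base, taken here to mean 'N'.
    is_n = trimmed.sum(axis=1) == 0
    has_n = is_n.reshape(n_windows, window).any(axis=1)
    starts = np.arange(n_windows) * window
    return starts[~has_n] + window // 2

# Toy sequence of 12 positions with an 'N' at position 5.
sequence = np.zeros((12, 4), dtype=np.int8)
sequence[np.arange(12), np.arange(12) % 4] = 1
sequence[5] = 0
regions = mappable_midpoints(sequence, window=4)
```

The middle window is dropped because it contains the unmappable position.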

rambutan.utils.fasta_to_dense()

Translate the sequence from a file to a one hot encoded dense array.

Parameters:
filename : str

The name of the fasta file to use.

Returns:
array : numpy.ndarray, shape=(n, 4)

A dense array one-hot encoded for a nucleotide. ‘N’ does not take a value.
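
A minimal sketch of the encoding follows. The helper names are hypothetical, the column order A, C, G, T is an assumption, and the sketch works on lines of text for a single-record file rather than on a filename.

```python
import numpy as np

def read_fasta_sequence(lines):
    """Join the sequence lines of a single-record FASTA (hypothetical helper)."""
    return "".join(line.strip() for line in lines if not line.startswith(">"))

def sequence_to_one_hot(sequence):
    """One-hot encode a nucleotide string (hypothetical helper)."""
    lookup = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed column order
    array = np.zeros((len(sequence), 4), dtype=np.int8)
    for i, base in enumerate(sequence.upper()):
        column = lookup.get(base)
        # 'N' and any other symbol leave the row as all zeros.
        if column is not None:
            array[i, column] = 1
    return array

lines = [">chrTest", "ACGT", "NA"]
one_hot = sequence_to_one_hot(read_fasta_sequence(lines))
```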

rambutan.utils.insulation_score()

Calculate the insulation score for a given matrix of any resolution.

This will slide a size*size square along the diagonal of the matrix, summing the values in the upper triangle of the matrix. If a region has no contacts it will not have an insulation score, which handles both edge cases.

Parameters:
x : numpy.ndarray, shape=(n, n)

The matrix to calculate the insulation score on. Either Rambutan predictions or contacts from a Hi-C map.

size : int, optional

The size of the square, default is 200, which is 1Mb on a 5kb resolution map.

Returns:
insulation : numpy.ndarray, shape=(n,)
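
The sliding-square computation can be sketched as follows. This is one possible reading of the description, not the library's code: the name insulation_sketch is hypothetical, the square at position i is taken to cover contacts between the bins just upstream and just downstream of i (which lie in the upper triangle), and positions with no contacts are marked with NaN.

```python
import numpy as np

def insulation_sketch(x, size=200):
    """Sum contacts crossing each position within a size x size square (hypothetical helper)."""
    n = x.shape[0]
    scores = np.full(n, np.nan)
    for i in range(n):
        lo = max(0, i - size)
        hi = min(n, i + size + 1)
        # Rows are bins upstream of i, columns are bins downstream of i.
        total = x[lo:i, i + 1:hi].sum()
        # Positions with no contacts keep NaN, which also covers both
        # ends of the matrix, where the square is empty.
        if total > 0:
            scores[i] = total
    return scores

x = np.ones((5, 5))
scores = insulation_sketch(x, size=2)
```

In this toy example the first and last positions have no score, and the interior positions count 2, 4, and 2 contacts respectively.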