=== SECOMO API ===

Convolutional restricted Boltzmann machine (cRBM)¶

CRBM contains the main functionality of SECOMO for training, evaluating and investigating models.

`CRBM.fit`(training_data[, test_data])	Fits the cRBM to the provided training sequences.
`CRBM.freeEnergy`(data)	Free energy determined on the given dataset.
`CRBM.motifHitProbs`(data)	Motif match probabilities.
`CRBM.getPFMs`()	Returns the weight matrices converted to position frequency matrices.
`CRBM.saveModel`(filename)	Save the model parameters and additional hyper-parameters.
`CRBM.loadModel`(filename)	Load a model from a given pickle file.

class secomo.CRBM(num_motifs, motif_length, epochs=100, input_dims=4, doublestranded=True, batchsize=20, learning_rate=0.1, momentum=0.95, pooling=1, cd_k=5, rho=0.01, lambda_rate=0.1)[source]¶

CRBM class.

The class CRBM implements a convolutional restricted Boltzmann machine (cRBM) that extracts redundant DNA sequence features from a provided set of sequences. The trained model can subsequently be used to study the sequence content of (e.g. regulatory) sequences, for example by visualizing the features as sequence logos or by clustering the sequences based on their sequence content.

num_motifs : int: Number of motifs.
motif_length : int: Motif length.
epochs : int: Number of epochs to train (Default: 100).
input_dims : int: Input dimensions aka alphabet size (Default: 4 for DNA).
doublestranded : bool: Single strand or both strands. If set to True, both strands are scanned. (Default: True).
batchsize : int: Batch size (Default: 20).
learning_rate : float: Learning rate (Default: 0.1).
momentum : float: Momentum term (Default: 0.95).
pooling : int: Pooling factor (not relevant for cRBM, but for future work) (Default: 1).
cd_k : int: Number of Gibbs sampling iterations in each persistent contrastive divergence step (Default: 5).
rho : float: Target frequency of motif occurrences (Default: 0.01).
lambda_rate : float: Sparsity enforcement aka penalty term (Default: 0.1).
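
The defaults listed above can be collected in a plain dictionary and overridden selectively before the model is constructed. The following is a minimal sketch: the helper crbm_kwargs is hypothetical and not part of SECOMO, and the constructor call is shown only as a comment so that the snippet runs without the package installed.

```python
# Defaults copied from the documented CRBM signature.
CRBM_DEFAULTS = {
    "epochs": 100,
    "input_dims": 4,
    "doublestranded": True,
    "batchsize": 20,
    "learning_rate": 0.1,
    "momentum": 0.95,
    "pooling": 1,
    "cd_k": 5,
    "rho": 0.01,
    "lambda_rate": 0.1,
}


def crbm_kwargs(num_motifs, motif_length, **overrides):
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(CRBM_DEFAULTS)
    if unknown:
        raise TypeError("unknown CRBM parameter(s): %s" % sorted(unknown))
    kwargs = dict(CRBM_DEFAULTS, num_motifs=num_motifs, motif_length=motif_length)
    kwargs.update(overrides)
    return kwargs


# The values 10 and 15 are arbitrary illustration values.
kwargs = crbm_kwargs(10, 15, epochs=50, rho=0.05)
# With SECOMO installed, the model would then be built as:
#   from secomo import CRBM
#   model = CRBM(**kwargs)
```

Rejecting unknown keys early catches misspelled hyper-parameter names before a long training run starts.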

fit(training_data, test_data=None)[source]¶

Fits the cRBM to the provided training sequences.

training_data : numpy-array: 4D numpy array representing the training sequences in one-hot encoding. See secomo.sequences.seqToOneHot().
test_data : numpy-array: 4D numpy array representing the validation sequences in one-hot encoding. If no test_data is provided, the training progress is reported on the training set itself. See secomo.sequences.seqToOneHot().
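
To make the expected input more concrete, the sketch below one-hot encodes DNA strings into a nested 4D structure. The axis order (sequence, 1, nucleotide, position) and the A, C, G, T row order are assumptions made for illustration; the layout produced by seqToOneHot() is authoritative. Plain lists keep the sketch dependency-free, whereas fit() expects numpy arrays.

```python
ALPHABET = "ACGT"  # assumed row order


def one_hot(seqs):
    """Encode equal-length DNA strings as nested lists of shape (N, 1, 4, L)."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    encoded = []
    for seq in seqs:
        # One row per nucleotide; letters outside ACGT (e.g. N) give an
        # all-zero column in this sketch.
        rows = [[1 if base == letter else 0 for base in seq.upper()]
                for letter in ALPHABET]
        encoded.append([rows])  # singleton second axis
    return encoded


data = one_hot(["ACGT", "TTAN"])
shape = (len(data), len(data[0]), len(data[0][0]), len(data[0][0][0]))
print(shape)  # (2, 1, 4, 4)
```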

freeEnergy(data)[source]¶

Free energy determined on the given dataset.

data : numpy-array: 4D numpy array representing a DNA sequence in one-hot encoding. See secomo.sequences.seqToOneHot().
returns : numpy-array: Free energy per sequence.
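
For intuition, the free energy of a standard (non-convolutional) RBM with binary hidden units is F(v) = -a·v - sum_j log(1 + exp(b_j + W_j·v)); lower values mean that the model assigns a higher probability to the input. The sketch below evaluates this textbook formula on a toy model. It is not SECOMO's implementation, which operates on convolutional weights and may include further terms.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def rbm_free_energy(v, W, a, b):
    """Textbook RBM free energy for one visible vector v.

    W[j][i] couples hidden unit j to visible unit i; a and b are the
    visible and hidden biases.
    """
    visible_term = sum(ai * vi for ai, vi in zip(a, v))
    hidden_term = sum(
        softplus(bj + sum(wji * vi for wji, vi in zip(Wj, v)))
        for Wj, bj in zip(W, b)
    )
    return -visible_term - hidden_term


# Toy model: 3 visible units, 2 hidden units, all parameters zero.
v = [1, 0, 1]
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
F = rbm_free_energy(v, W, a=[0.0] * 3, b=[0.0] * 2)
# With zero parameters each hidden unit contributes log(2), so F = -2 * log(2).
```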

getPFMs()[source]¶

Returns the weight matrices converted to position frequency matrices.

returns : numpy-array: List of position frequency matrices as numpy arrays.
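
One common way to turn a real-valued weight matrix into a position frequency matrix is a softmax over the nucleotides at each position, so that every column sums to one. The sketch below works under that assumption; the exact transformation used by getPFMs() is not specified in this reference, and the function name weights_to_pfm is hypothetical.

```python
import math


def weights_to_pfm(weights):
    """Column-wise softmax of a 4 x motif_length weight matrix."""
    n_rows, n_cols = len(weights), len(weights[0])
    pfm = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        column = [weights[i][j] for i in range(n_rows)]
        shift = max(column)  # subtract the maximum for numerical stability
        exps = [math.exp(w - shift) for w in column]
        total = sum(exps)
        for i in range(n_rows):
            pfm[i][j] = exps[i] / total
    return pfm


# Rows correspond to A, C, G, T (assumed order); columns to motif positions.
weights = [
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
pfm = weights_to_pfm(weights)
```

A column with identical weights, such as the second one here, maps to a uniform distribution over the four nucleotides.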

classmethod loadModel(filename)[source]¶

Load a model from a given pickle file.

filename : str: Pickle file containing the model parameters.
returns : CRBM object: An instance of CRBM with reloaded parameters.

motifHitProbs(data)[source]¶

Motif match probabilities.

data : numpy-array: 4D numpy array representing a DNA sequence in one-hot encoding. See secomo.sequences.seqToOneHot().
returns : numpy-array: Per-position motif match probabilities of all motifs as numpy array.
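
Conceptually, a per-position match probability can be obtained by sliding a motif's weight matrix along the sequence, summing the weights of the observed nucleotides in each window, adding a bias, and applying a logistic function. The sketch below does this for a single motif on the forward strand only. The function name, the bias parameter, and the omission of the reverse strand are assumptions for illustration, not the behaviour of motifHitProbs().

```python
import math

ALPHABET = "ACGT"  # assumed row order of the weight matrix


def hit_probs(seq, weights, bias=0.0):
    """Logistic match probability for every window of an ACGT-only sequence."""
    motif_length = len(weights[0])
    probs = []
    for start in range(len(seq) - motif_length + 1):
        window = seq[start:start + motif_length]
        score = bias + sum(
            weights[ALPHABET.index(base)][offset]
            for offset, base in enumerate(window)
        )
        probs.append(1.0 / (1.0 + math.exp(-score)))
    return probs


# Toy motif of length 3 that rewards the word "TAT".
weights = [
    [-2.0, 2.0, -2.0],   # A
    [-2.0, -2.0, -2.0],  # C
    [-2.0, -2.0, -2.0],  # G
    [2.0, -2.0, 2.0],    # T
]
probs = hit_probs("GGTATCC", weights, bias=-3.0)
best = max(range(len(probs)), key=probs.__getitem__)  # index of the best window
```

A sequence of length L scanned with a motif of length m yields L - m + 1 probabilities in this sketch, one per window.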

saveModel(filename)[source]¶

Save the model parameters and additional hyper-parameters.

filename : str: Pickle filename where the model parameters are stored.
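
Saving and reloading can be combined into a simple round trip. In the sketch below, StubModel is a stand-in that only mimics the documented saveModel/loadModel interface so that the example runs without SECOMO; the helper roundtrip is likewise hypothetical and would work the same way with a CRBM instance. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile


class StubModel:
    """Stand-in for CRBM; not part of SECOMO."""

    def __init__(self, num_motifs, motif_length):
        self.num_motifs = num_motifs
        self.motif_length = motif_length

    def saveModel(self, filename):
        with open(filename, "wb") as handle:
            pickle.dump(self.__dict__, handle)

    @classmethod
    def loadModel(cls, filename):
        with open(filename, "rb") as handle:
            params = pickle.load(handle)
        return cls(**params)


def roundtrip(model, filename):
    """Save a model, then reload it through the class method."""
    model.saveModel(filename)
    return type(model).loadModel(filename)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "model.pkl")
    restored = roundtrip(StubModel(num_motifs=10, motif_length=15), path)
```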

Utils¶

This part presents functions contained in secomo.utils that help you investigate the results of a trained SECOMO model. It features generating position frequency matrices, sequence logos and clustering plots.

`saveMotifs`(model, path[, name, fformat])	Save weight-matrices as PFMs.
`createSeqLogos`(model, path[, fformat])	Create sequence logos for all cRBM motifs.
`createSeqLogo`(pfm, filename[, fformat])	Create sequence logo for an individual cRBM motif.
`positionalDensityPlot`(model, seqs[, filename])	Positional enrichment of the motifs.
`runTSNE`(model, seqs)	Run t-SNE on the motif abundances.
`tsneScatter`(data[, lims, colors, filename, ...])	Scatter plot of t-SNE clustering.
`tsneScatterWithPies`(model, seqs, tsne[, ...])	Scatter plot of t-SNE clustering.
`violinPlotMotifMatches`(model, data[, filename])	Violin plot of motif abundances.

secomo.utils.saveMotifs(model, path, name='mot', fformat='jaspar')[source]¶

Save weight-matrices as PFMs.

This function converts the cRBM weight matrices to position frequency matrices and stores each matrix in its own file with the extension .pfm.

model : CRBM object: A cRBM object.
path : str: Directory in which the PFMs should be stored.
name : str: File prefix for the motif files. Default: ‘mot’.
fformat : str: File format of the motifs. Either ‘jaspar’ or ‘tab’. Default: ‘jaspar’.

secomo.utils.positionalDensityPlot(model, seqs, filename=None)[source]¶

Positional enrichment of the motifs.

This function creates a figure that illustrates the positional enrichment of all cRBM motifs in the given set of sequences.

model : CRBM object: A cRBM object
seqs : numpy-array: A set of DNA sequences in one-hot encoding. See secomo.sequences.seqToOneHot()
filename : str: Filename for storing the figure. If filename = None, no figure will be stored.

secomo.utils.runTSNE(model, seqs)[source]¶

Run t-SNE on the motif abundances.

This function clusters the sequences with t-SNE, based on the motif matches in the sequences. The sequences are projected onto a 2D plane in which similar sequences are located in close proximity.

model : CRBM object: A cRBM object
seqs : numpy-array: A set of DNA sequences in one-hot encoding. See secomo.sequences.seqToOneHot()

secomo.utils.tsneScatter(data, lims=None, colors=None, filename=None, legend=True)[source]¶

Scatter plot of t-SNE clustering.

data : dict: Dictionary containing the dataset name (keys) and data itself (values). The data is assumed to have been generated using runTSNE().
lims : tuple: Optional parameter containing the x- and y-limits for the figure. If None, the limits are automatically determined. For example: lims = ([xmin, ymin], [xmax, ymax])
colors : matplotlib.cm: Optional colormap to illustrate the datapoints. If None, a default colormap will be used.
filename : str: Filename for storing the figure. Default: None, in which case the figure is not stored but displayed directly.
legend : bool: Include the legend into the figure. Default: True
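
The data dictionary and the lims tuple can be prepared as follows. This is a sketch only: the coordinates are made-up values standing in for runTSNE() output, and padded_limits is a hypothetical helper that derives limits in the documented ([xmin, ymin], [xmax, ymax]) layout.

```python
def padded_limits(datasets, margin=0.05):
    """Return ([xmin, ymin], [xmax, ymax]) covering all points plus a margin."""
    xs = [x for points in datasets.values() for x, _ in points]
    ys = [y for points in datasets.values() for _, y in points]
    pad_x = (max(xs) - min(xs)) * margin
    pad_y = (max(ys) - min(ys)) * margin
    return ([min(xs) - pad_x, min(ys) - pad_y],
            [max(xs) + pad_x, max(ys) + pad_y])


# Hypothetical 2D coordinates, keyed by dataset name.
data = {
    "Oct4": [(0.0, 1.0), (2.0, 3.0)],
    "Control": [(-2.0, 0.0), (1.0, -1.0)],
}
lims = padded_limits(data, margin=0.25)
print(lims)  # ([-3.0, -2.0], [3.0, 4.0])

# With SECOMO installed, the figure would then be drawn as:
#   from secomo.utils import tsneScatter
#   tsneScatter(data, lims=lims, filename=None)
```

Passing the same lims to several figures keeps their axes comparable.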

secomo.utils.createSeqLogos(model, path, fformat='eps')[source]¶

Create sequence logos for all cRBM motifs.

model : CRBM object: A cRBM object.
path : str: Output folder.
fformat : str: File format for storing the sequence logos. Default: ‘eps’.

secomo.utils.createSeqLogo(pfm, filename, fformat='eps')[source]¶

Create sequence logo for an individual cRBM motif.

pfm : numpy-array: 2D numpy array representing a PFM. See CRBM.getPFMs()
filename : str: Output filename for the sequence logo.
fformat : str: File format for storing the sequence logos. Default: ‘eps’.

secomo.utils.tsneScatterWithPies(model, seqs, tsne, lims=None, filename=None)[source]¶

Scatter plot of t-SNE clustering.

This function produces a figure in which sequences are represented as pie charts in the 2D t-SNE plane obtained with runTSNE(). The pie pieces correspond to the individual cRBM motifs, and their sizes represent the enrichment/abundance of the motifs in the respective sequences.

model : CRBM object: A cRBM object
seqs : numpy-array: DNA sequences represented in one-hot encoding. See secomo.sequences.seqToOneHot().
tsne : numpy-array: 2D numpy array representing the sequences projected onto the t-SNE plane that was obtained with runTSNE().
lims : tuple: Optional parameter containing the x- and y-limits for the figure. For example: ([xmin, ymin], [xmax, ymax])
filename : str: Filename for storing the figure.

secomo.utils.violinPlotMotifMatches(model, data, filename=None)[source]¶

Violin plot of motif abundances.

This function summarizes the relative abundances of the cRBM motifs in a given set of sequences (e.g. sequences with different functions).

model : CRBM object: A cRBM object
data : dict: Dictionary whose keys are dataset names and whose values are sets of DNA sequences in one-hot encoding. See secomo.sequences.seqToOneHot().
filename : str: Filename to store the figure. Default: None, in which case the figure is displayed directly.

Sample dataset¶

The package contains a small sample dataset consisting of Oct4 ChIP-seq sequences of embryonic stem cells from ENCODE [1].

secomo.sequences.load_sample()[source]¶

Load sample sequences of Oct4 ChIP-seq peaks from H1-hESC cells from ENCODE. The sequences are converted to one-hot encoding.

returns : numpy-array: Sample DNA sequences in one-hot encoding.
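
Putting the pieces together, a typical session might look like the sketch below, which uses only the calls documented in this reference. The function run_sample_workflow is hypothetical, and the motif count and motif length are arbitrary illustration values. The imports are local so that defining the function does not fail when SECOMO is absent.

```python
def run_sample_workflow(outdir, num_motifs=10, motif_length=15, epochs=100):
    """Train a cRBM on the bundled sample data and export its motifs.

    Requires SECOMO to be installed when the function is called.
    """
    from secomo import CRBM
    from secomo.sequences import load_sample
    from secomo.utils import createSeqLogos, saveMotifs

    data = load_sample()  # sample sequences, already one-hot encoded
    model = CRBM(num_motifs=num_motifs, motif_length=motif_length, epochs=epochs)
    model.fit(data)  # no test_data: progress is reported on the training set
    saveMotifs(model, path=outdir)      # PFM files
    createSeqLogos(model, path=outdir)  # one sequence logo per motif
    return model
```

Calling run_sample_workflow with an existing output directory would train the model and write the motif files and logos there.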

[1]	ENCODE Project Consortium and others. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature.

Sequence-related utilities¶

`readSeqsFromFasta`(filename)	Read sequences from a multi-FASTA file.
`seqToOneHot`(seqs)	Converts a set of Biopython DNA sequences to one-hot encoding.
`splitTrainingTest`(filename, train_test_ratio)	Splits the sequences into a training and a test set.