Dataset

DeepLincs provides the Dataset class for wrangling L1000 data.

Dataset

class deep_lincs.dataset.Dataset(data, gene_meta, n_genes)[source]

Represents an L1000 Dataset

Parameters:
data : dataframe, shape (n_samples, (n_genes + n_metadata_fields))

A sample by gene expression matrix padded to the right with per sample metadata. Generally it is easiest to construct a Dataset from a class method, Dataset.from_yaml() or Dataset.from_dataframes().

gene_meta : dataframe, shape (n_genes, n_features)

Contains the metadata for each of the genes in the data matrix.

n_genes : int

Number of genes in expression matrix. This explicitly defines the column index which divides the expression values and metadata.

Attributes:
data : dataframe, shape (n_samples, n_genes)

A dataframe representing the sample x gene expression matrix

sample_meta : dataframe, shape (n_samples, n_metadata_features)

A dataframe representing the per sample metadata

gene_meta : dataframe, shape (n_genes, n_gene_features)

Gene metadata. Row index same as Dataset.data.columns.
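
The role of n_genes can be illustrated without the library. The following is a minimal sketch in plain Python, not DeepLincs code: the column names and values are made up, and split_columns is a hypothetical helper that only mirrors the "first n_genes columns are expression, the rest are metadata" layout described above.

```python
# Hypothetical illustration: the first n_genes columns hold expression values,
# the remaining columns hold per-sample metadata.
columns = ["5720", "466", "6009", "cell_id", "pert_type"]
row = [1.2, -0.4, 0.9, "PC3", "trt_cp"]
n_genes = 3


def split_columns(values, n_genes):
    """Split one row (or the header) at column index n_genes."""
    return values[:n_genes], values[n_genes:]


gene_columns, meta_columns = split_columns(columns, n_genes)
expression, metadata = split_columns(row, n_genes)
```

Here gene_columns and expression correspond to what Dataset.data would hold, while meta_columns and metadata correspond to Dataset.sample_meta.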

__init__(self, data, gene_meta, n_genes)[source]

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(self, data, gene_meta, n_genes) Initialize self.
copy(self) Copies Dataset to a new object
from_yaml(path[, sample_ids, only_landmark]) Dataset constructor method from yaml specification
from_dataframes(data_df, sample_meta_df, …) Dataset constructor method from multiple dataframes
sample_rows(self, size[, replace, meta_groups]) Returns a Dataset of sampled profiles
filter_rows(self, **kwargs) Returns a Dataset of filtered profiles
select_meta(self, meta_fields) Returns a Dataset with select metadata fields.
select_samples(self, sample_ids) Returns a Dataset with profiles selected by id
split(self, **kwargs) Returns a tuple of Datasets, split by inclusion criteria
dropna(self, subset[, inplace]) Drops profiles for which there is no metadata in subset
set_categorical(self, meta_field) Sets sample metadata column as categorical
normalize_by_gene(self[, normalizer]) Normalize expression by gene
train_val_test_split(self[, p1, p2]) Splits dataset into training, validation, and test datasets
to_tsv(self, out_dir[, sep, prefix]) Write Dataset object to a tsv file
one_hot_encode(self, meta_field) Return a one-hot vector for a metadata field for all profiles
plot_gene_boxplot(self, identifier[, …]) Returns a boxplot of gene expression, faceted on metadata field
plot_meta_counts(self, meta_field[, …]) Returns a barplot of a metadata field counts in Dataset
copy(self)[source]

Copies Dataset to a new object

dropna(self, subset, inplace=False)[source]

Drops profiles for which there is no metadata in subset

Parameters:
subset : str or list

Metadata field or fields.

inplace : bool (optional, default: False)

If True, do operation inplace and return None.
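
The effect of subset can be sketched in plain Python. This is an illustration of the documented behaviour, not the library's implementation: drop_missing is a hypothetical helper, samples are plain dicts, and missing metadata is represented as None.

```python
def drop_missing(samples, subset):
    """Keep only samples that have a value for every field in subset."""
    # Accept a single field name as well as a list of field names.
    if isinstance(subset, str):
        subset = [subset]
    return [s for s in samples if all(s.get(field) is not None for field in subset)]


samples = [
    {"id": "a", "cell_id": "PC3", "moa": "HDAC inhibitor"},
    {"id": "b", "cell_id": "VCAP", "moa": None},
    {"id": "c", "cell_id": None, "moa": None},
]

with_moa = drop_missing(samples, "moa")
with_cell = drop_missing(samples, ["cell_id"])
```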

filter_rows(self, **kwargs)[source]

Returns a Dataset of filtered profiles

Parameters:
kwargs :

Keyword args to subset data by specific features in sample metadata. Each keyword must be a column in the sample metadata, and its argument a str or list of values to keep from that column.

Returns:
Dataset
>>> dataset.filter_rows(cell_id=["VCAP", "PC3"])
    ..
>>> dataset.filter_rows(cell_id="VCAP", pert_type=["ctl_vehicle", "trt_cp"])
    ..
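
The keyword semantics can be sketched in plain Python. This is a hedged illustration rather than the library's code: matches is a hypothetical helper, samples are plain dicts, and combining several keywords with a logical AND is an assumption.

```python
def matches(sample_meta, **kwargs):
    """Return True if a sample's metadata satisfies every keyword filter."""
    for field, allowed in kwargs.items():
        # Treat a single string as a one-element list of allowed values.
        if isinstance(allowed, str):
            allowed = [allowed]
        if sample_meta[field] not in allowed:
            return False
    return True


samples = [
    {"cell_id": "VCAP", "pert_type": "trt_cp"},
    {"cell_id": "PC3", "pert_type": "trt_cp"},
    {"cell_id": "VCAP", "pert_type": "trt_sh"},
]

kept = [s for s in samples
        if matches(s, cell_id="VCAP", pert_type=["ctl_vehicle", "trt_cp"])]
```

Under these assumptions only the first sample is kept, because it is the only one that satisfies both the cell_id and the pert_type filter.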
classmethod from_dataframes(data_df, sample_meta_df, gene_meta_df)[source]

Dataset constructor method from multiple dataframes

Parameters:
data_df : dataframe, shape (n_samples, n_genes)

Contains the expression data from experiment. Must have shared row index with sample_meta_df.

sample_meta_df : dataframe, shape (n_samples, n_meta_features)

Contains the metadata for each of the samples in experiment.

gene_meta_df : dataframe, shape (n_genes, n_gene_features)

Contains the metadata for each of the genes in experiment.
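
The shape requirements above amount to an alignment check between the three inputs. The following plain-Python sketch shows what such a check could look like; check_alignment is a hypothetical helper using dicts in place of dataframes, and it is not part of DeepLincs.

```python
def check_alignment(data, sample_meta, gene_meta, gene_ids):
    """Validate that three tables describe the same samples and genes.

    data: dict mapping sample id -> list of expression values
    sample_meta: dict mapping sample id -> dict of sample metadata
    gene_meta: dict mapping gene id -> dict of gene metadata
    gene_ids: column order of the expression values
    """
    if set(data) != set(sample_meta):
        raise ValueError("data and sample_meta must share the same row index")
    if set(gene_ids) != set(gene_meta):
        raise ValueError("gene_meta rows must match the expression columns")
    for sample_id, values in data.items():
        if len(values) != len(gene_ids):
            raise ValueError(f"{sample_id}: expected {len(gene_ids)} values")
    return True


gene_ids = ["5720", "466"]
data = {"s1": [1.0, 2.0], "s2": [0.5, 0.1]}
sample_meta = {"s1": {"cell_id": "PC3"}, "s2": {"cell_id": "VCAP"}}
gene_meta = {"5720": {"gene_name": "Gene A"}, "466": {"gene_name": "Gene B"}}

aligned = check_alignment(data, sample_meta, gene_meta, gene_ids)
```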

classmethod from_yaml(path, sample_ids=None, only_landmark=True, **filter_kwargs)[source]

Dataset constructor method from yaml specification

Parameters:
path : str

Valid string path to .yaml or .yml file.

sample_ids : list (optional, default None)

Unique sample ids to read from data and metadata files.

only_landmark : bool (optional, default True)

Whether to parse all genes or only the landmark.

filter_kwargs :

Optional keyword args to subset data by specific features in per sample metadata. Each keyword must be a column in the sample metadata, and its argument a list of values to keep from that column.

Returns:
Dataset
>>> Dataset.from_yaml("settings.yaml", cell_id=["MCF7", "PC3"], pert_type=["trt_cp"])
    ..
normalize_by_gene(self, normalizer='standard_scale')[source]

Normalize expression by gene

Parameters:
normalizer : str or func (optional, default ‘standard_scale’)

Method used to normalize the dataset. Valid str options are ‘standard_scale’ and ‘z_score’. If a function is provided, it must take one argument (array), and return an array of the same dimensions.

Returns:
None
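
A custom normalizer only has to map an array to an array of the same dimensions. The sketch below shows one such function in plain Python, using lists in place of arrays. It is an assumption-laden illustration: z_score here is a hypothetical user-defined function, not the library's built-in ‘z_score’ option, and it normalizes the values of a single gene.

```python
from statistics import mean, pstdev


def z_score(values):
    """Center one gene's expression values and scale them to unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant gene has no variation; return zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


gene_expression = [2.0, 4.0, 6.0]
scaled = z_score(gene_expression)
```

The output has the same length as the input, a mean of zero, and a population standard deviation of one, which is the contract a callable normalizer needs to respect.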
one_hot_encode(self, meta_field)[source]

Return a one-hot vector for a metadata field for all profiles

Parameters:
meta_field : str

Valid sample metadata column.

Returns:
one_hot : array, shape (n_samples, n_categories)
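
The shape of the returned array can be illustrated with a small plain-Python sketch. This is not the library's implementation: one_hot is a hypothetical helper, nested lists stand in for the array, and ordering the categories alphabetically is an assumption.

```python
def one_hot(labels):
    """Return (categories, matrix) with matrix of shape (n_samples, n_categories)."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    matrix = [
        [1 if index[label] == j else 0 for j in range(len(categories))]
        for label in labels
    ]
    return categories, matrix


# Four samples drawn from three cell lines give a 4 x 3 matrix.
categories, encoded = one_hot(["PC3", "VCAP", "PC3", "MCF7"])
```

Each row contains exactly one 1, in the column of that sample's category.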

plot_gene_boxplot(self, identifier, lookup_col=None, meta_field=None, extent=1.5)[source]

Returns a boxplot of gene expression, faceted on metadata field

Parameters:
identifier : str

String identifier for gene. By default, it should be one of self.gene_meta.index.

lookup_col : str (optional, default None)

Gene metadata column name. If provided, identifier is looked up in this column rather than in the index.

meta_field : str (optional, default None)

Sample metadata column name. If provided, one boxplot is drawn for each category of this field.

extent : str or float (optional, default 1.5)

Can be either 'min-max', with whiskers covering the entire domain, or a number X, where entries outside X stds are shown as individual points.

Returns:
altair.Chart object
>>> dataset.plot_gene_boxplot("Gene A", lookup_col="gene_name", meta_field="cell_id")
    ..
>>> dataset.plot_gene_boxplot("5270")  # distribution for gene_id == '5270'
    ..
plot_meta_counts(self, meta_field, normalize=False, sort_values=True)[source]

Returns a barplot of a metadata field counts in Dataset

Parameters:
meta_field : str

Valid sample metadata column.

normalize : bool (optional, default False)

Whether to show counts or normalize to frequencies.

sort_values : bool (optional, default True)

Whether to sort barchart by counts/frequencies.

Returns:
altair.Chart object
>>> dataset.plot_meta_counts("cell_id", normalize=True)  # barplot of cell_id frequencies
sample_rows(self, size, replace=False, meta_groups=None)[source]

Returns a Dataset of sampled profiles

Parameters:
size : int

Number of samples to return per metadata grouping. If meta_groups is not provided, samples are drawn from all profiles.

replace : bool (optional, default False)

Sample with or without replacement.

meta_groups : str or list (optional, default None)

If provided, equal numbers of profiles are returned for each metadata grouping.

Returns:
Dataset
>>> dataset.sample_rows(size=5000, meta_groups="cell_id")
    # returns 5000 profiles for each cell_id in dataset
>>> dataset.sample_rows(size=5000, meta_groups=["cell_id", "pert_type"])
    # returns 5000 profiles for each combination of cell_id and pert_type
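
Sampling an equal number of profiles per metadata group can be sketched in plain Python. The helper below is hypothetical and only mirrors the documented behaviour for a single grouping field, without replacement; the seed parameter is an addition for reproducibility and is not part of the Dataset API.

```python
import random
from collections import defaultdict


def sample_per_group(samples, size, group_field, seed=0):
    """Draw `size` samples without replacement from each value of group_field."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[group_field]].append(sample)
    sampled = []
    for members in groups.values():
        # random.sample raises ValueError if a group has fewer than `size` members.
        sampled.extend(rng.sample(members, size))
    return sampled


samples = [{"id": i, "cell_id": "PC3" if i % 2 else "VCAP"} for i in range(10)]
subset = sample_per_group(samples, size=2, group_field="cell_id")
```

With two cell lines and size=2, the result holds four profiles, two per cell line.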
select_meta(self, meta_fields)[source]

Returns a Dataset with select metadata fields.

Parameters:
meta_fields : list

Desired metadata columns.

Returns:
Dataset
>>> dataset.select_meta(["cell_id", "pert_id", "moa"])
    # returns dataset with only ["cell_id", "pert_id", "moa"] as metadata fields.
select_samples(self, sample_ids)[source]

Returns a Dataset with profiles selected by id

Parameters:
sample_ids : list or character array

Desired sample ids to filter dataset.

Returns:
Dataset
set_categorical(self, meta_field)[source]

Sets sample metadata column as categorical

Parameters:
meta_field : str

Sample metadata column name.

split(self, **kwargs)[source]

Returns a tuple of Datasets, split by inclusion criteria

Parameters:
kwargs :

Keyword args to subset data by specific features in sample metadata. Each kwarg must follow the following. keyword: a column in metadata, arg: a str or list of values to filter from keyword field.

Returns:
Dataset, Dataset
>>> pc3, not_pc3 = dataset.split(cell_id="PC3")
    ..
>>> vcap_mcf7, not_vcap_mcf7 = dataset.split(cell_id=["VCAP", "MCF7"])
    ..
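
The two returned Datasets are complementary: every profile lands in exactly one of them. A minimal plain-Python sketch of that partition follows, with split_by as a hypothetical helper operating on dicts rather than on a Dataset.

```python
def split_by(samples, **kwargs):
    """Partition samples into (matching, non-matching) by inclusion criteria."""
    selected, rest = [], []
    for sample in samples:
        included = all(
            sample[field] in ([allowed] if isinstance(allowed, str) else allowed)
            for field, allowed in kwargs.items()
        )
        (selected if included else rest).append(sample)
    return selected, rest


samples = [{"cell_id": c} for c in ["PC3", "VCAP", "MCF7", "PC3"]]
pc3, not_pc3 = split_by(samples, cell_id="PC3")
vcap_mcf7, not_vcap_mcf7 = split_by(samples, cell_id=["VCAP", "MCF7"])
```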
to_tsv(self, out_dir, sep='\t', prefix=None, **kwargs)[source]

Write Dataset object to a tsv file

Parameters:
out_dir : str

Path to output directory.

sep : str (optional, default '\t')

String of length 1. Field delimiter for the output file.

prefix : str (optional, default None)

Filename prefix.

train_val_test_split(self, p1=0.2, p2=0.2)[source]

Splits dataset into training, validation, and test datasets

Parameters:
p1 : float (optional, default 0.2)

Test size in first train/test split.

p2 : float (optional, default 0.2)

Validation size in remaining train/val split.

Returns:
tuple of Datasets
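
Because p2 applies to what remains after the first split, the validation set is a fraction of the remainder, not of the whole dataset. The arithmetic can be worked through as follows; split_fractions is a hypothetical helper that assumes the two splits are applied in sequence exactly as described above.

```python
def split_fractions(p1=0.2, p2=0.2):
    """Return (train, val, test) as fractions of the full dataset."""
    test = p1                 # first split: hold out p1 for testing
    remaining = 1.0 - p1      # what is left for train + validation
    val = remaining * p2      # second split: hold out p2 of the remainder
    train = remaining - val
    return train, val, test


train, val, test = split_fractions()
```

With the defaults, this works out to roughly 64% training, 16% validation, and 20% test profiles.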
data

A dataframe representing the sample x gene expression matrix

sample_meta

A dataframe representing the per sample metadata