Dataset

DeepLincs offers Dataset to wrangle L1000 data.

class deep_lincs.dataset.Dataset(data, gene_meta, n_genes)

Represents an L1000 Dataset

Parameters:

data : dataframe, shape (n_samples, (n_genes + n_metadata_fields)): A sample by gene expression matrix padded to the right with per-sample metadata. It is generally easiest to construct a Dataset from a class method, either Dataset.from_yaml() or Dataset.from_dataframes().
gene_meta : dataframe, shape (n_genes, n_features): Contains the metadata for each of the genes in the data matrix.
n_genes : int: Number of genes in expression matrix. This explicitly defines the column index which divides the expression values and metadata.
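
The column split that `n_genes` defines can be sketched with plain pandas. The toy frame and its column names are hypothetical, and this illustrates the described layout only, not DeepLincs's internal code.

```python
import pandas as pd

# Hypothetical toy input: 3 samples, 2 gene columns, then 2 metadata columns.
combined = pd.DataFrame(
    {
        "gene_a": [0.1, 0.5, 0.9],
        "gene_b": [1.2, 0.7, 0.3],
        "cell_id": ["PC3", "VCAP", "PC3"],
        "pert_type": ["trt_cp", "ctl_vehicle", "trt_cp"],
    },
    index=["s1", "s2", "s3"],
)
n_genes = 2

# Columns before position n_genes hold expression values;
# everything to the right is per-sample metadata.
expression = combined.iloc[:, :n_genes]
sample_meta = combined.iloc[:, n_genes:]
```

If `n_genes` is off by one, a gene column ends up in the metadata or a metadata column ends up in the expression matrix, which is why the constructor asks for the value explicitly.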

Attributes:

data : dataframe, shape (n_samples, n_genes): A dataframe representing the sample x gene expression matrix
sample_meta : dataframe, shape (n_samples, n_metadata_features): A dataframe representing the per sample metadata
gene_meta : dataframe, shape (n_genes, n_gene_features): Gene metadata. Row index same as Dataset.data.columns.

__init__(self, data, gene_meta, n_genes): Initialize self. See help(type(self)) for accurate signature.

Methods

`__init__`(self, data, gene_meta, n_genes)	Initialize self.
`copy`(self)	Copies Dataset to a new object
`from_yaml`(path[, sample_ids, only_landmark])	Dataset constructor method from yaml specification
`from_dataframes`(data_df, sample_meta_df, …)	Dataset constructor method from multiple dataframes
`sample_rows`(self, size[, replace, meta_groups])	Returns a Dataset of sampled profiles
`filter_rows`(self, **kwargs)	Returns a Dataset of filtered profiles
`select_meta`(self, meta_fields)	Returns a Dataset with select metadata fields.
`select_samples`(self, sample_ids)	Returns a Dataset with profiles selected by id
`split`(self, **kwargs)	Returns a tuple of Datasets, split by inclusion criteria
`dropna`(self, subset[, inplace])	Drops profiles for which there is no metadata in subset
`set_categorical`(self, meta_field)	Sets sample metadata column as categorical
`normalize_by_gene`(self[, normalizer])	Normalize expression by gene
`train_val_test_split`(self[, p1, p2])	Splits dataset into training, validation, and test datasets
`to_tsv`(self, out_dir[, sep, prefix])	Write Dataset object to a tsv file
`one_hot_encode`(self, meta_field)	Return a one-hot vector for a metadata field for all profiles
`plot_gene_boxplot`(self, identifier[, …])	Returns a boxplot of gene expression, faceted on metadata field
`plot_meta_counts`(self, meta_field[, …])	Returns a barplot of a metadata field counts in Dataset

copy(self): Copies Dataset to a new object

dropna(self, subset, inplace=False)

Drops profiles for which there is no metadata in subset

Parameters:

subset : `str` or `list`: Metadata field or fields.
inplace : `bool` (optional, default `False`): If True, perform the operation in place and return None.
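
As a rough illustration in plain pandas, dropping profiles with missing metadata amounts to filtering the metadata and keeping the expression rows aligned with it. The toy data and variable names are hypothetical, and whether DeepLincs drops a profile when any or all of the listed fields are missing is an assumption here (the sketch uses "any").

```python
import pandas as pd

# Hypothetical per-sample metadata containing missing values.
sample_meta = pd.DataFrame(
    {"cell_id": ["PC3", None, "VCAP"], "moa": ["inhibitor", "agonist", None]},
    index=["s1", "s2", "s3"],
)
expression = pd.DataFrame(
    {"gene_a": [0.1, 0.5, 0.9], "gene_b": [1.2, 0.7, 0.3]},
    index=["s1", "s2", "s3"],
)

subset = "moa"
# Accept a single field name or a list of field names.
fields = [subset] if isinstance(subset, str) else list(subset)

# Assumption: a profile is dropped if ANY requested field is missing.
kept_meta = sample_meta.dropna(subset=fields)
# Keep only the expression rows that still have metadata.
kept_expression = expression.loc[kept_meta.index]
```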

filter_rows(self, **kwargs)

Returns a Dataset of filtered profiles

Parameters:

kwargs : Keyword args to subset data by specific features in sample metadata. Each kwarg must take the following form. `keyword`: a column in metadata, `arg`: a value or list of values to keep from the keyword field.

Returns:

`Dataset`

>>> dataset.filter_rows(cell_id=["VCAP", "PC3"])
    ..

>>> dataset.filter_rows(cell_id="VCAP", pert_type=["ctl_vehicle", "trt_cp"])
    ..
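
The keyword filtering described above can be approximated with boolean masks in pandas. This is a minimal sketch under assumptions: `filter_meta` is a hypothetical helper, not part of DeepLincs, and combining several keywords with a logical AND is assumed rather than confirmed by the documentation.

```python
import pandas as pd

# Hypothetical per-sample metadata.
sample_meta = pd.DataFrame(
    {
        "cell_id": ["VCAP", "PC3", "MCF7", "VCAP"],
        "pert_type": ["trt_cp", "ctl_vehicle", "trt_cp", "trt_sh"],
    },
    index=["s1", "s2", "s3", "s4"],
)


def filter_meta(meta, **kwargs):
    """Keep rows whose metadata matches every keyword filter (assumed AND)."""
    mask = pd.Series(True, index=meta.index)
    for field, values in kwargs.items():
        # Treat a single string as shorthand for a one-element list.
        if isinstance(values, str):
            values = [values]
        mask &= meta[field].isin(values)
    return meta[mask]


kept = filter_meta(sample_meta, cell_id="VCAP", pert_type=["ctl_vehicle", "trt_cp"])
```

Note that a single string such as `"VCAP, PC3"` names one value, not two, so each value needs its own list entry.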

classmethod from_dataframes(data_df, sample_meta_df, gene_meta_df)

Dataset constructor method from multiple dataframes

Parameters:

data_df : dataframe, shape (n_samples, n_genes): Contains the expression data from experiment. Must have shared row index with sample_meta_df.
sample_meta_df : dataframe, shape (n_samples, n_meta_features): Contains the metadata for each of the samples in experiment.
gene_meta_df : dataframe, shape (n_genes, n_gene_features): Contains the metadata for each of the genes in experiment.
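
The shared row index requirement can be illustrated with a pandas join. The frames below are hypothetical toy data, and the join only sketches how expression and metadata could be combined into the padded layout the constructor expects; it is not taken from the DeepLincs source.

```python
import pandas as pd

# Hypothetical inputs that share sample ids but not row order.
data_df = pd.DataFrame(
    {"gene_a": [0.1, 0.5], "gene_b": [1.2, 0.7]}, index=["s1", "s2"]
)
sample_meta_df = pd.DataFrame({"cell_id": ["PC3", "VCAP"]}, index=["s2", "s1"])

# join() aligns rows on the index, so differing row order is harmless.
combined = data_df.join(sample_meta_df, how="inner")

# The number of expression columns marks where metadata begins.
n_genes = data_df.shape[1]
```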

classmethod from_yaml(path, sample_ids=None, only_landmark=True, **filter_kwargs)

Dataset constructor method from yaml specification

Parameters:

path : str: Valid string path to .yaml or .yml file.
sample_ids : list (optional, default None): Unique sample ids to read from data and metadata files.
only_landmark : bool (optional, default True): Whether to parse all genes or only the landmark genes.
filter_kwargs : Optional keyword args to subset data by specific features in per-sample metadata. Each kwarg must take the following form. keyword: a column in metadata, arg: a list of values to keep from the keyword field.

Returns:

Dataset

>>> Dataset.from_yaml("settings.yaml", cell_id=["MCF7", "PC3"], pert_type=["trt_cp"])
    ..

normalize_by_gene(self, normalizer='standard_scale')

Normalize expression by gene

Parameters:

normalizer : `str` or `func` (optional, default 'standard_scale'): Method used to normalize the dataset. Valid str options are 'standard_scale' and 'z_score'. If a function is provided, it must take one argument (`array`) and return an array of the same dimensions.

Returns:

`None`
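
A custom normalizer only needs to accept an array and return one of the same shape. The function below is a hypothetical example of such a callable using NumPy. It does not reproduce the built-in 'standard_scale' or 'z_score' options, and whether DeepLincs passes the full matrix or one gene at a time is not stated, so the sketch works along axis 0 to handle both cases.

```python
import numpy as np


def z_score_by_gene(array):
    """Center each gene (column) to mean 0 and scale it to unit variance."""
    array = np.asarray(array, dtype=float)
    # axis=0 computes per-column statistics for a 2-D matrix and plain
    # scalars for a 1-D vector, so both input shapes are handled.
    # Genes with zero variance would divide by zero and need special care.
    return (array - array.mean(axis=0)) / array.std(axis=0)


# Hypothetical expression matrix: 3 samples x 2 genes.
expr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = z_score_by_gene(expr)
```

Under these assumptions it would be passed as `dataset.normalize_by_gene(normalizer=z_score_by_gene)`.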

one_hot_encode(self, meta_field)

Return a one-hot vector for a metadata field for all profiles

Parameters:

meta_field : `str`: Valid sample metadata column.

Returns:

one_hot : `array`, shape (n_samples, n_categories)
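
To make the returned shape concrete, here is a small sketch of one-hot encoding a metadata column with `pandas.get_dummies`. The toy values are hypothetical, and the column ordering shown is pandas's (sorted categories), which may differ from the ordering DeepLincs uses.

```python
import pandas as pd

# Hypothetical metadata column for four profiles.
cell_ids = pd.Series(["PC3", "VCAP", "PC3", "MCF7"], name="cell_id")

# One column per category; each row has a single 1 marking its category.
dummies = pd.get_dummies(cell_ids)
one_hot = dummies.to_numpy(dtype=float)
categories = list(dummies.columns)
```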

plot_gene_boxplot(self, identifier, lookup_col=None, meta_field=None, extent=1.5)

Returns a boxplot of gene expression, faceted on metadata field

Parameters:

identifier : str: String identifier for gene. By default, must be one of self.gene_meta.index.
lookup_col : str (optional, default None): Gene metadata column name. Will be used to lookup identifier param rather than index.
meta_field : str (optional, default None): Sample metadata column name. Will make multiple boxplots for each metadata category.
extent : str or float (optional, default 1.5): Can be either 'min-max', with whiskers covering the entire domain, or a number X, where entries outside X stds are shown as individual points.

Returns:

altair.Chart object

>>> dataset.plot_gene_boxplot("Gene A", lookup_col="gene_name", meta_field="cell_id")
    ..

>>> dataset.plot_gene_boxplot("5270")  # distribution for gene_id == '5270'
    ..

plot_meta_counts(self, meta_field, normalize=False, sort_values=True)

Returns a barplot of a metadata field counts in Dataset

Parameters:

meta_field : `str`: Valid sample metadata column.
normalize : `bool` (optional, default `False`): Whether to show counts or normalize to frequencies.
sort_values : `bool` (optional, default `True`): Whether to sort barchart by counts/frequencies.

Returns:

`altair.Chart` object

>>> dataset.plot_meta_counts("cell_id", normalize=True)  # barplot of cell_id frequencies

sample_rows(self, size, replace=False, meta_groups=None)

Returns a Dataset of sampled profiles

Parameters:

size : `int`: Number of profiles to return per metadata grouping. If meta_groups is not provided, profiles are sampled from the whole dataset.
replace : `bool` (optional, default `False`): Sample with or without replacement.
meta_groups : `str` or `list` (optional, default `None`): If provided, equal numbers of profiles are returned for each metadata grouping.

Returns:

`Dataset`

>>> dataset.sample_rows(size=5000, meta_groups="cell_id")  # returns 5000 profiles for each cell_id in dataset
>>> dataset.sample_rows(size=5000, meta_groups=["cell_id", "pert_type"])  # returns 5000 profiles for each grouping of cell_id and pert_type
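
Per-group sampling can be approximated in pandas with `groupby(...).sample(...)` (available in pandas 1.1 and later). The toy metadata is hypothetical, and the sketch shows the idea of drawing equal numbers per group rather than the library's own code.

```python
import pandas as pd

# Hypothetical metadata: 4 PC3 profiles and 3 VCAP profiles.
sample_meta = pd.DataFrame(
    {"cell_id": ["PC3"] * 4 + ["VCAP"] * 3},
    index=[f"s{i}" for i in range(7)],
)

size = 2
# Draw `size` profiles from every cell_id group, without replacement.
sampled = sample_meta.groupby("cell_id").sample(
    n=size, replace=False, random_state=0
)
counts = sampled["cell_id"].value_counts()
```

In pandas, asking for more rows than a group contains raises a `ValueError` unless `replace=True`, so small groups limit the usable `size` when sampling without replacement.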

select_meta(self, meta_fields)

Returns a Dataset with select metadata fields.

Parameters:

meta_fields : `list`: Desired metadata columns.

Returns:

`Dataset`

>>> dataset.select_meta(["cell_id", "pert_id", "moa"])  # returns dataset with only ["cell_id", "pert_id", "moa"] as metadata fields

select_samples(self, sample_ids)

Returns a Dataset with profiles selected by id

Parameters:

sample_ids : `list` or character `array`: Desired sample ids to filter dataset.

Returns:

`Dataset`

set_categorical(self, meta_field)

Sets sample metadata column as categorical

Parameters:

meta_field : `str`: Sample metadata column name.

split(self, **kwargs)

Returns a tuple of Datasets, split by inclusion criteria

Parameters:

kwargs : Keyword args to subset data by specific features in sample metadata. Each kwarg must take the following form. `keyword`: a column in metadata, `arg`: a str or list of values to select from the keyword field.

Returns:

`Dataset`, `Dataset`

>>> pc3, not_pc3 = dataset.split(cell_id="PC3")
    ..

>>> vcap_mcf7, not_vcap_mcf7 = dataset.split(cell_id=["VCAP", "MCF7"])
    ..
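
Conceptually, a split pairs the selected profiles with their complement. A minimal pandas sketch with hypothetical toy metadata, shown for illustration only:

```python
import pandas as pd

# Hypothetical per-sample metadata.
sample_meta = pd.DataFrame(
    {"cell_id": ["PC3", "VCAP", "MCF7", "PC3"]},
    index=["s1", "s2", "s3", "s4"],
)

# Rows meeting the inclusion criteria go to the first part,
# and all remaining rows go to the second.
mask = sample_meta["cell_id"].isin(["VCAP", "MCF7"])
selected, rest = sample_meta[mask], sample_meta[~mask]
```

Every profile lands in exactly one of the two parts, so the part sizes always sum to the size of the original.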

to_tsv(self, out_dir, sep='\t', prefix=None, **kwargs)

Write Dataset object to a tsv file

Parameters:

out_dir : `str`: Path to output directory.
sep : `str` (optional, default `'\t'`): String of length 1. Field delimiter for the output file.
prefix : `str` (optional, default `None`): Filename prefix.

train_val_test_split(self, p1=0.2, p2=0.2)

Splits dataset into training, validation, and test datasets

Parameters:

p1 : `float` (optional, default `0.2`): Test size in first train/test split.
p2 : `float` (optional, default `0.2`): Validation size in remaining train/val split.

Returns:

`tuple` of `Dataset` objects
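
Because `p2` applies to what remains after the test set is removed, the effective fractions of the full dataset differ from the raw arguments. The helper below is a hypothetical illustration of that arithmetic and ignores how the library rounds to whole profiles.

```python
def split_fractions(p1=0.2, p2=0.2):
    """Return (train, val, test) as fractions of the full dataset."""
    test = p1                       # taken from the full dataset first
    val = (1 - p1) * p2             # taken from the remaining profiles
    train = (1 - p1) * (1 - p2)     # whatever is left after both splits
    return train, val, test


train, val, test = split_fractions()
```

With the defaults, roughly 64% of profiles go to training, 16% to validation, and 20% to testing.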

data: A dataframe representing the sample x gene expression matrix

sample_meta: A dataframe representing the per-sample metadata
