Dataset
DeepLincs offers Dataset to wrangle L1000 data.
class deep_lincs.dataset.Dataset(data, gene_meta, n_genes)
Represents an L1000 Dataset.
Parameters:
- data : dataframe, shape (n_samples, n_genes + n_metadata_fields)
A sample by gene expression matrix padded to the right with per-sample metadata. Generally it is easiest to construct a Dataset from a class method, Dataset.from_yaml() or Dataset.from_dataframes().
- gene_meta : dataframe, shape (n_genes, n_features)
Contains the metadata for each of the genes in the data matrix.
- n_genes : int
Number of genes in expression matrix. This explicitly defines the column index which divides the expression values and metadata.
Attributes:
- data : dataframe, shape (n_samples, n_genes)
A dataframe representing the sample x gene expression matrix.
- sample_meta : dataframe, shape (n_samples, n_metadata_features)
A dataframe representing the per-sample metadata.
- gene_meta : dataframe, shape (n_genes, n_gene_features)
Gene metadata. Row index same as Dataset.data.columns.
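The role of n_genes as a dividing column index can be illustrated with a small plain-Python sketch. The column names and values below are hypothetical, and nested lists stand in for the dataframe; this is not how DeepLincs itself stores the data.

```python
# Hypothetical toy table: each row is one sample. The first n_genes columns
# hold expression values and the remaining columns hold per-sample metadata.
columns = ["GENE_A", "GENE_B", "GENE_C", "cell_id", "pert_type"]
rows = [
    [0.1, 1.2, -0.4, "PC3", "trt_cp"],
    [0.7, -0.3, 2.0, "VCAP", "ctl_vehicle"],
]
n_genes = 3  # column index dividing expression values from metadata

# Everything left of the index is expression, everything right is metadata.
gene_columns = columns[:n_genes]
meta_columns = columns[n_genes:]
expression = [row[:n_genes] for row in rows]
sample_meta = [row[n_genes:] for row in rows]

print(gene_columns, meta_columns)
```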
-
__init__(self, data, gene_meta, n_genes)
Initialize self. See help(type(self)) for accurate signature.
Methods
__init__(self, data, gene_meta, n_genes): Initialize self.
copy(self): Copies Dataset to a new object
from_yaml(path[, sample_ids, only_landmark]): Dataset constructor method from yaml specification
from_dataframes(data_df, sample_meta_df, …): Dataset constructor method from multiple dataframes
sample_rows(self, size[, replace, meta_groups]): Returns a Dataset of sampled profiles
filter_rows(self, **kwargs): Returns a Dataset of filtered profiles
select_meta(self, meta_fields): Returns a Dataset with select metadata fields
select_samples(self, sample_ids): Returns a Dataset with profiles selected by id
split(self, **kwargs): Returns a tuple of Datasets, split by inclusion criteria
dropna(self, subset[, inplace]): Drops profiles for which there is no metadata in subset
set_categorical(self, meta_field): Sets sample metadata column as categorical
normalize_by_gene(self[, normalizer]): Normalize expression by gene
train_val_test_split(self[, p1, p2]): Splits dataset into training, validation, and test datasets
to_tsv(self, out_dir[, sep, prefix]): Write Dataset object to a tsv file
one_hot_encode(self, meta_field): Return a one-hot vector for a metadata field for all profiles
plot_gene_boxplot(self, identifier[, …]): Returns a boxplot of gene expression, faceted on metadata field
plot_meta_counts(self, meta_field[, …]): Returns a barplot of a metadata field counts in Dataset
-
dropna(self, subset, inplace=False)
Drops profiles for which there is no metadata in subset.
Parameters:
- subset : str or list
Metadata field or fields.
- inplace : bool (optional, default: False)
If True, do operation inplace and return None.
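As a rough illustration of what dropping profiles with missing metadata means, here is a plain-Python sketch. The helper name drop_missing and the dict-per-profile layout are assumptions made for this example, and None stands in for a missing value; DeepLincs operates on dataframes instead.

```python
def drop_missing(profiles, subset):
    """Keep only profiles that have a value for every field in subset.

    subset may be a single field name or a list of field names.
    """
    fields = [subset] if isinstance(subset, str) else list(subset)
    return [p for p in profiles if all(p.get(f) is not None for f in fields)]


# Hypothetical profiles; the second has no "moa" annotation.
profiles = [
    {"id": "s1", "cell_id": "PC3", "moa": "HDAC inhibitor"},
    {"id": "s2", "cell_id": "VCAP", "moa": None},
    {"id": "s3", "cell_id": None, "moa": "kinase inhibitor"},
]

kept_single = drop_missing(profiles, "moa")
kept_multi = drop_missing(profiles, ["cell_id", "moa"])
print([p["id"] for p in kept_single], [p["id"] for p in kept_multi])
```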
-
filter_rows(self, **kwargs)
Returns a Dataset of filtered profiles.
Parameters:
- kwargs :
Keyword args to subset data by specific features in sample metadata. Each kwarg must take the following form.
keyword: a column in metadata
arg: a list of values to keep from the keyword field
Returns: Dataset
>>> dataset.filter_rows(cell_id=["VCAP", "PC3"])
>>> dataset.filter_rows(cell_id=["VCAP"], pert_type=["ctl_vehicle", "trt_cp"])
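The keyword convention above can be sketched in plain Python. The helper filter_profiles is hypothetical, profiles are modeled as dicts, and the sketch assumes that several keywords combine with a logical AND, which the description above does not state explicitly.

```python
def filter_profiles(profiles, **kwargs):
    """Keep profiles whose metadata matches every keyword.

    Each keyword names a metadata field; its argument lists the accepted
    values. A bare string is treated as a one-element list.
    """
    kept = list(profiles)
    for field, allowed in kwargs.items():
        allowed = {allowed} if isinstance(allowed, str) else set(allowed)
        kept = [p for p in kept if p[field] in allowed]
    return kept


profiles = [
    {"id": "s1", "cell_id": "VCAP", "pert_type": "trt_cp"},
    {"id": "s2", "cell_id": "PC3", "pert_type": "ctl_vehicle"},
    {"id": "s3", "cell_id": "MCF7", "pert_type": "trt_cp"},
    {"id": "s4", "cell_id": "VCAP", "pert_type": "trt_sh"},
]

by_cell = filter_profiles(profiles, cell_id=["VCAP", "PC3"])
by_both = filter_profiles(profiles, cell_id=["VCAP"], pert_type=["ctl_vehicle", "trt_cp"])
print([p["id"] for p in by_cell], [p["id"] for p in by_both])
```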
-
classmethod from_dataframes(data_df, sample_meta_df, gene_meta_df)
Dataset constructor method from multiple dataframes.
Parameters:
- data_df : dataframe, shape (n_samples, n_genes)
Contains the expression data from experiment. Must have shared row index with sample_meta_df.
- sample_meta_df : dataframe, shape (n_samples, n_meta_features)
Contains the metadata for each of the samples in experiment.
- gene_meta_df : dataframe, shape (n_genes, n_gene_features)
Contains the metadata for each of the genes in experiment.
-
classmethod from_yaml(path, sample_ids=None, only_landmark=True, **filter_kwargs)
Dataset constructor method from yaml specification.
Parameters:
- path : str
Valid string path to .yaml or .yml file.
- sample_ids : list (optional, default None)
Unique sample ids to read from data and metadata files.
- only_landmark : bool (optional, default True)
If True, parse only the landmark genes; otherwise parse all genes.
- filter_kwargs :
Optional keyword args to subset data by specific features in per-sample metadata. Each kwarg must take the following form.
keyword: a column in metadata
arg: a list of values to keep from the keyword field
Returns: Dataset
>>> Dataset.from_yaml("settings.yaml", cell_id=["MCF7", "PC3"], pert_type=["trt_cp"])
-
normalize_by_gene(self, normalizer='standard_scale')
Normalize expression by gene.
Parameters:
- normalizer : str or func (optional, default 'standard_scale')
Method used to normalize the dataset. Valid str options are 'standard_scale' and 'z_score'. If a function is provided, it must take one argument (array) and return an array of the same dimensions.
Returns: None
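A custom normalizer only has to accept one array and return an array of the same dimensions. The sketch below shows one function with that shape, a per-gene (per-column) z-score written with the standard library. It uses nested lists for illustration; the array type DeepLincs actually passes in is not specified here, and this is not claimed to match the built-in 'z_score' option.

```python
from statistics import mean, pstdev


def z_score_by_gene(array):
    """Center and scale each column (gene) of a 2-D list of floats.

    Returns a new 2-D list with the same dimensions as the input.
    """
    columns = list(zip(*array))
    means = [mean(col) for col in columns]
    # Constant columns have zero spread; divide by 1.0 instead of 0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in array
    ]


# Two samples, two genes; the second gene is constant across samples.
expression = [[1.0, 10.0], [3.0, 10.0]]
normalized = z_score_by_gene(expression)
print(normalized)
```

Under the assumption that the function's input and output types match what the library expects, such a function would be passed as the normalizer argument in place of a string.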
-
one_hot_encode(self, meta_field)
Return a one-hot vector for a metadata field for all profiles.
Parameters:
- meta_field : str
Valid sample metadata column.
Returns:
- one_hot : array, (n_samples, n_categories)
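For readers unfamiliar with the (n_samples, n_categories) output shape, a one-hot encoding can be sketched as follows. The helper one_hot is hypothetical and orders categories alphabetically, which is an assumption; the column order DeepLincs uses is not documented here.

```python
def one_hot(values):
    """Return (encoded, categories) for a list of category labels.

    encoded has one row per value and one column per distinct category;
    each row contains a single 1 marking that value's category.
    """
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    encoded = [
        [1 if position[value] == i else 0 for i in range(len(categories))]
        for value in values
    ]
    return encoded, categories


cell_ids = ["PC3", "VCAP", "PC3", "MCF7"]
encoded, categories = one_hot(cell_ids)
print(categories)
print(encoded)
```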
-
plot_gene_boxplot(self, identifier, lookup_col=None, meta_field=None, extent=1.5)
Returns a boxplot of gene expression, faceted on metadata field.
Parameters:
- identifier : str
String identifier for gene. By default, it should be one of self.gene_meta.index.
- lookup_col : str (optional, default None)
Gene metadata column name. Will be used to look up the identifier param rather than the index.
- meta_field : str (optional, default None)
Sample metadata column name. Will make one boxplot for each metadata category.
- extent : str or float (optional, default 1.5)
Can be either 'min-max', with whiskers covering the entire domain, or a number X, where entries outside X stds are shown as individual points.
Returns: altair.Chart object
>>> dataset.plot_gene_boxplot("Gene A", lookup_col="gene_name", meta_field="cell_id")
>>> dataset.plot_gene_boxplot("5270")  # distribution for gene_id == '5270'
-
plot_meta_counts(self, meta_field, normalize=False, sort_values=True)
Returns a barplot of a metadata field counts in Dataset.
Parameters:
- meta_field : str
Valid sample metadata column.
- normalize : bool (optional, default False)
Whether to show counts or normalize to frequencies.
- sort_values : bool (optional, default True)
Whether to sort barchart by counts/frequencies.
Returns: altair.Chart object
>>> dataset.plot_meta_counts("cell_id", normalize=True)  # barplot of cell_id frequencies
-
sample_rows(self, size, replace=False, meta_groups=None)
Returns a Dataset of sampled profiles.
Parameters:
- size : int
Number of profiles to return per metadata grouping. If meta_groups is not provided, profiles are sampled from the whole dataset.
- replace : bool (optional, default False)
Sample with or without replacement.
- meta_groups : str or list (optional, default None)
If provided, equal numbers of profiles are returned for each metadata grouping.
Returns: Dataset
>>> dataset.sample_rows(size=5000, meta_groups="cell_id")  # returns 5000 profiles for each cell_id in dataset
>>> dataset.sample_rows(size=5000, meta_groups=["cell_id", "pert_type"])  # returns 5000 profiles for all groupings of cell_id and pert_type
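The per-group sampling described above can be sketched with the standard library. The helper sample_per_group, its seed parameter, and the dict-per-profile layout are assumptions for illustration only. Note that without replacement this sketch raises ValueError when a group has fewer than size members; how DeepLincs handles that case is not stated here.

```python
import random
from collections import defaultdict


def sample_per_group(profiles, size, replace=False, meta_groups=None, seed=0):
    """Draw `size` profiles from each metadata grouping.

    With meta_groups=None, all profiles form a single group.
    """
    rng = random.Random(seed)
    if meta_groups is None:
        groups = {None: list(profiles)}
    else:
        fields = [meta_groups] if isinstance(meta_groups, str) else list(meta_groups)
        groups = defaultdict(list)
        for profile in profiles:
            # The group key is the tuple of values for the grouping fields.
            groups[tuple(profile[f] for f in fields)].append(profile)

    sampled = []
    for members in groups.values():
        if replace:
            sampled.extend(rng.choices(members, k=size))
        else:
            sampled.extend(rng.sample(members, size))
    return sampled


profiles = [{"id": i, "cell_id": "PC3" if i < 5 else "VCAP"} for i in range(9)]
picked = sample_per_group(profiles, size=2, meta_groups="cell_id")
print(sorted(p["cell_id"] for p in picked))
```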
-
select_meta(self, meta_fields)
Returns a Dataset with select metadata fields.
Parameters:
- meta_fields : list
Desired metadata columns.
Returns: Dataset
>>> dataset.select_meta(["cell_id", "pert_id", "moa"])  # returns dataset with only ["cell_id", "pert_id", "moa"] as metadata fields
-
select_samples(self, sample_ids)
Returns a Dataset with profiles selected by id.
Parameters:
- sample_ids : list or character array
Desired sample ids to filter dataset.
Returns: Dataset
-
set_categorical(self, meta_field)
Sets sample metadata column as categorical.
Parameters:
- meta_field : str
Sample metadata column name.
-
split(self, **kwargs)
Returns a tuple of Datasets, split by inclusion criteria.
Parameters:
- kwargs :
Keyword args to subset data by specific features in sample metadata. Each kwarg must take the following form.
keyword: a column in metadata
arg: a str or list of values to match in the keyword field
Returns: Dataset, Dataset
>>> pc3, not_pc3 = dataset.split(cell_id="PC3")
>>> vcap_mcf7, not_vcap_mcf7 = dataset.split(cell_id=["VCAP", "MCF7"])
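The two-way split in the examples above can be pictured as a filter that also returns its complement. The sketch below is a plain-Python approximation; split_profiles is a hypothetical helper, and combining several keywords with AND is an assumption.

```python
def split_profiles(profiles, **kwargs):
    """Return (included, excluded) lists of profiles.

    A profile is included when, for every keyword, its value for that
    metadata field is among the accepted values (a str or a list).
    """
    criteria = {
        field: ({allowed} if isinstance(allowed, str) else set(allowed))
        for field, allowed in kwargs.items()
    }

    def matches(profile):
        return all(profile[field] in allowed for field, allowed in criteria.items())

    included = [p for p in profiles if matches(p)]
    excluded = [p for p in profiles if not matches(p)]
    return included, excluded


profiles = [
    {"id": "s1", "cell_id": "PC3"},
    {"id": "s2", "cell_id": "VCAP"},
    {"id": "s3", "cell_id": "MCF7"},
    {"id": "s4", "cell_id": "A375"},
]

pc3, not_pc3 = split_profiles(profiles, cell_id="PC3")
vcap_mcf7, not_vcap_mcf7 = split_profiles(profiles, cell_id=["VCAP", "MCF7"])
print(len(pc3), len(not_pc3), len(vcap_mcf7), len(not_vcap_mcf7))
```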
-
to_tsv(self, out_dir, sep='\t', prefix=None, **kwargs)
Write Dataset object to a tsv file.
Parameters:
- out_dir : str
Path to output directory.
- sep : str (optional)
String of length 1. Field delimiter for the output file.
- prefix : str (optional, default None)
Filename prefix.
-
train_val_test_split(self, p1=0.2, p2=0.2)
Splits dataset into training, validation, and test datasets.
Parameters:
- p1 : float (optional, default 0.2)
Test size in first train/test split.
- p2 : float (optional, default 0.2)
Validation size in remaining train/val split.
Returns: tuple of Datasets
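Because p2 applies to what remains after the first split, the final proportions are not simply p1 and p2. The sketch below works through the arithmetic in plain Python; two_stage_split, its seed parameter, and the (train, val, test) return order are assumptions, not the library's implementation.

```python
import random


def two_stage_split(items, p1=0.2, p2=0.2, seed=0):
    """Split items into (train, val, test) in two stages.

    Stage 1 holds out a fraction p1 of all items as the test set.
    Stage 2 holds out a fraction p2 of the remainder as the validation set.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * p1)
    test, remaining = shuffled[:n_test], shuffled[n_test:]

    n_val = round(len(remaining) * p2)
    val, train = remaining[:n_val], remaining[n_val:]
    return train, val, test


train, val, test = two_stage_split(range(100))
print(len(train), len(val), len(test))
```

With 100 profiles and the defaults, 20 go to the test set, 16 of the remaining 80 go to validation, and 64 are left for training, so the validation set is 16% of the original data rather than 20%.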
-
data
A dataframe representing the sample x gene expression matrix.
-
sample_meta
A dataframe representing the per-sample metadata.