Dataset
DeepLincs offers the Dataset class to wrangle L1000 data.
-
class deep_lincs.dataset.Dataset(data, gene_meta, n_genes)
Represents an L1000 Dataset.
Parameters:
- data : dataframe, shape (n_samples, (n_genes + n_metadata_fields))
A sample-by-gene expression matrix padded to the right with per-sample metadata. Generally it is easiest to construct a Dataset from a class method, Dataset.from_yaml() or Dataset.from_dataframes().
- gene_meta : dataframe, shape (n_genes, n_features)
Contains the metadata for each of the genes in the data matrix.
- n_genes : int
Number of genes in the expression matrix. This explicitly defines the column index that divides the expression values from the metadata.
Attributes:
- data : dataframe, shape (n_samples, n_genes)
A dataframe representing the sample x gene expression matrix.
- sample_meta : dataframe, shape (n_samples, n_metadata_features)
A dataframe representing the per-sample metadata.
- gene_meta : dataframe, shape (n_genes, n_gene_features)
Gene metadata. Row index is the same as Dataset.data.columns.
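The role of n_genes as a dividing column index can be illustrated with plain pandas. This is a minimal sketch, not DeepLincs code: the toy dataframe, its column names, and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy input: 3 samples, 2 gene columns, then 2 metadata columns.
padded = pd.DataFrame(
    {
        "gene_a": [0.1, 0.4, 0.9],
        "gene_b": [1.2, 0.7, 0.3],
        "cell_id": ["PC3", "VCAP", "PC3"],
        "pert_type": ["trt_cp", "ctl_vehicle", "trt_cp"],
    },
    index=["s1", "s2", "s3"],
)
n_genes = 2

# Columns to the left of n_genes hold expression values,
# columns from n_genes onward hold per-sample metadata.
expression = padded.iloc[:, :n_genes]
sample_meta = padded.iloc[:, n_genes:]
```

Splitting by position rather than by column name is why n_genes has to be given explicitly.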
-
__init__(self, data, gene_meta, n_genes)
Initialize self. See help(type(self)) for accurate signature.
Methods
__init__(self, data, gene_meta, n_genes): Initialize self.
copy(self): Copies Dataset to a new object
from_yaml(path[, sample_ids, only_landmark]): Dataset constructor method from yaml specification
from_dataframes(data_df, sample_meta_df, …): Dataset constructor method from multiple dataframes
sample_rows(self, size[, replace, meta_groups]): Returns a Dataset of sampled profiles
filter_rows(self, **kwargs): Returns a Dataset of filtered profiles
select_meta(self, meta_fields): Returns a Dataset with select metadata fields
select_samples(self, sample_ids): Returns a Dataset with profiles selected by id
split(self, **kwargs): Returns a tuple of Datasets, split by inclusion criteria
dropna(self, subset[, inplace]): Drops profiles for which there is no metadata in subset
set_categorical(self, meta_field): Sets sample metadata column as categorical
normalize_by_gene(self[, normalizer]): Normalize expression by gene
train_val_test_split(self[, p1, p2]): Splits dataset into training, validation, and test datasets
to_tsv(self, out_dir[, sep, prefix]): Write Dataset object to a tsv file
one_hot_encode(self, meta_field): Return a one-hot vector for a metadata field for all profiles
plot_gene_boxplot(self, identifier[, …]): Returns a boxplot of gene expression, faceted on metadata field
plot_meta_counts(self, meta_field[, …]): Returns a barplot of a metadata field counts in Dataset
-
dropna(self, subset, inplace=False)
Drops profiles for which there is no metadata in subset.
Parameters:
- subset : str or list
Metadata field or fields.
- inplace : bool (optional, default False)
If True, do operation inplace and return None.
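The same idea can be tried on a bare metadata table with pandas. This is a sketch under stated assumptions, not the library's implementation: the toy table and its column names are made up, and pandas' own DataFrame.dropna stands in for the Dataset method.

```python
import pandas as pd

# Hypothetical per-sample metadata with missing entries.
sample_meta = pd.DataFrame(
    {"cell_id": ["PC3", None, "VCAP"], "moa": ["a", "b", None]},
    index=["s1", "s2", "s3"],
)

# Keep only the profiles that have a value in every field of the subset.
kept = sample_meta.dropna(subset=["cell_id"])

# With inplace=True the table itself is modified instead of a new one being returned.
in_place = sample_meta.copy()
in_place.dropna(subset=["cell_id", "moa"], inplace=True)
```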
-
filter_rows(self, **kwargs)
Returns a Dataset of filtered profiles.
Parameters:
- kwargs :
Keyword args to subset data by specific features in sample metadata. Each kwarg must take the following form.
keyword : a column in metadata
arg : a list of values to filter from keyword field.
Returns: Dataset
>>> dataset.filter_rows(cell_id=["VCAP", "PC3"])
>>> dataset.filter_rows(cell_id="VCAP", pert_type=["ctl_vehicle", "trt_cp"])
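Keyword filtering of this kind can be sketched with pandas boolean masks. The helper filter_meta below is hypothetical and is not part of DeepLincs. It assumes that several keyword filters are combined with a logical AND, which the description above does not state, and it accepts a single string as shorthand for a one-element list.

```python
import pandas as pd

# Hypothetical per-sample metadata.
sample_meta = pd.DataFrame(
    {
        "cell_id": ["VCAP", "PC3", "MCF7", "VCAP"],
        "pert_type": ["trt_cp", "trt_cp", "ctl_vehicle", "ctl_vehicle"],
    },
    index=["s1", "s2", "s3", "s4"],
)


def filter_meta(meta, **kwargs):
    """Keep rows whose value in each keyword column is among the given values."""
    mask = pd.Series(True, index=meta.index)
    for column, values in kwargs.items():
        if isinstance(values, str):
            values = [values]  # treat a single string as a one-element list
        mask &= meta[column].isin(values)  # assumed: filters combine with AND
    return meta[mask]


filtered = filter_meta(sample_meta, cell_id=["VCAP", "PC3"], pert_type=["trt_cp"])
```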
-
classmethod from_dataframes(data_df, sample_meta_df, gene_meta_df)
Dataset constructor method from multiple dataframes.
Parameters:
- data_df : dataframe, shape (n_samples, n_genes)
Contains the expression data from the experiment. Must share its row index with sample_meta_df.
- sample_meta_df : dataframe, shape (n_samples, n_meta_features)
Contains the metadata for each of the samples in the experiment.
- gene_meta_df : dataframe, shape (n_genes, n_gene_features)
Contains the metadata for each of the genes in the experiment.
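The shared row index matters because expression values and metadata have to be aligned per sample. The sketch below shows one way such an alignment could be done with pandas. It is an illustration only: the toy dataframes are made up, and joining on the index is an assumption about how the padded matrix described for the class constructor might be assembled.

```python
import pandas as pd

# Hypothetical inputs. The metadata rows are deliberately in a different order.
data_df = pd.DataFrame(
    {"gene_a": [0.1, 0.4], "gene_b": [1.2, 0.7]}, index=["s1", "s2"]
)
sample_meta_df = pd.DataFrame({"cell_id": ["PC3", "VCAP"]}, index=["s2", "s1"])

# Both frames must describe the same samples.
if set(data_df.index) != set(sample_meta_df.index):
    raise ValueError("data_df and sample_meta_df must share a row index")

# Joining on the index aligns each sample with its own metadata row,
# giving an expression matrix padded to the right with metadata.
padded = data_df.join(sample_meta_df)
n_genes = data_df.shape[1]
```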
-
classmethod from_yaml(path, sample_ids=None, only_landmark=True, **filter_kwargs)
Dataset constructor method from yaml specification.
Parameters:
- path : str
Valid string path to a .yaml or .yml file.
- sample_ids : list (optional, default None)
Unique sample ids to read from data and metadata files.
- only_landmark : bool (optional, default True)
If True, parse only the landmark genes; otherwise parse all genes.
- filter_kwargs :
Optional keyword args to subset data by specific features in per-sample metadata. Each kwarg must take the following form.
keyword : a column in metadata
arg : a list of values to filter from keyword field.
Returns: Dataset
>>> Dataset.from_yaml("settings.yaml", cell_id=["MCF7", "PC3"], pert_type=["trt_cp"])
-
normalize_by_gene(self, normalizer='standard_scale')
Normalize expression by gene.
Parameters:
- normalizer : str or func (optional, default 'standard_scale')
Method used to normalize the dataset. Valid str options are 'standard_scale' and 'z_score'. If a function is provided, it must take one argument (array) and return an array of the same dimensions.
Returns: None
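A custom normalizer only has to accept one array and return an array of the same dimensions. The function below is a hypothetical example of such a callable that rescales each gene to the range [0, 1]. It assumes the function receives the expression values as a samples-by-genes NumPy array; the name min_max_by_gene and the toy data are not part of DeepLincs.

```python
import numpy as np


def min_max_by_gene(array):
    """Rescale each gene (column) to the [0, 1] range."""
    array = np.asarray(array, dtype=float)
    col_min = array.min(axis=0)
    col_range = array.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant genes would otherwise divide by zero
    return (array - col_min) / col_range


# Toy expression matrix: 3 samples, 2 genes (the second gene is constant).
expression = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
scaled = min_max_by_gene(expression)
```

Under that assumption, such a function could be passed as normalizer=min_max_by_gene.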
-
one_hot_encode(self, meta_field)
Return a one-hot vector for a metadata field for all profiles.
Parameters:
- meta_field : str
Valid sample metadata column.
Returns:
- one_hot : array, shape (n_samples, n_categories)
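The shape of the returned array can be illustrated with pandas. This is a sketch, not the library's implementation: the toy column is made up, and pandas.get_dummies is used as a stand-in encoder.

```python
import pandas as pd

# Hypothetical metadata column for 4 profiles with 3 distinct categories.
cell_id = pd.Series(["PC3", "VCAP", "PC3", "MCF7"], name="cell_id")

# One column per category, one row per profile, exactly one 1 per row.
one_hot = pd.get_dummies(cell_id).to_numpy(dtype=int)
```

Here one_hot has shape (4, 3), with columns in the sorted category order that get_dummies produces.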
-
plot_gene_boxplot(self, identifier, lookup_col=None, meta_field=None, extent=1.5)
Returns a boxplot of gene expression, faceted on metadata field.
Parameters:
- identifier : str
String identifier for the gene. By default, it should be one of self.gene_meta.index.
- lookup_col : str (optional, default None)
Gene metadata column name. Will be used to look up the identifier param rather than the index.
- meta_field : str (optional, default None)
Sample metadata column name. Will make one boxplot for each metadata category.
- extent : str or float (optional, default 1.5)
Can be either 'min-max', with whiskers covering the entire domain, or a number X, where entries outside X stds are shown as individual points.
Returns: altair.Chart object
>>> dataset.plot_gene_boxplot("Gene A", lookup_col="gene_name", meta_field="cell_id")
>>> dataset.plot_gene_boxplot("5270")  # distribution for gene_id == '5270'
-
plot_meta_counts(self, meta_field, normalize=False, sort_values=True)
Returns a barplot of counts for a metadata field in the Dataset.
Parameters:
- meta_field : str
Valid sample metadata column.
- normalize : bool (optional, default False)
Whether to show counts or normalize to frequencies.
- sort_values : bool (optional, default True)
Whether to sort the bar chart by counts/frequencies.
Returns: altair.Chart object
>>> dataset.plot_meta_counts("cell_id", normalize=True)  # barplot of cell_id frequencies
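The difference between counts and frequencies can be seen on a toy column with pandas. This sketch only computes the numbers that such a barplot would display. It does not draw a chart, and Series.value_counts is an assumed stand-in for whatever the method uses internally.

```python
import pandas as pd

# Hypothetical metadata column for 4 profiles.
cell_id = pd.Series(["PC3", "VCAP", "PC3", "PC3"])

counts = cell_id.value_counts()                      # raw counts, largest first
frequencies = cell_id.value_counts(normalize=True)   # counts divided by the total
```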
-
sample_rows(self, size, replace=False, meta_groups=None)
Returns a Dataset of sampled profiles.
Parameters:
- size : int
Number of profiles to return per metadata grouping. If meta_groups is not provided, profiles are sampled from the whole dataset.
- replace : bool (optional, default False)
Sample with or without replacement.
- meta_groups : str or list (optional, default None)
If provided, equal numbers of profiles are returned for each metadata grouping.
Returns: Dataset
>>> dataset.sample_rows(size=5000, meta_groups="cell_id")  # returns 5000 profiles for each cell_id in dataset
>>> dataset.sample_rows(size=5000, meta_groups=["cell_id", "pert_type"])  # returns 5000 profiles for all groupings of cell_id and pert_type
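Per-group sampling of this kind can be sketched with pandas (version 1.1 or later for GroupBy.sample). The toy metadata table is an assumption, and the call is a stand-in for the Dataset method rather than its actual implementation.

```python
import pandas as pd

# Hypothetical metadata: 5 PC3 profiles and 3 VCAP profiles.
sample_meta = pd.DataFrame(
    {"cell_id": ["PC3"] * 5 + ["VCAP"] * 3},
    index=[f"s{i}" for i in range(8)],
)

# Draw the same number of profiles from every cell_id group, without replacement.
sampled = sample_meta.groupby("cell_id").sample(n=2, replace=False, random_state=0)
```

Without replacement, the requested size cannot exceed the size of the smallest group; pandas raises an error in that case.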
-
select_meta(self, meta_fields)
Returns a Dataset with select metadata fields.
Parameters:
- meta_fields : list
Desired metadata columns.
Returns: Dataset
>>> dataset.select_meta(["cell_id", "pert_id", "moa"])  # returns dataset with only ["cell_id", "pert_id", "moa"] as metadata fields
-
select_samples(self, sample_ids)
Returns a Dataset with profiles selected by id.
Parameters:
- sample_ids : list, character array
Desired sample ids to filter dataset.
Returns: Dataset
-
set_categorical(self, meta_field)
Sets sample metadata column as categorical.
Parameters:
- meta_field : str
Sample metadata column name.
-
split(self, **kwargs)
Returns a tuple of Datasets, split by inclusion criteria.
Parameters:
- kwargs :
Keyword args to subset data by specific features in sample metadata. Each kwarg must take the following form.
keyword : a column in metadata
arg : a str or list of values to filter from keyword field.
Returns: Dataset, Dataset
>>> pc3, not_pc3 = dataset.split(cell_id="PC3")
>>> vcap_mcf7, not_vcap_mcf7 = dataset.split(cell_id=["VCAP", "MCF7"])
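Splitting by an inclusion criterion amounts to applying a boolean mask and its complement. The pandas sketch below illustrates this on a made-up metadata table; it is not the library's code.

```python
import pandas as pd

# Hypothetical per-sample metadata.
sample_meta = pd.DataFrame(
    {"cell_id": ["PC3", "VCAP", "MCF7", "PC3"]},
    index=["s1", "s2", "s3", "s4"],
)

# Rows matching the criterion go to the first part, all others to the second.
mask = sample_meta["cell_id"].isin(["PC3"])
included, excluded = sample_meta[mask], sample_meta[~mask]
```

Every profile ends up in exactly one of the two parts.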
-
to_tsv(self, out_dir, sep='\t', prefix=None, **kwargs)
Write Dataset object to a tsv file.
Parameters:
- out_dir : str
Path to output directory.
- sep : str (optional)
String of length 1. Field delimiter for the output file.
- prefix : str (optional, default None)
Filename prefix.
-
train_val_test_split(self, p1=0.2, p2=0.2)
Splits dataset into training, validation, and test datasets.
Parameters:
- p1 : float (optional, default 0.2)
Test size in first train/test split.
- p2 : float (optional, default 0.2)
Validation size in remaining train/val split.
Returns: tuple of Datasets
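Because p2 applies to what remains after the first split, the validation set is a fraction of the non-test profiles, not of the whole dataset. The two-stage arithmetic can be sketched in plain Python. The helper three_way_split, its seed argument, its rounding, and the order of the returned parts are assumptions for illustration and do not describe the library's implementation.

```python
import random


def three_way_split(ids, p1=0.2, p2=0.2, seed=0):
    """Split ids into train, validation, and test lists in two stages."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)

    # Stage 1: hold out a fraction p1 of all ids as the test set.
    n_test = round(len(ids) * p1)
    test_ids, rest = ids[:n_test], ids[n_test:]

    # Stage 2: hold out a fraction p2 of the remainder as the validation set.
    n_val = round(len(rest) * p2)
    val_ids, train_ids = rest[:n_val], rest[n_val:]
    return train_ids, val_ids, test_ids


train_ids, val_ids, test_ids = three_way_split(range(100))
```

With 100 profiles and the defaults, this gives 20 test profiles, 16 validation profiles (20% of the remaining 80), and 64 training profiles.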
-
data
A dataframe representing the sample x gene expression matrix.
-
sample_meta
A dataframe representing the per-sample metadata.