settings.yaml
The initial barrier in analyzing the LINCS L1000 dataset is the non-standardised structure of the data released to GEO. Although all levels of expression data are available in GCTx format, no metadata other than gene and profile identifiers are included. Instead, all metadata is kept in separate text files.
This decision was likely intended to keep file sizes smaller, avoiding redundancy, but the current structure requires a significant data assembly step in order to filter and subset samples. Cell- and perturbation-specific metadata are also kept in separate files, which require non-trivial key-value lookup operations to populate a per-profile metadata table.
The YAML specification declares the hierarchical, relational structure of the data provided by LINCS. The data field should specify a gene expression matrix and indicate the index names it shares with gene_metadata and sample_metadata. Within each of the metadata fields, a main file should be specified that shares an index with data. Additional lookup data can be specified, in which a lookup_key indicates a column shared with the main data.
This file is parsed via Dataset.from_yaml, which assembles the data automatically in memory.
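The lookup step described above can be sketched with a pandas left join. This is only an illustration of the idea, not the code used by Dataset.from_yaml; the helper name attach_lookup and the toy tables are hypothetical.

```python
import pandas as pd


def attach_lookup(main: pd.DataFrame, lookup: pd.DataFrame, lookup_key: str) -> pd.DataFrame:
    """Left-join `lookup` onto `main` via `lookup_key`, keeping main's index.

    Hypothetical helper for illustration; not part of the documented API.
    """
    index_name = main.index.name
    return (
        main.reset_index()                          # keep the index as a column during the merge
        .merge(lookup, on=lookup_key, how="left")   # one lookup row per matching key
        .set_index(index_name)                      # restore the per-profile index
    )


# Toy stand-ins for inst_info.txt and cell_info.txt (values are made up).
inst_info = pd.DataFrame(
    {"cell_id": ["A375", "MCF7", "A375"]},
    index=pd.Index(["inst_1", "inst_2", "inst_3"], name="inst_id"),
)
cell_info = pd.DataFrame(
    {"cell_id": ["A375", "MCF7"], "primary_site": ["skin", "breast"]}
)

sample_metadata = attach_lookup(inst_info, cell_info, "cell_id")
print(sample_metadata)
```

A left join keeps every profile in the main table even when its key has no match in the lookup file; unmatched profiles simply get missing values in the lookup columns.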
Required fields:
- data_dir: path to directory of LINCS data
- data: main expression dataset
  - gene_index_name: shared index with gene metadata
  - sample_index_name: shared index with sample metadata
- gene_metadata / sample_metadata: gene and sample metadata text file(s)
  - main: required metadata with shared indices
    - name: name of data
    - file: file name
    - index_col: index column shared with data
  - lookup: optional metadata files
    - name: name of data
    - file: file name
    - lookup_key: shared key with main used to merge data
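As a rough illustration of the layout above, the following sketch checks an already-parsed settings dictionary for the listed fields. The function missing_fields and the constant names are hypothetical and not part of the package; the set of required keys is taken only from the list above.

```python
# Field names taken from the list above; the checker itself is a hypothetical sketch.
REQUIRED_TOP = ("data_dir", "data", "gene_metadata", "sample_metadata")
REQUIRED_DATA = ("gene_index_name", "sample_index_name")
REQUIRED_MAIN = ("name", "file", "index_col")
REQUIRED_LOOKUP = ("name", "file", "lookup_key")


def missing_fields(settings: dict) -> list:
    """Return dotted paths of fields that are absent from `settings`."""
    missing = [k for k in REQUIRED_TOP if k not in settings]
    data = settings.get("data", {})
    missing += [f"data.{k}" for k in REQUIRED_DATA if k not in data]
    for section in ("gene_metadata", "sample_metadata"):
        spec = settings.get(section, {})
        main = spec.get("main")
        if main is None:
            missing.append(f"{section}.main")
        else:
            missing += [f"{section}.main.{k}" for k in REQUIRED_MAIN if k not in main]
        # lookup is optional, but each entry that is present needs its own keys
        for i, entry in enumerate(spec.get("lookup", [])):
            missing += [
                f"{section}.lookup[{i}].{k}" for k in REQUIRED_LOOKUP if k not in entry
            ]
    return missing


# A deliberately incomplete settings dictionary.
settings = {
    "data_dir": "data/",
    "data": {"gene_index_name": "gene_id"},
    "gene_metadata": {
        "main": {"name": "gene_info", "file": "gene_info.txt", "index_col": "gene_id"}
    },
    "sample_metadata": {
        "main": {"name": "inst_info", "file": "inst_info.txt"},
        "lookup": [{"name": "cell_info", "file": "cell_info.txt"}],
    },
}
problems = missing_fields(settings)
print(problems)
```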
Additional fields in each file spec are used to specify any necessary keyword arguments (e.g. usecols, sep, na_values) passed to pandas.read_csv to read the metadata.
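One way such pass-through could work is sketched below: the keys that belong to the settings format are set aside and everything else is forwarded to pandas.read_csv. The helper read_metadata, the STRUCTURAL_KEYS set, and the in-memory toy file are assumptions made for illustration, not the package's actual implementation.

```python
import io

import pandas as pd

# Keys assumed to be interpreted by the settings format itself; every other
# key in a file spec is forwarded to pandas.read_csv (hypothetical split).
STRUCTURAL_KEYS = {"name", "file", "lookup_key"}


def read_metadata(spec: dict, source) -> pd.DataFrame:
    """Read one metadata table, forwarding non-structural keys as keyword arguments."""
    read_kwargs = {k: v for k, v in spec.items() if k not in STRUCTURAL_KEYS}
    return pd.read_csv(source, **read_kwargs)


spec = {
    "name": "gene_info",
    "file": "gene_info.txt",
    "index_col": "gene_id",
    "usecols": ["gene_id", "gene_symbol"],
    "na_values": "-666",
    "sep": "\t",
}

# In-memory stand-in for gene_info.txt (values are made up).
toy_file = io.StringIO(
    "gene_id\tgene_symbol\tgene_title\n"
    "101\tGENE_A\ttitle a\n"
    "102\t-666\ttitle b\n"
)
gene_info = read_metadata(spec, toy_file)
print(gene_info)
```

Here index_col is forwarded along with the other keyword arguments because pandas.read_csv accepts it directly, and the "-666" placeholder is read as a missing value.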
An example can be seen below.
# settings.yaml
data_dir: data/
data:
  file: Level3_INF_mlr12k_n1319138x12328.gctx
  gene_index_name: gene_id
  sample_index_name: inst_id
gene_metadata:
  main:
    name: gene_info
    file: gene_info.txt
    index_col: gene_id
    usecols:
      - gene_id
      - gene_symbol
    na_values: "-666"
    sep: "\t"
sample_metadata:
  main:
    name: inst_info
    file: inst_info.txt
    index_col: inst_id
    usecols:
      - inst_id
      - cell_id
    na_values: "-666"
    sep: "\t"
  lookup:
    - name: cell_info
      file: cell_info.txt
      lookup_key: cell_id
      usecols:
        - cell_id
        - cell_type
        - precursor_cell_id
        - sample_type
        - primary_site
        - subtype
      na_values: "-666"
      sep: "\t"
    - name: pert_info
      file: pert_info.txt
      lookup_key: pert_id
      usecols:
        - pert_id
        - pert_type
        - inchi_key_prefix
        - inchi_key
        - canonical_smile
      na_values: "-666"
      sep: "\t"