settings.yaml ============= The initial barrier in analyzing the LINCS L1000 dataset is the non-standardised structure of the data released to GEO. Although all levels of expression data are available in `GCTx format `_, no metadata other than gene and profile identifiers are included. Instead, all metadata is kept in separate text files. This decision was likely intended to keep file sizes smaller, avoiding redundancy, but the current structure requires a significant data assembly step in order to filter and subset samples. Cell- and perturbation-specific metadata are also kept in separate files, which require non-trivial key-value lookup operations to populate a per-profile metadata table. The YAML specification for declares the hierarchical relational structure between the data provided by LINCS. The ``data`` field should specify a gene expression matrix, indicating the shared index names with ``gene_metadata`` and ``sample_metadata``. Within each of the metadata fields, a ``main`` file should be specified which has a shared index with ``data``. Additional ``lookup`` data can be specified in which a ``lookup_key`` indicates a shared column with the ``main`` data. This file is parsed via ``Dataset.from_yaml`` which assembles the data automatically in-memory. Required fields: ---------------- * ``data_dir``: path to directory of LINCS data * ``data``: main expression dataset - ``gene_index_name``: shared index with gene metadata - ``sample_index_name``: shared index with sample metadata * ``gene_metadata``/ ``sample_metadata``: gene and sample metadata text file(s) - ``main``: required metadata with shared indices - ``name``: name of data - ``file``: file name - ``index_col``: index column shared with ``data`` - ``lookup``: optional metadata files - ``name``: name of data - ``file``: file name - ``lookup_key``: shared key with ``main`` used to merge data Additional fields in each file spec are used to specify any necessary keywords arguments (i.e. ``use_cols``, ``sep``, ``na_values``) used by ``pandas.read_csv`` to read the metadata. An example can be seen below. .. code-block:: bash # settings.yaml data_dir: data/ data: file: Level3_INF_mlr12k_n1319138x12328.gctx gene_index_name: gene_id sample_index_name: inst_id gene_metadata: main: name: gene_info file: gene_info.txt index_col: gene_id use_cols: - gene_id - gene_symbol na_values: "-666" sep: "\t" sample_metadata: main: name: inst_info file: inst_info.txt index_col: inst_id usecols: - inst_id - cell_id na_values: "-666" sep: "\t" lookup: - name: cell_info file: cell_info.txt lookup_key: cell_id usecols: - cell_id - cell_type - precursor_cell_id - sample_type - primary_site - subtype na_values: "-666" sep: "\t" - name: pert_info file: pert_info.txt lookup_key: pert_id usecols: - pert_id - pert_type - inchi_key_prefix - inchi_key - canonical_smile na_values: "-666" sep: "\t"