Configuration Format

DNADNA is configured through a few different config files, typically with one config file associated with each stage of the processing pipeline (simulation, pre-processing, training).

This page does not document the format of each config file in detail. These will be documented in the chapters associated with each individual processing stage. Rather, here we document the general structure and features of the configuration format.

Configuration structure

Each DNADNA configuration is in the form of a nested mapping/dictionary/hash table mapping string keywords to values, where each value may itself be a mapping.

It is a JSON-compatible data structure, meaning that it is a nested data structure of those primitive data types (mappings/dicts, arrays, integers, floats, strings, booleans) supported by JSON. Currently, the file itself may be in JSON format (if the filename ends in .json) or in YAML (if the filename ends in .yaml or .yml). YAML is the preferred default format, as it is (arguably) more human-friendly, and supports inline comments.
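As an illustration of the extension-based format selection described above, here is a minimal sketch in Python. The function name load_config is hypothetical and this is not DNADNA's own loader; the YAML branch assumes the third-party PyYAML package is installed.

```python
import json
from pathlib import Path


def load_config(filename):
    """Load a config file as a nested dict, choosing the parser by extension.

    Hypothetical helper for illustration only; not part of DNADNA's API.
    """
    path = Path(filename)
    suffix = path.suffix.lower()
    if suffix == ".json":
        with path.open() as f:
            return json.load(f)
    if suffix in (".yaml", ".yml"):
        # PyYAML is a third-party package (assumed installed); it is imported
        # lazily so that JSON-only use does not require it.
        import yaml
        with path.open() as f:
            return yaml.safe_load(f)
    raise ValueError(f"unsupported config file extension: {suffix!r}")
```

Either branch returns the same kind of object: plain nested dicts, lists, numbers, strings and booleans.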

Several of the configuration formats in DNADNA have top-level key/value pairs in which the value is another mapping; these are referred to in this documentation as “sections”.

For example, the pre-processing and training config files have a section called learned_params, which specifies the parameters of your simulation on which to train your model. An example of this section in JSON format looks like:

{
  "learned_params": {
    "param1": {
      "type": "regression",
      "log_transform": true
    },
    "param2": {
      "type": "classification",
      "classes": 2
    }
  }
}

The equivalent in YAML (which is used throughout the rest of this documentation, and in the default config files) looks like:

learned_params:
    param1:
        type: regression
        log_transform: true
    param2:
        type: classification
        classes: 2

Path resolution in config files

Several options in DNADNA config files take a file or directory name as a value.

For example, the dataset config file, which specifies how DNADNA should load your simulation data, has an option data_root which takes a path to the root directory of your dataset.

This can be specified as an absolute path, but may also be given as a relative path like:

data_root: "."

When DNADNA loads a config file, it interprets relative paths as relative to the directory containing that config file. So if the dataset config file is in the same directory as your simulation data, data_root: "." means that the directory containing the config file is the data_root.

This format may be preferable, as it means you can move your entire dataset around along with the config file, without having to modify any paths in the config file.

To give a concrete example, if you have a directory structure like:

_ /home/username/data/cow_snps
    \_ cow_snps_dataset_config.yml
    |_ scenario_00000/
    |_ scenario_00001/
    ...

Then if the config file cow_snps_dataset_config.yml contains data_root: "." that means /home/username/data/cow_snps is the root of the simulation data.
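The resolution rule for this example can be sketched in Python as follows. The helper resolve_config_path is hypothetical and only illustrates the rule; it is not DNADNA's implementation. It uses PurePosixPath so that no filesystem access is needed.

```python
from pathlib import PurePosixPath


def resolve_config_path(config_file, value):
    """Interpret ``value`` relative to the directory containing ``config_file``.

    Hypothetical helper for illustration only; not part of DNADNA's API.
    """
    path = PurePosixPath(value)
    if path.is_absolute():
        # Absolute paths are used as given.
        return path
    # Relative paths are joined onto the config file's directory.
    return PurePosixPath(config_file).parent / path


config_file = "/home/username/data/cow_snps/cow_snps_dataset_config.yml"
data_root = resolve_config_path(config_file, ".")
print(data_root)  # /home/username/data/cow_snps
```

Because the result depends only on where the config file lives, moving the whole cow_snps directory changes data_root accordingly without editing the file.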

Configuration inheritance

DNADNA has its own system for inheritance of config files, where one file can load a portion of its configuration from another file. This feature is unique to DNADNA and not a feature of YAML or JSON.

If any mapping in a config file, whether at the top-level or more deeply nested, contains the special keyword inherit with a config filename as its value, the contents of the inherited config file are loaded into the section containing inherit.

To give an example, if you have base_params.yml containing:

param1:
    type: regression
    log_transform: true
param2:
    type: classification
    classes: 2

and preprocessing_config.yml in the same directory containing:

learned_params:
    inherit: base_params.yml

then when preprocessing_config.yml is loaded by the software, the “inheritance” is resolved, and the resulting configuration is:

learned_params:
    param1:
        type: regression
        log_transform: true
    param2:
        type: classification
        classes: 2

You will see this in use in some of the configuration files generated by DNADNA. For example, the pre-processing config file contains a dataset section which refers to your dataset config. If you have an existing dataset config file and run dnadna init --dataset-config=my_dataset/my_dataset_config.yml my_model then the generated pre-processing config file will contain:

dataset:
    inherit: ../my_dataset/my_dataset_config.yml

rather than including a verbatim copy of the dataset config file.
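To make the resolution step more concrete, here is a rough sketch of how inherit could be resolved, written in Python. It is not DNADNA's actual code. The function names are hypothetical, and the sketch makes several assumptions: it reads JSON files only (to stay within the standard library), it treats the inherited filename as relative to the inheriting file, it does not descend into lists, and it does not detect inheritance cycles. Keys given alongside inherit simply replace inherited keys here.

```python
import json
import tempfile
from pathlib import Path


def load_config(path):
    """Load a JSON config file and resolve every ``inherit`` key in it."""
    path = Path(path)
    with path.open() as f:
        raw = json.load(f)
    return resolve_inherit(raw, path.parent)


def resolve_inherit(node, base_dir):
    """Recursively replace ``inherit: <file>`` with that file's contents."""
    if not isinstance(node, dict):
        return node  # scalars (and, in this sketch, lists) are returned unchanged
    # Resolve nested mappings first, leaving out the special keyword itself.
    local = {
        key: resolve_inherit(value, base_dir)
        for key, value in node.items()
        if key != "inherit"
    }
    if "inherit" not in node:
        return local
    # Assumption: the inherited filename is relative to the inheriting file.
    inherited = load_config(base_dir / node["inherit"])
    # Simplification: local keys replace inherited ones outright.
    return {**inherited, **local}


# Usage: recreate the base_params / preprocessing_config example as JSON files.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "base_params.json").write_text(json.dumps({
        "param1": {"type": "regression", "log_transform": True},
        "param2": {"type": "classification", "classes": 2},
    }))
    (tmp / "preprocessing_config.json").write_text(json.dumps({
        "learned_params": {"inherit": "base_params.json"},
    }))
    resolved = load_config(tmp / "preprocessing_config.json")

print(json.dumps(resolved, indent=2))
```

The printed result has param1 and param2 nested under learned_params, matching the resolved configuration shown earlier in this section.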

Overriding

When using inherit, it is also possible to extend or even override values loaded from the inherited config file. Using the same base_params.yml example as the previous section, if your pre-processing config file contains:

learned_params:
    inherit: base_params.yml
    param2:
        classes: 3
    param3:
        type: regression

the resulting configuration is:

learned_params:
    param1:
        type: regression
        log_transform: true
    param2:
        type: classification
        classes: 3
    param3:
        type: regression

Note that this merges the additional configuration into the inherited base configuration. So param2 remains a “classification” type parameter, but has its number of classes changed from 2 to 3. A new parameter param3 is added.
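The merge described here behaves like a recursive dictionary merge. The following Python sketch reproduces the example on plain dicts; merge_configs is a hypothetical name and the code only illustrates the behavior, it is not taken from DNADNA.

```python
import copy


def merge_configs(base, override):
    """Return a new dict with ``override`` merged recursively into ``base``."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_configs(merged[key], value)
        else:
            # Otherwise the overriding value wins (or a new key is added).
            merged[key] = copy.deepcopy(value)
    return merged


base_params = {
    "param1": {"type": "regression", "log_transform": True},
    "param2": {"type": "classification", "classes": 2},
}
overrides = {
    "param2": {"classes": 3},
    "param3": {"type": "regression"},
}
learned_params = merge_configs(base_params, overrides)
```

After this runs, learned_params["param2"] still has type "classification" but with classes set to 3, and param3 has been added, while base_params itself is left unmodified.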

Overriding without merging

When one configuration inherits from another, the default behavior is to merge: if both configs contain the same property (e.g. learned_params), and the value of that property is a mapping/dict, then in the resulting configuration those values are merged together rather than one overriding the other.

For example, given defaults.yaml:

dataset_splits:
    training: 0.70
    validation: 0.30

learned_params:
    param1:
        type: regression
        log_transform: true
    param2:
        type: classification
        classes: 2

and preprocessing.yaml:

inherit: defaults.yaml
learned_params:
    param3:
        type: regression

The resulting configuration merges together the two learned_params properties like so:

dataset_splits:
    training: 0.70
    validation: 0.30

learned_params:
    param1:
        type: regression
        log_transform: true
    param2:
        type: classification
        classes: 2
    param3:
        type: regression

However, say we wanted to keep everything else from defaults.yaml, but completely override the learned parameters, like:

inherit: defaults.yaml
learned_params:
    param_A:
        type: regression
    param_B:
        type: regression

This won’t have the desired effect because the default behavior is to also inherit param1 and param2 from the base config. If you want to completely override a value, append an exclamation mark ! to its property name like:

inherit: defaults.yaml
learned_params!:
    param_A:
        type: regression
    param_B:
        type: regression

This has the effect of completely overriding learned_params from the base config, resulting in:

dataset_splits:
    training: 0.70
    validation: 0.30

learned_params:
    param_A:
        type: regression
    param_B:
        type: regression
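The difference between merging and overriding with ! can be sketched in Python as a recursive merge that replaces a value wholesale when its key ends in an exclamation mark. As before, merge_configs is a hypothetical name operating on plain dicts with string keys, and is only an illustration of the rule, not DNADNA's implementation.

```python
import copy


def merge_configs(base, override):
    """Merge ``override`` into ``base``; a trailing ``!`` on a key disables merging."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if key.endswith("!"):
            # "name!" replaces the inherited "name" entirely; the marker
            # itself does not appear in the result.
            merged[key[:-1]] = copy.deepcopy(value)
        elif isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Default behavior: mappings are merged key by key.
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


defaults = {
    "dataset_splits": {"training": 0.70, "validation": 0.30},
    "learned_params": {
        "param1": {"type": "regression", "log_transform": True},
        "param2": {"type": "classification", "classes": 2},
    },
}
new_params = {
    "param_A": {"type": "regression"},
    "param_B": {"type": "regression"},
}

merged_result = merge_configs(defaults, {"learned_params": new_params})
replaced_result = merge_configs(defaults, {"learned_params!": new_params})
```

Here merged_result["learned_params"] contains param1, param2, param_A and param_B, whereas replaced_result["learned_params"] contains only param_A and param_B. In both cases dataset_splits is carried over from the defaults.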

Schemas

All of the config file formats in DNADNA are specified by schemas declared in JSON Schema format. These specify all the optional and required options in each config file, and the accepted types of their values.

This is an implementation detail that most users will not need to bother with, but referencing the schemas can help you better understand the config formats, since the DNADNA software validates your config files against the schemas when it loads them.

The configuration schemas are documented in more detail in the Configuration Schemas section.