Configuration Format
DNADNA is configured through a few different config files, typically with one config file associated with each stage of the processing pipeline (simulation, pre-processing, training).
This page does not document the format of each config file in detail. These will be documented in the chapters associated with each individual processing stage. Rather, here we document the general structure and features of the configuration format.
Configuration structure
Each DNADNA configuration is in the form of a nested mapping/dictionary/hash table mapping string keywords to values, where each value may itself be a mapping.
It is a JSON-compatible data structure, meaning that it is a nested data structure built from the primitive data types supported by JSON (mappings/dicts, arrays, integers, floats, strings, booleans). Currently, the file itself may be in JSON format (if the filename ends in .json) or in YAML (if the filename ends in .yaml or .yml). YAML is the preferred default format, as it is (arguably) more human-friendly, and supports inline comments.
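As an illustration of the extension-based format selection described above, here is a minimal sketch of a loader. It is not DNADNA's actual implementation: the function name load_config is hypothetical, and the YAML branch assumes the third-party PyYAML package is installed.

```python
import json
from pathlib import Path


def load_config(filename):
    """Load a config file, picking the parser from the filename extension.

    Hypothetical helper for illustration; not part of the DNADNA API.
    """
    path = Path(filename)
    suffix = path.suffix
    with path.open() as fobj:
        if suffix == ".json":
            return json.load(fobj)
        if suffix in (".yaml", ".yml"):
            # PyYAML is a third-party package (an assumption of this sketch).
            # It is imported lazily so the JSON branch works without it.
            import yaml
            return yaml.safe_load(fobj)
    raise ValueError(f"unsupported config file extension: {suffix!r}")
```

Either branch returns the same kind of nested dict, so the rest of a program does not need to know which file format was used.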
Several of the configuration formats in DNADNA have top-level key/value pairs in which the value is another mapping; these are referred to in this documentation as “sections”.
For example, the pre-processing and training config files each have a section called learned_params, which specifies the parameters of your simulation on which to train your model. An example of this section in JSON format looks like:
{
    "learned_params": {
        "param1": {
            "type": "regression",
            "log_transform": true
        },
        "param2": {
            "type": "classification",
            "classes": 2
        }
    }
}
The equivalent in YAML (which is used throughout the rest of this documentation, and in the default config files) looks like:
learned_params:
    param1:
        type: regression
        log_transform: true
    param2:
        type: classification
        classes: 2
Path resolution in config files
Several options in DNADNA config files take a file or directory name as a value.
For example, the dataset config file, which specifies how DNADNA should load your simulation data, has an option data_root which takes a path to the root directory of your dataset.
This can be specified as an absolute path, but may also be given as a relative path like:
data_root: "."
When DNADNA loads config files it interprets relative paths as relative to the directory containing the config file. This means that if the dataset config file is in the same directory as your simulation data, the directory containing the config file is the data_root.
This format may be preferable, as it means you can move your entire dataset around along with the config file, without having to modify any paths in the config file.
To give a concrete example, if you have a directory structure like:
/home/username/data/cow_snps
|_ cow_snps_dataset_config.yml
|_ scenario_00000/
|_ scenario_00001/
...
Then if the config file cow_snps_dataset_config.yml contains
data_root: "."
that means /home/username/data/cow_snps is the root of the simulation data.
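The resolution rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: resolve_config_path is a hypothetical helper, not a DNADNA function, and POSIX-style paths are used so that the result does not depend on the platform.

```python
import posixpath
from pathlib import PurePosixPath


def resolve_config_path(config_file, value):
    """Resolve a path option relative to the directory of its config file.

    Hypothetical helper for illustration; not part of the DNADNA API.
    """
    path = PurePosixPath(value)
    if path.is_absolute():
        # Absolute paths are used as given.
        return path
    # Relative paths are anchored at the config file's directory, not at
    # the current working directory of the process.
    base_dir = PurePosixPath(config_file).parent
    return PurePosixPath(posixpath.normpath(base_dir / path))


config_file = "/home/username/data/cow_snps/cow_snps_dataset_config.yml"
print(resolve_config_path(config_file, "."))
# prints /home/username/data/cow_snps
```

Because the result depends only on where the config file lives, moving the whole cow_snps directory changes the resolved data_root accordingly without any edit to the file.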
Configuration inheritance
DNADNA has its own system for inheritance between config files, where one file can load a portion of its configuration from another file. This feature is specific to DNADNA and is not a feature of YAML or JSON.
If any mapping in a config file, whether at the top-level or more deeply nested, contains the special keyword inherit with a config filename as its value, the contents of the inherited config file are loaded into the section containing inherit.
To give an example, if you have base_params.yml containing:
param1:
    type: regression
    log_transform: true
param2:
    type: classification
    classes: 2
and preprocessing_config.yml in the same directory containing:
learned_params:
    inherit: base_params.yml
then when preprocessing_config.yml is loaded by the software, the “inheritance” is resolved, and the resulting configuration is:
learned_params:
    param1:
        type: regression
        log_transform: true
    param2:
        type: classification
        classes: 2
You will see this in use in some of the configuration files generated by DNADNA. For example, the pre-processing config file contains a dataset section which refers to your dataset config. If you have an existing dataset config file and run dnadna init --dataset-config=my_dataset/my_dataset_config.yml my_model then the generated pre-processing config file will contain:
dataset:
    inherit: ../my_dataset/my_dataset_config.yml
rather than including a verbatim copy of the dataset config file.
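To make the resolution step more concrete, here is a rough sketch of how an inherit keyword could be expanded. It is not DNADNA's implementation: load_with_inherit and _expand are hypothetical names, JSON files are used so that only the standard library is needed, and the sketch assumes that an inherited filename is relative to the directory of the file that names it. Keys written next to inherit simply replace inherited keys here; the merging behavior is described under Overriding.

```python
import json
from pathlib import Path


def load_with_inherit(filename):
    """Load a JSON config and expand every ``inherit`` keyword in it.

    Hypothetical helper for illustration; not part of the DNADNA API.
    """
    path = Path(filename)
    with path.open() as fobj:
        config = json.load(fobj)
    return _expand(config, path.parent)


def _expand(node, base_dir):
    """Recursively replace ``inherit`` keys with the referenced file's contents."""
    if isinstance(node, list):
        return [_expand(item, base_dir) for item in node]
    if not isinstance(node, dict):
        return node
    result = {}
    if "inherit" in node:
        # Assumption: the filename is relative to the directory of the
        # config file that contains the ``inherit`` keyword.
        result.update(load_with_inherit(base_dir / node["inherit"]))
    for key, value in node.items():
        if key != "inherit":
            # Simplification: local keys replace inherited ones wholesale.
            result[key] = _expand(value, base_dir)
    return result
```

Given JSON equivalents of the two files above, load_with_inherit("preprocessing_config.json") would return a dict whose learned_params entry holds the contents of the base parameters file, with no inherit key left in it.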
Overriding
When using inherit, it is also possible to extend or even override values loaded from the inherited config file. Using the same base_params.yml example as the previous section, if your pre-processing config file contains:
learned_params:
    inherit: base_params.yml
    param2:
        classes: 3
    param3:
        type: regression
the resulting configuration is:
learned_params:
    param1:
        type: regression
        log_transform: true
    param2:
        type: classification
        classes: 3
    param3:
        type: regression
Note that this merges the additional configuration into the inherited base configuration. So param2 remains a “classification” type parameter, but has its number of classes changed from 2 to 3. A new parameter param3 is added.
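The merge described above behaves like a recursive dictionary merge. The following is a minimal sketch of that idea, not DNADNA's own code; merge_configs is a hypothetical name, and the data mirrors the base_params.yml example.

```python
def merge_configs(base, override):
    """Return a new mapping with ``override`` merged recursively into ``base``.

    Hypothetical helper for illustration; not part of the DNADNA API.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_configs(merged[key], value)
        else:
            # Otherwise the overriding value wins (or a new key is added).
            merged[key] = value
    return merged


base_params = {
    "param1": {"type": "regression", "log_transform": True},
    "param2": {"type": "classification", "classes": 2},
}
overrides = {
    "param2": {"classes": 3},
    "param3": {"type": "regression"},
}
learned_params = merge_configs(base_params, overrides)
print(learned_params["param2"])
# prints {'type': 'classification', 'classes': 3}
```

Only the classes value of param2 is replaced; its type survives because the two param2 mappings are merged rather than swapped.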
Overriding without merging
When one configuration inherits from another, there is a merging behavior: if both configs contain the same property (e.g. learned_params), and the value of that property is a mapping/dict, then in the resulting configuration those values are merged together rather than one overriding the other.
For example, given defaults.yaml:
dataset_splits:
    training: 0.70
    validation: 0.30
learned_params:
    param1:
        type: regression
        log_transform: true
    param2:
        type: classification
        classes: 2
and preprocessing.yaml:
inherit: defaults.yaml
learned_params:
    param3:
        type: regression
The resulting configuration merges together the two learned_params properties like so:
dataset_splits:
    training: 0.70
    validation: 0.30
learned_params:
    param1:
        type: regression
        log_transform: true
    param2:
        type: classification
        classes: 2
    param3:
        type: regression
However, say we wanted to keep everything else from defaults.yaml, but completely override the learned parameters, like:
inherit: defaults.yaml
learned_params:
    param_A:
        type: regression
    param_B:
        type: regression
This won’t have the desired effect because the default behavior is to also inherit param1 and param2 from the base config. If you want to completely override a value, append an exclamation mark ! to its property name like:
inherit: defaults.yaml
learned_params!:
    param_A:
        type: regression
    param_B:
        type: regression
This has the effect of completely overriding learned_params from the base config, resulting in:
dataset_splits:
    training: 0.70
    validation: 0.30
learned_params:
    param_A:
        type: regression
    param_B:
        type: regression
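The difference between merging and overriding with ! can be sketched as follows. This is an illustrative approximation rather than DNADNA's implementation: merge_with_override is a hypothetical name, and the sketch only handles the ! suffix on keys of the overriding mapping.

```python
def merge_with_override(base, override):
    """Merge ``override`` into ``base``; keys ending in "!" replace outright.

    Hypothetical helper for illustration; not part of the DNADNA API.
    """
    merged = dict(base)
    for key, value in override.items():
        if key.endswith("!"):
            # A trailing "!" means replace without merging. The "!" itself
            # is dropped from the key in the resulting configuration.
            merged[key[:-1]] = value
        elif isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_with_override(merged[key], value)
        else:
            merged[key] = value
    return merged


defaults = {
    "dataset_splits": {"training": 0.70, "validation": 0.30},
    "learned_params": {
        "param1": {"type": "regression", "log_transform": True},
        "param2": {"type": "classification", "classes": 2},
    },
}
new_params = {
    "param_A": {"type": "regression"},
    "param_B": {"type": "regression"},
}

merged = merge_with_override(defaults, {"learned_params": new_params})
replaced = merge_with_override(defaults, {"learned_params!": new_params})
print(sorted(merged["learned_params"]))
# prints ['param1', 'param2', 'param_A', 'param_B']
print(sorted(replaced["learned_params"]))
# prints ['param_A', 'param_B']
```

In both cases dataset_splits is carried over unchanged from the defaults; only the treatment of learned_params differs.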
Schemas
All of the config file formats in DNADNA are specified by schemas declared in JSON Schema format. Each schema specifies all the optional and required options in a config file, and the accepted types of their values.
This is an implementation detail that most users will not need to bother with, but referencing the schemas can help to better understand the config formats, since the DNADNA software validates each config file against its schema when loading it.
The configuration schemas are documented in more detail in the Configuration Schemas section.