Usage Overview
There are five main stages in training a model with DNADNA, each of which is run independently, and some of which are optional:
Simulation (optional) – generates a simulated SNP dataset on which to train models.
Training initialization – sets up output directories and example configuration files.
Preprocessing – checks your dataset and filters out scenarios that don’t meet minimal criteria required for your training run (e.g. minimal number of SNPs).
Training – trains a neural net on your data, outputting a PyTorch network with the optimized parameters.
Prediction – applies your trained network to new data in order to make predictions and further evaluate the efficacy of your model.
Each of these steps corresponds to a sub-command of the dnadna command-line interface:
dnadna simulation
dnadna init
dnadna preprocess
dnadna train
dnadna predict
If you already have a dataset in the DNADNA data format, you can skip straight to running dnadna init.
Many of these commands have an associated configuration file allowing extensive customization of their options. For example, the training config file is where you select which network to train, the parameters of that network, and many other details.
Each config file is documented in the corresponding section; for an overview of the configuration file syntax and structure, see the Configuration Format documentation.
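The order of the non-optional training stages can be sketched in Python as a list of command lines. This is only an illustration built from the commands shown in this overview: the helper name pipeline_commands is hypothetical, and the config paths assume the commands are run from the directory that contains the model directory. The config files normally need to be edited by hand between stages, so the sketch only prints the commands rather than running them.

```python
import shlex


def pipeline_commands(model_name, dataset_config=None):
    """Return the init, preprocess and train command lines, in order.

    Hypothetical helper: it only assembles the commands shown in this
    overview and does not call dnadna itself.
    """
    init = ["dnadna", "init"]
    if dataset_config is not None:
        init.append(f"--dataset-config={dataset_config}")
    init.append(model_name)

    # Assumed locations, following the file naming described in this overview.
    preprocessing_config = f"{model_name}/{model_name}_preprocessing_config.yml"
    training_config = f"{model_name}/{model_name}_training_config.yml"

    return [
        init,
        ["dnadna", "preprocess", preprocessing_config],
        ["dnadna", "train", training_config],
    ]


# Print the commands instead of executing them, since each config file
# is meant to be reviewed and edited before the next stage is run.
for argv in pipeline_commands("my_model"):
    print(shlex.join(argv))
```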
Simulation
The “simulation” step is purely optional. Currently it is mostly useful for adapting new or existing simulation code to output data in the DNADNA format. DNADNA has no significant built-in simulation code; instead, simulators are provided as plugins conforming to a simple interface specification.
There is a built-in example simulator in the dnadna.examples.one_event module.
Another use of the simulation interface is as a converter: since the simulator interface outputs data in the DNADNA format, a “simulator” may be written that reads a dataset in another data format and outputs it in the DNADNA format.
More details are provided in the Using and Implementing Simulators documentation.
Initialization
To get started DNADNA needs a few things:
An existing dataset in the DNADNA format on which to train the model.
A dataset config file giving the software additional details of the dataset.
A directory to which files associated with each training run will be output (e.g. the trained model, log files, etc.).
A preprocessing config file to pass to the next stage, Preprocessing.
Item 1 is obtained either by using an existing published dataset (possibly converted to the DNADNA format) or by running a DNADNA Simulator. Running a simulator also outputs a dataset config file that can be used.
The dnadna init command helps with items 2 through 4. It creates a directory for the outputs of your training runs and generates an example preprocessing config file that you can then adapt to the specifics of your dataset and training objectives. It will also output an example dataset config file if you do not already have one.
If you have an existing dataset config, you can pass it to dnadna init as follows:
$ dnadna init --dataset-config=path/to/my_simulation_dataset_config.yml my_model
which would output the file my_model/my_model_preprocessing_config.yml, which can then be further edited by hand.
The model name (my_model in the above example) is used mostly for naming the output directory and config files, and in some log messages.
Otherwise you can run:
$ dnadna init my_model
which outputs my_model/my_model_dataset_config.yml and my_model/my_model_preprocessing_config.yml.
If you would like to create the output directory somewhere other than the current working directory, the last argument to dnadna init is an optional root directory:
$ dnadna init my_model /mnt/nfs/username/models
would output config files to /mnt/nfs/username/models/my_model/.
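The file naming described above can be summarized with a small Python sketch. The function init_output_paths is a hypothetical helper written for this illustration, not part of dnadna; it only mirrors the naming pattern shown in the examples above.

```python
from pathlib import Path


def init_output_paths(model_name, root_dir=".", have_dataset_config=False):
    """Return the config files that dnadna init is described as creating.

    Hypothetical helper: it reproduces the naming pattern from the examples
    above and does not inspect or call dnadna.
    """
    model_dir = Path(root_dir) / model_name
    paths = {
        "preprocessing_config": model_dir / f"{model_name}_preprocessing_config.yml",
    }
    # An example dataset config is only generated when none was supplied.
    if not have_dataset_config:
        paths["dataset_config"] = model_dir / f"{model_name}_dataset_config.yml"
    return paths


for kind, path in init_output_paths("my_model", "/mnt/nfs/username/models").items():
    print(kind, path.as_posix())
```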
Preprocessing
The preprocessing step performs the following:
validating input files and filtering out scenarios that do not match minimal requirements (defined by users)
splitting the dataset into training/validation/test sets (the latter is optional)
applying transformations to target parameter(s) if required by users (e.g. log transformation)
standardizing target parameter(s) for regression tasks (the mean and standard deviation used in standardization are computed based on the training set only).
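The last two steps can be illustrated with a minimal Python sketch of standardization that computes the mean and standard deviation from the training set only and then applies them to every scenario. The function name, the split labels and the use of the population standard deviation are assumptions made for this example; DNADNA's own implementation may differ.

```python
import math
import statistics


def standardize(values, split, log_transform=False):
    """Standardize target values using statistics from the training set only.

    values: one target parameter value per scenario.
    split:  the set each scenario belongs to ("training", "validation", "test").
    Assumption: the population standard deviation is used here.
    """
    if log_transform:
        values = [math.log(v) for v in values]

    # Only training scenarios contribute to the mean and standard deviation,
    # so no information from validation or test scenarios leaks into them.
    train = [v for v, s in zip(values, split) if s == "training"]
    mean = statistics.fmean(train)
    std = statistics.pstdev(train)

    return [(v - mean) / std for v in values]


z = standardize([1.0, 2.0, 3.0, 10.0],
                ["training", "training", "training", "validation"])
print(z)
```

Because the validation value is scaled with the training statistics, it can fall far outside the range of the standardized training values, which is expected.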
Preprocessing is necessary before performing the first training run and should be re-run if and only if one of the following is true:
the dataset changed,
the task changed (e.g. predicting other parameters or the same parameters but with different transformations),
the required input dimensions changed (e.g. to match the dimensions expected by some networks).
At this stage we expect the user to open my_model_preprocessing_config.yml and edit the properties to match the task/network needs in terms of the minimal number of SNPs and individuals required for a dataset to be valid, the names of the evolutionary parameters to be targeted, split proportions, etc. More details are provided in the dedicated preprocessing page.
Once the preprocessing configuration file has been filled in and the required input files have been created, run preprocessing with:
$ dnadna preprocess my_model_preprocessing_config.yml
which outputs my_model/my_model_training_config.yml, my_model/my_model_preprocessed_params.csv and my_model/my_model_preprocessing.log.
The latter is simply a log file. my_model_preprocessed_params.csv is a parameter table similar to my_model_params.csv but with log-transformed (if required) and standardized target parameters, and with an additional column indicating the assignment of each scenario to the training, validation or test set. Note that all replicates of a scenario are assigned to the same set.
my_model/my_model_training_config.yml is described in the next section.
More details on the dedicated preprocessing page.
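Assigning whole scenarios, rather than individual replicates, to the training, validation and test sets can be sketched as follows. This is an illustrative Python example under stated assumptions: the function name, the fraction parameters, the seeded shuffle and the set labels are all hypothetical and do not reflect how DNADNA performs the split internally.

```python
import random


def split_scenarios(scenario_ids, val_fraction=0.2, test_fraction=0.0, seed=0):
    """Map each scenario to "training", "validation" or "test".

    The split is done per scenario, so every replicate of a scenario
    ends up in the same set when it looks up its scenario's label.
    """
    ids = sorted(set(scenario_ids))
    random.Random(seed).shuffle(ids)

    n_val = round(len(ids) * val_fraction)
    n_test = round(len(ids) * test_fraction)

    assignment = {}
    for i, scenario in enumerate(ids):
        if i < n_val:
            assignment[scenario] = "validation"
        elif i < n_val + n_test:
            assignment[scenario] = "test"
        else:
            assignment[scenario] = "training"
    return assignment


# Replicates are (scenario, replicate) pairs; each inherits its scenario's set.
replicates = [(s, r) for s in range(10) for r in range(3)]
assignment = split_scenarios([s for s, _ in replicates], 0.2, 0.1)
labels = [assignment[s] for s, _ in replicates]
```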
Training
We can now proceed to training. Training consists of optimizing the parameters of a statistical model (here, the weights of a network) based on a training dataset and optimization hyperparameters, and evaluating its performance on a validation set.
First edit my_model/my_model_training_config.yml to define, in particular, which network should be trained, its hyperparameters and loss function, the optimization hyperparameters, transformations for data augmentation, etc. More details on the dedicated training page.
Then run:
$ dnadna train my_model_training_config.yml
which creates a subdirectory run_{run_id}/ containing the optimized network my_model_run_{run_id}_best_net.pth as well as checkpoints saved during training, a log file and loss values stored in a tensorboard directory.
dnadna train takes additional arguments such as:
--plugin PLUGIN to pass plugin files that define custom networks, optimizers or transformations that we would like to use for training despite them not being in the original dnadna code. See the dedicated plugin page.
-r RUN_ID or --run-id RUN_ID to specify a run identifier different from the one created by default (the default starts at run_000 and then increases to run_001, etc.). RUN_ID can also be specified in the config file.
--overwrite to overwrite the previous run (otherwise, a new run directory is created).
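The default run naming described above (run_000, then run_001, and so on) can be illustrated with a short Python sketch. The function next_run_id is hypothetical, and taking the highest existing number plus one is an assumption for this example; the text above only states that the default identifier starts at run_000 and increases.

```python
import re


def next_run_id(existing_names):
    """Return the next default-style run directory name.

    Assumption: the next identifier is one more than the highest existing
    run number; names that do not look like run_NNN are ignored.
    """
    pattern = re.compile(r"run_(\d+)")
    used = []
    for name in existing_names:
        match = pattern.fullmatch(name)
        if match:
            used.append(int(match.group(1)))

    number = max(used, default=-1) + 1
    return f"run_{number:03d}"


print(next_run_id([]))
print(next_run_id(["run_000", "run_001", "logs"]))
```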
More details on the dedicated training page.
Prediction
Once trained, a network can be applied to a dataset in DNADNA dataset format to classify/predict its evolutionary parameters. The following command is used:
$ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth realdata/dataset.npz
This uses the best net, but you can pass any saved net, such as run_{run_id}/my_model_run_{run_id}_last_epoch_net.pth.
This outputs the predictions in CSV format, printed to standard output by default while the process runs. You can redirect this to a file using standard shell redirection operators, as in dnadna predict {args} > predictions.csv, or you can specify an output file using the --output option.
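When prediction is driven from a Python script rather than a shell, the same redirection can be done with the standard library. The sketch below is an illustration only: run_to_file and load_predictions are hypothetical helper names, and no assumption is made about the column names of the CSV output, which are read from its header row.

```python
import csv
import subprocess


def run_to_file(argv, output_path):
    """Run a command and write its standard output to output_path."""
    with open(output_path, "w") as out:
        # check=True raises CalledProcessError if the command fails.
        subprocess.run(argv, stdout=out, check=True)


def load_predictions(path):
    """Read a CSV file into a list of dicts keyed by the header row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


# Example usage (requires dnadna and a trained network; the two path
# variables are placeholders to be defined by the caller):
#   run_to_file(["dnadna", "predict", net_path, npz_path], "predictions.csv")
#   rows = load_predictions("predictions.csv")
```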
You can also apply dnadna predict to multiple npz files as follows:
$ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth {dataset_dir}/scenario*/*.npz
where {dataset_dir} is a directory (that you created) containing independent simulations that serve as a test set for all networks or as an illustration of predictive performance under specific conditions.
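The wildcard in the command above is expanded by the shell. If the command is launched from Python without a shell, the pattern has to be expanded explicitly; the following sketch shows one way to do that with pathlib. The helper names are hypothetical and the scenario*/*.npz layout is taken from the example command above.

```python
from pathlib import Path


def collect_npz(dataset_dir):
    """Return the sorted npz files matching {dataset_dir}/scenario*/*.npz."""
    return sorted(str(p) for p in Path(dataset_dir).glob("scenario*/*.npz"))


def predict_command(net_path, dataset_dir):
    """Build the dnadna predict command line for every matching npz file.

    Hypothetical helper: it only assembles the arguments and does not
    run dnadna.
    """
    return ["dnadna", "predict", net_path, *collect_npz(dataset_dir)]
```

The resulting list can be passed to subprocess.run, which does not perform wildcard expansion on its own when given a list of arguments.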
Importantly, if you want to ensure that target examples comply with the preprocessing constraints (such as the minimal number of SNPs and individuals), use --preprocess. In that case, a warning is displayed for each rejected scenario, with the reason for rejection (such as too few SNPs).
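The kind of check this implies can be sketched in Python as a function that compares a scenario's dimensions with the configured minimums and reports why it would be rejected. Everything here is hypothetical: the function name, its parameters and the wording of the messages are made up for illustration and are not DNADNA's actual warnings.

```python
def rejection_reasons(n_snp, n_indiv, min_snp, min_indiv):
    """Return the reasons a scenario fails the minimum-size constraints.

    An empty list means the scenario satisfies both constraints.
    The message wording is illustrative only.
    """
    reasons = []
    if n_snp < min_snp:
        reasons.append(f"{n_snp} SNPs, fewer than the required {min_snp}")
    if n_indiv < min_indiv:
        reasons.append(f"{n_indiv} individuals, fewer than the required {min_indiv}")
    return reasons


for reason in rejection_reasons(n_snp=100, n_indiv=50, min_snp=400, min_indiv=20):
    print("rejected:", reason)
```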
More details on the dedicated prediction page.