Dataset Preprocessing

What does preprocessing do?

The preprocessing step serves several purposes:

  1. Input file validation

    1. It verifies that each input file exists

    2. It discards simulations that don’t have the correct dimensions

  2. Split the dataset into training/validation/test sets (the latter is optional)

  3. Apply transformations to parameters (e.g. log transformation) when required by the user

  4. Standardize target parameter(s) for regression tasks.

This step is necessary before performing the first training run. Depending on the size of the dataset, it can take some time, but it usually needs to be run only once, even across multiple iterations of fine-tuning the hyperparameters in the training step of one or multiple networks. It is necessary to re-run preprocessing on the same dataset only if:

  • The task changes (e.g. predicting other parameters or the same parameters but with different transformations)

  • The required input dimensions change (e.g. to match the dimensions expected by some networks).

How do you configure the preprocessing command?

To run the preprocessing command, a configuration file needs to be filled in. Like the other configuration files, this file has two purposes: specifying the parameters the model will be trained to predict, and specifying how the preprocessing script should filter the dataset.

We will discuss the different sections of the configuration file separately, but they all go into the same file.

Input file validation


The paths of the dataset files are given by the dataset configuration section (present in the simulation config file or in its own config file). Along with checking that the files exist, the preprocessing step discards scenarios in which at least one of the replicated simulations does not have the correct dimensions, as indicated by the min_snp and min_indiv settings in the preprocessing config file.

There are multiple reasons a user might specify these thresholds:

  1. Respecting the constraints of an architecture:

    • If the network’s input has a fixed size across examples, simulations must contain at least min_snp SNPs and min_indiv individuals (and should be truncated to exactly these numbers later using the Crop transform).

    • If the network is invariant to the number of SNPs or individuals, it might still require a minimal input size so that all operations can be correctly applied (such as successive pooling operations that strongly reduce the data size)

  2. Discarding scenarios which are unrealistically small (as determined according to the user’s best judgment).


These thresholds are set in the preprocessing block of the config file, e.g.:

    # minimum number of SNPs each sample should have
    min_snp: 500

    # minimum number of individuals in each sample
    min_indiv: 20
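
To make the role of these two settings concrete, here is a minimal Python sketch of the filtering logic (an illustration, not dnadna's actual implementation; it assumes each replicate is a NumPy array with individuals as rows and SNPs as columns):

    import numpy as np

    def scenario_is_valid(replicates, min_snp=500, min_indiv=20):
        """A scenario is kept only if every replicated simulation has
        at least min_indiv rows (individuals) and min_snp columns (SNPs)."""
        return all(m.shape[0] >= min_indiv and m.shape[1] >= min_snp
                   for m in replicates)

    # One replicate with 25 individuals x 600 SNPs is large enough, but a
    # second one with only 450 SNPs causes the whole scenario to be discarded.
    replicates = [np.zeros((25, 600)), np.zeros((25, 450))]
    print(scenario_is_valid(replicates))  # False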

Splitting the dataset into training/validation/test sets


To avoid overfitting the network to the dataset, the network is trained on a subset of the data (the training set), and is validated against another part of the dataset that the network did not use for training (the validation set).

While developing the network, we adjust the hyperparameters (type of architecture, batch size, learning rate, etc.) until we obtain good performance on the validation set. This can lead to indirectly overfitting the validation set. A good practice is therefore to run the trained and fine-tuned network(s) only once on yet another subset of the data (the test set), at the very end of all experiments, to obtain an unbiased measure of their performance.

Because this test set can be stored elsewhere or simulated later, specifying a test set is optional. Remember that the final test set should not be used for hyperparameter optimization.


The current version does not automatically run the network on the test set; this feature will be added in a future release.


In the dataset_splits block of the preprocessing config file, the proportions of the different sets can be specified. If they sum to less than 1.0 (100%), some simulations are simply not used; they must not sum to more than 1.0:

    # split the dataset into the different subsets; training and
    # validation are mandatory
    training: 0.6
    validation: 0.2
    test: 0.1
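
As an illustration of the semantics (again a sketch, not dnadna's actual code), a proportional random split could look like this; scenarios left over when the proportions sum to less than 1.0 are simply unused:

    import random

    def split_scenarios(scenario_ids, proportions, seed=0):
        """Shuffle scenario ids, then assign them to subsets according to
        the given proportions; any remainder stays unassigned."""
        ids = list(scenario_ids)
        random.Random(seed).shuffle(ids)
        splits, start = {}, 0
        for name, fraction in proportions.items():
            end = start + round(fraction * len(ids))
            splits[name] = ids[start:end]
            start = end
        return splits

    splits = split_scenarios(range(100),
                             {"training": 0.6, "validation": 0.2, "test": 0.1})
    print({name: len(ids) for name, ids in splits.items()})
    # {'training': 60, 'validation': 20, 'test': 10} -- 10 scenarios unused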

Specify learned parameters and apply transformations


At this step we must specify which parameters the network learns and which task is applied to each of them (regression or classification), as well as whether regression parameters should be log-transformed.


In the learned_params block, we have one sub-block per parameter, containing the properties that describe it:

    event_time:
        type: regression
        log_transform: true
    recent_size:
        type: regression
        log_transform: true
    event_size:
        type: regression
        log_transform: true
    selection:
        type: classification
        classes:
            - "no"
            - "yes"

In this example, three demographic parameters (event_time, recent_size and event_size) should be predicted via regression and one parameter (selection) via classification. In the latter case, the classes setting may either be an integer giving the number of classes, or a list of strings giving names to each of the classes.


Currently the class names are not used in any way, and the scenario parameters table is expected to have classes labeled by integers. This will be improved in a future release (see issue #99).
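
In the meantime, a plausible convention (an assumption, not confirmed by dnadna) is that the integer labels in the parameters table correspond to positions in the classes list:

    # classes as listed in the learned_params block above
    classes = ["no", "yes"]

    # hypothetical mapping from class name to the integer label expected
    # in the scenario parameters table
    label_of = {name: index for index, name in enumerate(classes)}
    print(label_of)  # {'no': 0, 'yes': 1}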

What are the input and output files for the preprocessing step?

The required dataset files are:

  • A directory containing the simulation data files

  • A table in CSV format with the different parameters (in columns) used to simulate the different scenarios (in rows)

  • A dataset configuration file, which notably provides the paths to the previously mentioned elements

The additional required input is:

  • A preprocessing configuration file, which contains the parameters specifying what kind of preprocessing is wanted (described in the previous sections, as well as in the schema specifying the file format).

The first three elements are easily obtained when using the simulation sub-command of dnadna; otherwise, they must be created manually.

An example preprocessing config file is generated when running the dnadna init command. See Initialization.

The output files produced by this step are:

  • A table <model_name>_preprocessed_params.csv that contains the target parameters, possibly normalized and/or transformed. It also indicates, for each scenario, to which set (training/validation/test) it belongs

  • The training configuration file, to modify before doing the next step: dnadna train.

The training configuration file will indicate the mean and standard deviation computed over the training set for each standardized target parameter. Those will be used to unstandardize the predicted values during the prediction step.

It will also contain the date and time of when the preprocessing was done, as well as the version of dnadna used.
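
Conceptually, the standardization and its inversion work as in the following sketch (a simplified illustration, not dnadna's actual code; the use of a natural logarithm for the log transform is an assumption):

    import numpy as np

    def standardize(values):
        """Standardize a regression target over the training set and
        return the mean/std needed to invert the transform later."""
        mean, std = float(np.mean(values)), float(np.std(values))
        return (np.asarray(values) - mean) / std, mean, std

    def unstandardize(values, mean, std):
        """Invert the standardization, e.g. on network predictions."""
        return np.asarray(values) * std + mean

    # Hypothetical event_time values: log-transform, then standardize.
    targets = np.array([1532.2, 877.9, 2210.4])
    z, mean, std = standardize(np.log(targets))

    # At prediction time, the stored mean/std (plus the inverse of the
    # log transform) recover values on the original scale.
    recovered = np.exp(unstandardize(z, mean, std))
    print(np.allclose(recovered, targets))  # True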

Command line

Once the preprocessing configuration file has been filled and the required input files are created, the command to start the preprocessing is simply:

dnadna preprocess <model_name>_preprocessing_config.yml

More details can be found in the Quickstart Tutorial.

Next step

After successful pre-processing you can train your network.