Tutorials

Quickstart Tutorial 1

We strongly advise starting with the Quickstart Tutorial, which guides you through the whole pipeline, from the beginning (simulations) to the end (predictions).

Quickstart Tutorial 2: Solve a classification task

Reusing the existing dataset

We reuse the simulation scheme of the first quickstart tutorial, but this time we solve a classification task: predicting whether the population underwent an expansion or a decline of its size. We first simulate more data to help the classifier learn.

The only modification in the simulation config file compared to the first quickstart tutorial is thus:

# my_dataset/my_dataset_simulation_config.yml
data_root: .
n_scenarios: 200
n_replicates: 10
...

This will simulate 2000 populations (200 scenarios × 10 replicates).

From there, we will create a parameter table based on the previous one, in order to label simulations with or without decline. In the previous tutorial, we simulated a population whose size varies once at some point in time. Let’s start from the same directory where we performed the quickstart tutorial:

$ ls -l
my_dataset
my_model

First, we create a new parameter table containing a categorical variable (indicating expansion or decline) based on the previous table. For that, we check whether recent_size is smaller than event_size (which indicates a decline of the population size). We also keep the columns scenario_idx and n_replicates, which are required by dnadna:

$ awk -F"," 'BEGIN{print "scenario_idx,decline,n_replicates"}NR>1{print $1","($5 < $6)","$7}' my_dataset/one_event_params.csv > my_dataset/one_event_params_classif.csv

In my_dataset/one_event_params_classif.csv, the column decline is 1 when there is a decline in population size and 0 otherwise.
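
If you prefer Python to awk, an equivalent sketch with pandas (assuming pandas is installed; the column names are those of one_event_params.csv) would be:

# Sketch: build the classification parameter table with pandas instead of awk
import pandas as pd

params = pd.read_csv("my_dataset/one_event_params.csv")
classif = pd.DataFrame({
    "scenario_idx": params["scenario_idx"],
    # 1 when the recent size is smaller than the ancestral size, i.e. a decline
    "decline": (params["recent_size"] < params["event_size"]).astype(int),
    "n_replicates": params["n_replicates"],
})
classif.to_csv("my_dataset/one_event_params_classif.csv", index=False)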

Second, we create the corresponding dataset configuration file. We (i) copy the previous dataset configuration file:

$ cp my_dataset/my_dataset_simulation_config.yml my_dataset/my_dataset_simulation_config_classif.yml

and (ii) modify scenario_params_path to one_event_params_classif.csv in the file my_dataset_simulation_config_classif.yml.

Initializing the model and defining the task

We initialize a new model folder, in which all configurations, models and experiments made to train networks solving the classification task will be stored:

$ dnadna init --simulation-config=my_dataset/my_dataset_simulation_config_classif.yml model_classification

We edit the preprocessing configuration file to let dnadna know what the task is and which parameter it applies to. The learned_params section becomes:

# model_classification/model_classification_preprocessing_config.yml

# description of the parameters the network will be trained on
learned_params:
    decline:
        type: classification
        classes: 2

We also define a test set in the dataset_splits section. The sum of the sets’ proportions should be smaller than or equal to 1; if it is smaller, only part of the dataset is used.

# model_classification/model_classification_preprocessing_config.yml

dataset_splits:
    # portion of the dataset to use for training
    training: 0.7

    # portion of the dataset to use for validation
    validation: 0.2

    test: 0.1

We now run the preprocessing step:

$ dnadna preprocess model_classification/model_classification_preprocessing_config.yml

Training

After preprocessing, dnadna generates a training configuration file (here automatically named model_classification/model_classification_training_config.yml) and automatically sets the classification loss to Cross Entropy. This can be changed to any other PyTorch loss suitable for classification.

We modify the network to use SPIDNA, whose architecture is adapted to genetic data:

# model_classification/model_classification_training_config.yml

network:
    name: SPIDNA
    # net parameters for CNN
    params:
        n_features: 50
        n_blocks: 6

The training is similar to the regression task (check documentation for all available options):

$ dnadna train model_classification/model_classification_training_config.yml

You can then visualize losses using TensorBoard.

Prediction

To perform prediction on a dataset (i.e., classify each replicate as decline or expansion), the command is:

$ dnadna predict model_classification/run_000/model_classification_run_000_best_net.pth my_dataset/scenario_000/my_dataset_000_*
path,decline_0,decline_1,decline
my_dataset/scenario_000/my_dataset_000_0.npz,0.04335615783929825,0.95664381980896,1
my_dataset/scenario_000/my_dataset_000_1.npz,0.2649589776992798,0.7350410223007202,1
my_dataset/scenario_000/my_dataset_000_2.npz,0.5666247010231018,0.4333752691745758,0
my_dataset/scenario_000/my_dataset_000_3.npz,0.19370636343955994,0.8062936067581177,1
my_dataset/scenario_000/my_dataset_000_4.npz,0.1597474366426468,0.8402525782585144,1
my_dataset/scenario_000/my_dataset_000_5.npz,0.5396643280982971,0.46033570170402527,0
my_dataset/scenario_000/my_dataset_000_6.npz,0.13689160346984863,0.8631084561347961,1
my_dataset/scenario_000/my_dataset_000_7.npz,0.14995048940181732,0.8500494956970215,1
my_dataset/scenario_000/my_dataset_000_8.npz,0.25460949540138245,0.7453904747962952,1
my_dataset/scenario_000/my_dataset_000_9.npz,0.3254058063030243,0.6745941638946533,1

In addition to the predicted class (column decline), dnadna returns the softmax value for each class (decline_0 for expansion, decline_1 for decline). This value can be interpreted as the probability that the input data belongs to a given class. The predicted class is the one with the highest softmax value.
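
As a minimal illustration (not dnadna’s internal code; the logit values below are made up), the predicted class is simply the argmax of the softmax probabilities:

# Illustration only: from raw network outputs (logits) to a predicted class
import torch

logits = torch.tensor([-1.2, 1.9])          # made-up raw outputs for [decline_0, decline_1]
probs = torch.softmax(logits, dim=0)        # ~ tensor([0.0431, 0.9569])
predicted_class = int(torch.argmax(probs))  # 1, i.e. decline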

Using these settings on the test set should lead to an accuracy of about 85%. The next steps to improve performance would be to generate a larger training dataset and to experiment with different networks and/or hyperparameters. The floor is yours!

Quickstart Tutorial 3: Train SPIDNA on an existing dataset

Description of the existing dataset

Now that you are familiar with the basic usage of dnadna (Quickstart Tutorial), we will give a slightly more complex example starting from an already simulated dataset without using the one_event pre-filled template.

First, download and decompress a small dataset available from our datasets’ repo. You can do this from your web browser and decompress the file with your favorite tool, or you can run the following commands in your terminal:

$ wget https://gitlab.inria.fr/ml_genetics/public/datasets/-/raw/main/toy_data.tar.gz
$ tar -xzf toy_data.tar.gz

You can list the directory:

$ ls -l toy_data
scen_00
scen_01
...
scen_19
toy_data_params.csv

And subdirectories:

$ ls -l toy_data/scen_00/
00_0.npz
00_1.npz
00_2.npz

And check the parameters of your simulations:

$ less toy_data/toy_data_params.csv
scenario_idx,mutation_rate,recombination_rate,event_time,recent_size,event_size,n_replicates,n_samples,segment_length
0,1e-08,1e-08,1789,16003,16003,3,50,2000000
1,1e-08,1e-08,181,5811,34937,3,50,2000000
...
19,1e-08,1e-08,392,25102,14392,3,50,2000000

From this, you can see that there are 20 scenarios (0 to 19), each with 3 replicates (independent genomic regions). The mutation and recombination rates, the number of replicates and samples, and the region length have fixed values, while event_time, recent_size and event_size vary and describe the demographic model.
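
If you prefer to check this programmatically, a quick sketch with pandas (assuming pandas is installed) shows how many distinct values each column takes:

# Optional check: columns with a single unique value are fixed, the others vary
import pandas as pd

params = pd.read_csv("toy_data/toy_data_params.csv")
print(params.nunique())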

With this information you can create your dataset config file. Let’s call it toy_data_dataset_config.yml and simply write the following information in it:

# toy_data_dataset_config.yml
# ...
data_root: ./toy_data
dataset_name: toy_data_QS2
scenario_params_path: toy_data/toy_data_params.csv

data_source:
    format: dnadna
    filename_format: "scen_{scenario}/{scenario}_{replicate}.npz"

You can see that filename_format: "scen_{scenario}/{scenario}_{replicate}.npz" is used to match the name formatting of the toy dataset, e.g. scen_00/00_0.npz, which differs from dnadna’s default naming scheme: scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz
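
As a minimal illustration of how such a template expands (the zero-padded indices are written out explicitly here for simplicity):

# The template maps scenario and replicate indices to file paths
fmt = "scen_{scenario}/{scenario}_{replicate}.npz"
print(fmt.format(scenario="00", replicate=0))   # scen_00/00_0.npz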

The data_root parameter needs the path to the data, which is ./toy_data (or any other path to where it was downloaded earlier). The dataset_name parameter names the dataset we are using. It is not used afterwards; it only matters if you use the default filename format (which contains {dataset_name} as a variable).

$ ls -l
toy_data
toy_data_dataset_config.yml
toy_data.tar.gz

You can now initialize a new set of experiments; ideally, it will gather all networks based on the same preprocessing of the data (filtering done once) and solving the same task.

$ dnadna init --dataset-config toy_data_dataset_config.yml toy_task1

This initializes your experiment named toy_task1 in a folder with the same name. This is the model_name that will be used for naming all the logs, config files and trained networks. You can specify where to create this folder by adding another argument after it; the default is the current directory (.).

Task definition and preprocessing

The previous command created the file toy_task1/toy_task1_preprocessing_config.yml in which you will specify the task to be solved. For example, to train regression models predicting the three demographic parameters, replace:

# toy_task1/toy_task1_preprocessing_config.yml
# ...
# description of the parameters the network will be trained on
learned_params:
    param1:
        type: regression
        loss_func: MSE
        loss_weight: 1
        log_transform: false
        tied_to_position: false
    param2:
        type: classification
        classes: 2
        loss_func: Cross Entropy
        loss_weight: 1

with:

# toy_task1/toy_task1_preprocessing_config.yml
# ...
# description of the parameters the network will be trained on
learned_params:
-   event_time:
        type: regression
        log_transform: false
        loss_func: MSE
        loss_weight: 1
        tied_to_position: false
-   recent_size:
        type: regression
        log_transform: true
        loss_func: MSE
        loss_weight: 1
        tied_to_position: false
-   event_size:
        type: regression
        log_transform: true
        loss_func: MSE
        loss_weight: 1
        tied_to_position: false

Note that here we asked for the population sizes to be log-transformed. The targeted parameters will be standardized during preprocessing.
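
Conceptually (this is only an illustration of the idea, not dnadna’s preprocessing code, and the logarithm base is an implementation detail), the transformation applied to such a target amounts to:

# Sketch: log-transform then standardize a target column (values from the params table above)
import numpy as np

recent_size = np.array([16003.0, 5811.0, 25102.0])
log_values = np.log(recent_size)
standardized = (log_values - log_values.mean()) / log_values.std()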

In the same config file you can detail filtering steps and training/validation/test splits, see Data preprocessing documentation for more details.

Now run:

$ dnadna preprocess toy_task1/toy_task1_preprocessing_config.yml

Network definition and training

The previous command produced toy_task1/toy_task1_training_config.yml, which you can edit to specify anything related to training (network, optimizer, …). To train a SPIDNA model, replace the network currently in the config file with:

# toy_task1/toy_task1_training_config.yml
# ...
network:
    name: SPIDNA
    # net parameters
    params:
      n_blocks: 7
      n_features: 50

SPIDNA can handle inputs of varying size. For this demo, however, we will enforce cropping to the first 400 SNPs only; add or update the max_snp parameter of the crop transform in toy_task1/toy_task1_training_config.yml as follows:

# toy_task1/toy_task1_training_config.yml
# ...
dataset_transforms:
-   crop:
        max_snp: 400
        max_indiv: null
        keep_polymorphic_only: true
-   snp_format: concat
-   validate_snp:
        uniform_shape: false

Since we did not enforce min_snp to be 400 in the preprocessing config file, some replicates might have fewer than 400 SNPs. Those are currently padded to reach 400 when creating batches. To avoid this behavior you can set batch_size to 1 (although this will substantially slow down training). In that case, since min_snp was not set, each batch of size 1 might have a different input size, which SPIDNA can handle, contrary to some neural networks (such as a fully connected one).
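
Conceptually, padding a shorter replicate so that it can be stacked into a batch with a longer one looks like the sketch below (dnadna’s actual padding strategy may differ; this only illustrates why replicates with fewer SNPs get padded):

# Sketch: two replicates with different numbers of SNPs must share a shape to form a batch
import torch

a = torch.zeros(50, 380)                                       # 50 haplotypes, 380 SNPs
b = torch.zeros(50, 400)                                       # 50 haplotypes, 400 SNPs
a_padded = torch.nn.functional.pad(a, (0, 400 - a.shape[1]))   # pad the SNP axis up to 400
batch = torch.stack([a_padded, b])                             # shape: (2, 50, 400)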

Finally, you can increase the number of epochs (n_epochs) and the number of batches processed between each validation step (evaluation_interval).

See Model training documentation for more details on available training options (number of epochs, evaluation interval, …).

To train this first model, run:

$ dnadna train toy_task1/toy_task1_training_config.yml

Repeat as many training runs as desired after changing the parameters described in toy_task1_training_config.yml (the full config file is saved within each run directory for reproducibility).

At any step, visualize the training and validation losses with:

$ tensorboard --logdir toy_task1/

To use the trained network on specific datasets, see the Prediction documentation.

Predicting size fluctuations with a pre-trained network (Sanchez et al. 2020)

We provide a notebook reproducing an example of effective population size history inference performed by the SPIDNA deep learning method described in the paper “Deep learning for population size history inference: design, comparison and combination with approximate Bayesian computation” (Sanchez et al. 2020). You should first install dnadna by following the instructions.

In this notebook, we will simulate SNP data for six scenarios with population size histories defined by hand (e.g. expansion, decline or bottleneck) and use a pretrained version of SPIDNA to reconstruct these histories. Warning: this architecture has been trained using data generated with msprime and the priors described in Sanchez et al. (2020) method section. Therefore, using the same architecture to infer the population size histories from datasets falling outside of this prior might lead to high prediction errors.