Tutorials
Quickstart Tutorial 1
We strongly advise starting with the Quickstart Tutorial, which guides you through the whole pipeline, from the beginning (simulations) to the end (predictions).
Quickstart Tutorial 2: Solve a classification task
Reusing the existing dataset
We reuse the simulation scheme of the first quickstart tutorial, but this time we solve a classification task: predicting whether the population underwent an expansion or a decline of its size. We first simulate more data to help the classifier learn.
The only modification in the simulation config file compared to the first quickstart tutorial is thus:
# my_dataset/my_dataset_simulation_config.yml
data_root: .
n_scenarios: 200
n_replicates: 10
...
This will simulate 2000 populations (200 scenarios × 10 replicates).
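After re-running the simulation step from the first tutorial with this config, a quick sanity check (assuming the default scenario_NNN directory layout used throughout this tutorial) is to count the scenario directories, which should yield 200:
$ ls -d my_dataset/scenario_* | wc -l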
From there, we will create a parameter table based on the previous one, in order to label each simulation as decline or no decline. In the previous tutorial, we simulated a population whose size changes once at some point in time. Let’s start from the same directory where we performed the quickstart tutorial:
$ ls -l
my_dataset
my_model
First, we create a new parameter table containing a categorical variable (indicating expansion or decline) based on the previous table. For that, we check whether recent_size is smaller than event_size (thus indicating a decline of the population size). We also keep the columns scenario_idx and n_replicates that are required by dnadna:
$ awk -F"," 'BEGIN{print "scenario_idx,decline,n_replicates"}NR>1{print $1","($5 < $6)","$7}' my_dataset/one_event_params.csv > my_dataset/one_event_params_classif.csv
In my_dataset/one_event_params_classif.csv, the column decline is 1 when there is a decline in population size and 0 otherwise.
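Before training, it is worth checking that the two classes are reasonably balanced; for instance, this awk one-liner counts how many scenarios fall in each class:
$ awk -F"," 'NR>1{count[$2]++} END{for (c in count) print "decline=" c ": " count[c] " scenarios"}' my_dataset/one_event_params_classif.csv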
Second, we create the corresponding dataset configuration file. We (i) copy the previous dataset configuration file:
$ cp my_dataset/my_dataset_simulation_config.yml my_dataset/my_dataset_simulation_config_classif.yml
and (ii) modify scenario_params_path to point to one_event_params_classif.csv in the file my_dataset_simulation_config_classif.yml.
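If you prefer not to edit the file by hand, a GNU sed one-liner along these lines should do it, assuming the config still references the original one_event_params.csv file name:
$ sed -i 's|one_event_params\.csv|one_event_params_classif.csv|' my_dataset/my_dataset_simulation_config_classif.yml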
Initializing the model and defining the task
We initialize a new model folder, in which all configurations, models and experiments used to train networks for the classification task will be stored:
dnadna init --simulation-config=my_dataset/my_dataset_simulation_config_classif.yml model_classification
We edit the preprocessing configuration file to tell dnadna what the task is and which parameter it applies to. The learned_params section becomes:
# model_classification/model_classification_preprocessing_config.yml
# description of the parameters the network will be trained on
learned_params:
  decline:
    type: classification
    classes: 2
We also create a test set in the dataset_splits section. The sum of the sets’ proportions must be less than or equal to 1; if it is smaller, only a subset of the dataset is used.
# model_classification/model_classification_preprocessing_config.yml
dataset_splits:
  # portion of the dataset to use for training
  training: 0.7
  # portion of the dataset to use for validation
  validation: 0.2
  test: 0.1
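As an illustration of using only part of the data, a split summing to less than 1, such as the purely illustrative values below, would restrict the run to a tenth of the simulated dataset (handy for a quick dry run):
dataset_splits:
  training: 0.07
  validation: 0.02
  test: 0.01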
We now run the preprocessing step:
dnadna preprocess model_classification/model_classification_preprocessing_config.yml
Training
After preprocessing, dnadna generates a training configuration file (here automatically named model_classification/model_classification_training_config.yml) and automatically sets the classification loss to Cross Entropy. This can be modified to any other PyTorch loss suitable for classification.
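The loss is attached to the learned parameter; as a sketch of what the relevant excerpt might look like (key names taken from the other examples in this page, but the exact layout of your generated file may differ):
# model_classification/model_classification_training_config.yml (excerpt)
learned_params:
  decline:
    type: classification
    classes: 2
    loss_func: Cross Entropy  # replace with another PyTorch classification loss if desired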
We modify the network to use SPIDNA, whose architecture is adapted to genetic data:
# model_classification/model_classification_training_config.yml
network:
  name: SPIDNA
  # net parameters for CNN
  params:
    n_features: 50
    n_blocks: 6
The training is similar to the regression task (check documentation for all available options):
dnadna train model_classification/model_classification_training_config.yml
You can then visualize losses using TensorBoard.
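For example, pointing TensorBoard at the model folder (as also shown in the next tutorial) displays the training and validation curves:
$ tensorboard --logdir model_classification/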
Prediction
To perform prediction on a dataset (i.e. classify replicates as decline or expansion), run the following command, which prints its predictions as CSV:
dnadna predict model_classification/run_000/model_classification_run_000_best_net.pth my_dataset/scenario_000/my_dataset_000_*
path,decline_0,decline_1,decline
my_dataset/scenario_000/my_dataset_000_0.npz,0.04335615783929825,0.95664381980896,1
my_dataset/scenario_000/my_dataset_000_1.npz,0.2649589776992798,0.7350410223007202,1
my_dataset/scenario_000/my_dataset_000_2.npz,0.5666247010231018,0.4333752691745758,0
my_dataset/scenario_000/my_dataset_000_3.npz,0.19370636343955994,0.8062936067581177,1
my_dataset/scenario_000/my_dataset_000_4.npz,0.1597474366426468,0.8402525782585144,1
my_dataset/scenario_000/my_dataset_000_5.npz,0.5396643280982971,0.46033570170402527,0
my_dataset/scenario_000/my_dataset_000_6.npz,0.13689160346984863,0.8631084561347961,1
my_dataset/scenario_000/my_dataset_000_7.npz,0.14995048940181732,0.8500494956970215,1
my_dataset/scenario_000/my_dataset_000_8.npz,0.25460949540138245,0.7453904747962952,1
my_dataset/scenario_000/my_dataset_000_9.npz,0.3254058063030243,0.6745941638946533,1
In addition to the predicted class (column decline), dnadna returns the softmax value for each class (decline_0 for expansion and decline_1 for decline). This value is the probability that the input data belongs to a given class. The predicted class is the one with the highest softmax value.
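For example, redirecting the predictions to a file (hypothetical name below) lets you tally the decline column with awk to get a majority vote across the replicates of a scenario:
$ dnadna predict model_classification/run_000/model_classification_run_000_best_net.pth my_dataset/scenario_000/my_dataset_000_* > predictions_scenario_000.csv
$ awk -F"," 'NR>1{votes[$4]++} END{for (c in votes) print "decline=" c ": " votes[c] " replicates"}' predictions_scenario_000.csv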
With these settings, you should reach an accuracy of about 85% on the test set. The next steps to improve performance are to generate a larger training dataset and to experiment with different networks and/or hyperparameters. The floor is yours!
Quickstart Tutorial 3: Train SPIDNA on an existing dataset
Description of the existing dataset
Now that you are familiar with the basic usage of dnadna (Quickstart Tutorial), we will give a slightly more complex example starting from an already simulated dataset, without using the one_event pre-filled template.
First, download and decompress a small dataset available from our datasets repository. You can do this from your web browser and decompress the file with your favorite tool, or you can run the following commands in your terminal:
$ wget https://gitlab.inria.fr/ml_genetics/public/datasets/-/raw/main/toy_data.tar.gz
$ tar -xzf toy_data.tar.gz
You can list the directory:
$ ls -l toy_data
scen_00
scen_01
...
scen_19
toy_data_params.csv
And subdirectories:
$ ls -l toy_data/scen_00/
00_0.npz
00_1.npz
00_2.npz
And check the parameters of your simulations:
$ less toy_data/toy_data_params.csv
scenario_idx,mutation_rate,recombination_rate,event_time,recent_size,event_size,n_replicates,n_samples,segment_length
0,1e-08,1e-08,1789,16003,16003,3,50,2000000
1,1e-08,1e-08,181,5811,34937,3,50,2000000
...
19,1e-08,1e-08,392,25102,14392,3,50,2000000
From this it seems that you have 20 scenarios (0 to 19), each with 3 replicates (independent genomic regions). The mutation and recombination rates, the number of replicates and samples, and the region length have fixed values, while event_time, recent_size and event_size vary and describe the demographic model.
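You can verify these counts directly from the files, for example by counting the data rows of the parameter table and the replicates of the first scenario:
$ awk -F"," 'NR>1' toy_data/toy_data_params.csv | wc -l
20
$ ls toy_data/scen_00 | wc -l
3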
With this information you can create your dataset config file. Let’s call it toy_data_dataset_config.yml and simply write the following information in it:
# toy_data_dataset_config.yml
# ...
data_root: ./toy_data
dataset_name: toy_data_QS2
scenario_params_path: toy_data/toy_data_params.csv
data_source:
  format: dnadna
  filename_format: "scen_{scenario}/{scenario}_{replicate}.npz"
You can see that filename_format: "scen_{scenario}/{scenario}_{replicate}.npz" is used to match the name formatting of the toy dataset, e.g. scen_00/00_0.npz, which differs from dnadna’s default naming scheme scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz. Second, data_root needs the path to the data, which is ./toy_data (or any other path to where it was downloaded earlier). The dataset_name parameter describes the dataset we are using; it won’t be used afterwards and is only important if you use the default filename format (which contains {dataset_name} as a variable).
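As a quick sanity check that the pattern matches the on-disk layout, you can confirm that a file exists at the path it predicts for scenario 00, replicate 0:
$ ls toy_data/scen_00/00_0.npz
toy_data/scen_00/00_0.npz
Your working directory should now contain: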
$ ls -l
toy_data
toy_data_dataset_config.yml
toy_data.tar.gz
You can now initialize a new set of experiments; ideally, one experiment folder gathers all networks that rely on the same data preprocessing (filtering done once) and solve the same task.
$ dnadna init --dataset-config toy_data_dataset_config.yml toy_task1
This initializes your experiment named toy_task1 in a folder with the same name. It is the model_name that will be used for naming all the logs, config files and trained networks. You can specify where to create this folder by adding another parameter after it; the default is the current directory (.).
Task definition and preprocessing
The previous command created the file toy_task1/toy_task1_preprocessing_config.yml, in which you will specify the task to be solved. For example, to train regression models predicting the three demographic parameters, replace:
# toy_task1/toy_task1_preprocessing_config.yml
# ...
# description of the parameters the network will be trained on
learned_params:
  param1:
    type: regression
    loss_func: MSE
    loss_weight: 1
    log_transform: false
    tied_to_position: false
  param2:
    type: classification
    classes: 2
    loss_func: Cross Entropy
    loss_weight: 1
with:
# toy_task1/toy_task1_preprocessing_config.yml
# ...
# description of the parameters the network will be trained on
learned_params:
  - event_time:
      type: regression
      log_transform: false
      loss_func: MSE
      loss_weight: 1
      tied_to_position: false
  - recent_size:
      type: regression
      log_transform: true
      loss_func: MSE
      loss_weight: 1
      tied_to_position: false
  - event_size:
      type: regression
      log_transform: true
      loss_func: MSE
      loss_weight: 1
      tied_to_position: false
Note that here we asked for the population sizes to be log-transformed. The targeted parameters will be standardized during preprocessing.
In the same config file you can detail filtering steps and training/validation/test splits; see the Data preprocessing documentation for more details.
Now run:
$ dnadna preprocess toy_task1/toy_task1_preprocessing_config.yml
Network definition and training
The previous command generated toy_task1/toy_task1_training_config.yml, which you can edit to specify anything related to training (network, optimizer, …). To train a SPIDNA model, replace the network currently in the config file with:
# toy_task1/toy_task1_training_config.yml
# ...
network:
  name: SPIDNA
  # net parameters
  params:
    n_blocks: 7
    n_features: 50
SPIDNA can handle batches of varying size. For this demo, however, we will enforce cropping to the first 400 SNPs only; add or update the max_snp parameter of the crop function in toy_task1/toy_task1_training_config.yml as follows:
# toy_task1/toy_task1_training_config.yml
# ...
dataset_transforms:
  - crop:
      max_snp: 400
      max_indiv: null
      keep_polymorphic_only: true
  - snp_format: concat
  - validate_snp:
      uniform_shape: false
Since we did not enforce min_snp to be 400 in the preprocessing config file, some replicates might have fewer than 400 SNPs. Those are currently padded to reach 400 when creating batches. To avoid this behavior you can set batch_size to 1 (although this will substantially slow down training). In that case, since min_snp was not set, each batch of size 1 might have a different input size, which SPIDNA can handle, contrary to some neural networks (such as a completely fully connected one).
Finally, you can increase the number of epochs (n_epochs) and the number of batches processed between each validation step (evaluation_interval). See the Model training documentation for more details on available training options (number of epochs, evaluation interval, …).
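These settings live in the training configuration file; as a rough sketch (illustrative values, and the exact placement of the keys may differ in your generated file):
# toy_task1/toy_task1_training_config.yml (excerpt)
batch_size: 1             # avoids padding replicates that have fewer than 400 SNPs
n_epochs: 10              # illustrative value; increase for longer training
evaluation_interval: 100  # batches processed between validation steps (illustrative value)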
To train this first model, run:
$ dnadna train toy_task1/toy_task1_training_config.yml
Repeat as many training runs as desired after changing the parameters described in toy_task1_training_config.yml (the full config file is saved within each run directory for reproducibility).
At any step, visualize the training and validation losses with:
$ tensorboard --logdir toy_task1/
To use the trained network on specific datasets, see the Prediction documentation.
Predicting size fluctuations with a pre-trained network (Sanchez et al. 2020)
We provide a notebook reproducing an example of effective population size history inference performed by the SPIDNA deep learning method described in the paper “Deep learning for population size history inference: design, comparison and combination with approximate Bayesian computation” (Sanchez et al. 2020). You should first install dnadna by following the instructions.
In this notebook, we will simulate SNP data for six scenarios with population size histories defined by hand (e.g. expansion, decline or bottleneck) and use a pretrained version of SPIDNA to reconstruct these histories. Warning: this architecture was trained on data generated with msprime under the priors described in the methods section of Sanchez et al. (2020). Therefore, using it to infer population size histories from datasets falling outside of this prior might lead to high prediction errors.