Prediction
What does prediction do?
Once trained, a network can be applied (through a simple forward pass) to other datasets, such as:
a test set, after hyperparameter optimization has been done for all networks. This allows a fair comparison of multiple networks and a check on whether they overfitted the validation set,
specific examples, to evaluate predictive performance on specific scenarios or robustness under specific conditions (such as new data under selection while selection was absent from the training set),
real datasets, to reconstruct the past evolutionary history of real populations.
How do you configure the predict command?
Basic usage
Single file
The required arguments for dnadna predict are:
MODEL: most commonly a path to a .pth file, such as run_{runid}/my_model_run_{runid}_best_net.pth, that contains the trained network we wish to use and additional information (such as the data transformations that should be applied beforehand and the information needed to unstandardize and/or “untransform” the predicted parameters). Alternatively, the final config file of a run, run_{runid}/my_model_run_{runid}_final_config.yml, can be passed (in which case the best network of the given run is used by default).
INPUT: path to one or more npz files, or to a dataset config file (describing a whole dataset).
A typical usage will thus be:
$ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth realdata/sample.npz
to classify/predict evolutionary parameters for a single data sample realdata/sample.npz in DNADNA dataset format.
This will use the best net, but you can use any net name, such as run_{run_id}/my_model_run_{run_id}_last_epoch_net.pth.
Predictions are output in CSV format and printed to standard output by default while the process runs. You can specify a file to write to using the --output option.
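For instance (reusing the model and sample paths from the example above, with an arbitrary output file name), the predictions can be written directly to a CSV file:
$ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth realdata/sample.npz --output predictions.csv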
Multiple files
You can also apply dnadna predict to multiple npz files as follows:
$ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth {extra_dir_name}/*/*.npz
where {extra_dir_name} is a directory (that you created) containing independent simulations, which will serve as a test set for all networks or as an illustration of predictive performance under specific conditions.
Config versus .pth files
The previous command is equivalent to:
$ dnadna predict run_{run_id}/my_model_run_{run_id}_final_config.yml {extra_dir_name}/*/*.npz
where the training config file is passed rather than the .pth of the best network.
You can add the option --checkpoint last_epoch to use the network at the final stage of training rather than the best one.
$ dnadna predict run_{run_id}/my_model_run_{run_id}_final_config.yml {extra_dir_name}/*/*.npz --checkpoint last_epoch
Preprocessing
Importantly, if you want to ensure that target examples comply with the preprocessing constraints (such as the minimum number of SNPs and individuals), use --preprocess. In that case, a warning will be displayed for each rejected scenario, along with the reason for rejection (such as an insufficient number of SNPs).
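For example (reusing the directory layout from the commands above), the preprocessing checks can be enabled as follows:
$ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth {extra_dir_name}/*/*.npz --preprocess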
Computing resources
Fine-tune resource usage with the options --gpus GPUS and --loader-num-workers LOADER_NUM_WORKERS to indicate the specific GPUs and the number of CPU workers to use. Display a progress bar with the option --progress-bar.
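As an illustration, these options can be combined in a single call; the GPU index and worker count below are arbitrary placeholders, and the exact argument format expected by --gpus may differ (check the command-line help):
$ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth {extra_dir_name}/*/*.npz --gpus 0 --loader-num-workers 4 --progress-bar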
Data transformations
By default, the data transformations applied at the prediction step are the same as the ones applied to the training set (defined at training time). However, this behaviour can be changed. Check which transformation lists are available using the option -t show:
$ dnadna predict XXX_final_config.yml sample.npz -t show
or, equivalently,
$ dnadna predict XXX_best_net.pth sample.npz -t show
and then pass the desired transform list name on the command line, such as -t validation (resp. -t test) to apply the same transformations as for the validation (resp. test) set. Use -t TRANSFORM_LIST_NAME if you defined additional lists in your training config file.
Use -t no to apply no transformation at the prediction step.
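For instance, to apply the validation-set transformations to a single sample (reusing the model and sample paths from the earlier example):
$ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth realdata/sample.npz -t validation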
Defining new transformation lists
You can define other lists of transformations than the ones defined at training time. For this:
1. create a predict config file in which you define a predict_transforms:
block (similar to dataset_transforms blocks in
training config files, except that all lists must have a name, and this name cannot match ‘*training*’ or
‘*validation*’)
2. give the file path and transform list name as arguments to dnadna predict
using --config PREDICT_CONFIG_PATH -t TRANSFORM_LIST_NAME
.
Example
Create a file XXX_predict_config.yaml (preferably in the same directory as XXX_training_config.yaml, so that you can easily reuse it for multiple runs). In XXX_predict_config.yaml, define one or multiple transform lists:
predict_transforms:
  real_to_spidna:
    # Subsample randomly 50 individuals out of the 'num_sample' simulated
    - subsample:
        size: 50
    # Then crop the matrices to the first 400 SNPs:
    - crop:
        max_snp: 400
        max_indiv: null
        keep_polymorphic_only: true
  my_predict_2:
    # Crop the matrices to the first 30 haploid individuals and 400 SNPs:
    - crop:
        max_snp: 400
        max_indiv: 30
        keep_polymorphic_only: true
Here we imagine that the new datasets (for example real data) have a larger number of individuals than what the network was trained on; so `real_to_spidna` randomly picks 50 haploids and then crops to the first 400 SNPs. On the other hand, my_predict_2 is meant to test the robustness of the network (using simulated data) on a different sample size than the one used at training, so it simply crops to the first 30 haploids and 400 SNPs.
Check the available transform lists (this shows the ones defined in the training and predict config files):
$ dnadna predict XXX_best_net.pth {real_dir}/*.npz --config XXX_predict_config.yaml -t show
Perform prediction:
$ dnadna predict XXX_best_net.pth {real_dir}/*.npz --config XXX_predict_config.yaml -t real_to_spidna --progress-bar -o results.csv
What are the output files for the predict step?
dnadna predict outputs the predictions in CSV format; they are printed to standard output by default while the process runs. You can redirect this to a file using standard shell redirection, e.g. dnadna predict {args} > predictions.csv, or you can specify a file to write to using the --output option.