Prediction

What does prediction do?

Once trained, a network can be applied (through a simple forward pass) to other datasets, such as:

  • a test set, after hyperparameter optimization has been done for all networks. This allows a fair comparison of multiple networks and a check of whether they overfitted the validation set,

  • specific examples, to evaluate predictive performance on specific scenarios or the robustness under specific conditions (such as new data under selection while selection was absent from the training set),

  • real datasets to reconstruct the past evolutionary history of real populations.

How do you configure the predict command?

Basic usage

Single file

The required arguments for dnadna predict are:

  • MODEL: most commonly a path to a .pth file, such as run_{runid}/my_model_run_{runid}_best_net.pth, that contains the trained network we wish to use and additional information (such as data transformation that should be applied beforehand and info to unstandardize and/or “untransform” the predicted parameters). Alternatively the final config file of a run run_{runid}/my_model_run_{runid}_final_config.yml can be passed (in which case the best network of the given run is used by default).

  • INPUT: path to one or more npz files, or to a dataset config file (describing a whole dataset).

A typical usage will thus be:

$ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth realdata/sample.npz

to classify/predict evolutionary parameters for a single data sample realdata/sample.npz in DNADNA dataset format.

This will use the best net, but you can use any net name, such as run_{run_id}/my_model_run_{run_id}_last_epoch_net.pth.

Predictions are output in CSV format and printed to standard output by default while the process runs. You can write them to a file instead with the --output option.

Multiple files

You can also apply dnadna predict to multiple npz files as follows:

$ dnadna predict run_{run_id}/my_model_run_{run_id}_best_net.pth {extra_dir_name}/*/*.npz

where {extra_dir_name} is a directory (that you created) containing independent simulations, which will serve as a test set for all networks or as an illustration of predictive performance under specific conditions.
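When the command is launched from a script rather than from an interactive shell, the wildcard is not expanded automatically. The following Python sketch expands the pattern itself and assembles the argument list. The helper name build_predict_command and the example paths are hypothetical (they are not part of DNADNA), and the only option used is --output, described above.

```python
import glob


def build_predict_command(model, pattern, output=None):
    """Assemble the argument list for `dnadna predict` on several npz files.

    `model` is the path to a .pth file (or a final config file) and `pattern`
    is a glob pattern such as "extra_dir/*/*.npz". Hypothetical helper, not
    part of DNADNA.
    """
    # Expand the wildcard ourselves (no shell does it for us when the list is
    # passed to subprocess.run) and sort for a reproducible order.
    inputs = sorted(glob.glob(pattern))
    if not inputs:
        raise FileNotFoundError(f"no npz file matches {pattern!r}")

    command = ["dnadna", "predict", model, *inputs]
    if output is not None:
        # --output writes the CSV predictions to a file instead of stdout.
        command += ["--output", output]
    return command
```

The returned list can then be passed to subprocess.run(command, check=True), which avoids quoting problems with unusual file names.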

Config versus .pth files

The previous command is equivalent to:

$ dnadna predict run_{run_id}/my_model_run_{run_id}_final_config.yml {extra_dir_name}/scenario*/*.npz

where the training config file is passed rather than the .pth of the best network.

You can add the option --checkpoint last_epoch to use the network at the final stage of training rather than the best one:

$ dnadna predict run_{run_id}/my_model_run_{run_id}_final_config.yml {extra_dir_name}/*/*.npz --checkpoint last_epoch

Preprocessing

Importantly, if you want to ensure that target examples comply with the preprocessing constraints (such as the minimal number of SNPs and individuals), use --preprocess. In that case, a warning is displayed for each rejected scenario, with the reason for rejection (such as too few SNPs).

Computing resources

  • Fine-tune resource usage with the options --gpus GPUS and --loader-num-workers LOADER_NUM_WORKERS to indicate the specific GPUs and the number of CPUs to use.

  • Display a progress bar with the option --progress-bar.

Data transformations

By default, the data transformations applied at the prediction step are the same as the ones applied to the training set (defined at training time). However, this behaviour can be changed. Check which transformation lists are available using the option -t show:

$ dnadna predict XXX_final_config.yml sample.npz -t show

or, equivalently,

$ dnadna predict XXX_best_net.pth sample.npz -t show

and then pass the desired transform list name on the command line, such as -t validation (resp. -t test) to apply the same transformations as for the validation (resp. test) set. Use -t TRANSFORM_LIST_NAME if you defined additional lists in your training config file.

Use -t no to apply no transformation at prediction step.

Defining new transformation lists

You can define lists of transformations other than the ones defined at training time. To do so:

1. create a predict config file in which you define a predict_transforms: block (similar to dataset_transforms blocks in training config files, except that all lists must have a name, and this name cannot match ‘*training*’ or ‘*validation*’)

2. give the file path and transform list name as arguments to dnadna predict using --config PREDICT_CONFIG_PATH -t TRANSFORM_LIST_NAME.

Example

  • Create a file XXX_predict_config.yaml (preferably in the same directory as the XXX_training_config.yaml, so that you can easily reuse it for multiple runs).

  • In XXX_predict_config.yaml define one or multiple transform lists:

predict_transforms:
    real_to_spidna:
        # Randomly subsample 50 individuals out of the 'num_sample' simulated ones
        - subsample:
            size: 50
        # Then crop the matrices to the first 400 SNPs:
        - crop:
            max_snp: 400
            max_indiv: null
            keep_polymorphic_only: true

    my_predict_2:
        # Crop the matrices to the first 30 haploid individuals and 400 SNPs:
        - crop:
            max_snp: 400
            max_indiv: 30
            keep_polymorphic_only: true

Here we imagine that the new datasets (for example real data) have a larger number of individuals than what the network was trained on, so real_to_spidna randomly picks 50 haploids and then crops to the first 400 SNPs. On the other hand, my_predict_2 is meant to test the robustness of the network (using simulated data) on sample sizes different from the one used at training, so it simply crops to the first 30 haploids and 400 SNPs.

  • Check the available transform lists (shows the ones defined in the training and predict config files):

$ dnadna predict XXX_best_net.pth {real_dir}/*.npz --config XXX_predict_config.yaml -t show

  • Perform prediction:

$ dnadna predict XXX_best_net.pth {real_dir}/*.npz --config XXX_predict_config.yaml -t real_to_spidna --progress-bar -o results.csv
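To make the effect of the real_to_spidna list defined above more concrete, here is a conceptual Python sketch of a subsample step followed by a crop step on a toy SNP matrix. It is not DNADNA's implementation: the functions subsample and crop below are illustrative stand-ins, the matrix layout (rows are haploid individuals, columns are SNPs) is an assumption, and so is the order in which monomorphic sites are removed relative to the SNP crop.

```python
import random


def subsample(matrix, size, rng):
    """Keep `size` rows (haploid individuals) drawn at random without replacement."""
    kept = sorted(rng.sample(range(len(matrix)), size))
    return [matrix[i] for i in kept]


def crop(matrix, max_snp=None, max_indiv=None, keep_polymorphic_only=True):
    """Keep the first `max_indiv` rows and the first `max_snp` columns."""
    rows = matrix[:max_indiv] if max_indiv is not None else matrix
    n_snp = len(rows[0]) if rows else 0
    columns = list(range(n_snp))
    if keep_polymorphic_only:
        # Assumption: a site is dropped when all remaining individuals carry
        # the same allele, and this happens before the SNP crop.
        columns = [j for j in columns if len({row[j] for row in rows}) > 1]
    columns = columns[:max_snp]  # max_snp=None keeps every column
    return [[row[j] for j in columns] for row in rows]


rng = random.Random(0)
# Toy data: 80 haploid individuals x 1000 sites with random 0/1 alleles.
toy = [[rng.randint(0, 1) for _ in range(1000)] for _ in range(80)]

# Same sequence of steps as real_to_spidna: subsample 50, then crop to 400 SNPs.
result = crop(subsample(toy, 50, rng), max_snp=400, max_indiv=None)
print(len(result), len(result[0]))
```

Replacing the subsample call with max_indiv=30 in crop mimics my_predict_2, which takes the first 30 individuals instead of a random subset.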

What are the output files for the predict step?

dnadna predict outputs the predictions in CSV format, printed to standard output by default while the process runs. You can redirect them to a file using standard shell redirection operators, as in dnadna predict {args} > predictions.csv, or you can specify an output file with the --output option.
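Once the predictions have been written to a file, they can be loaded for further analysis. The sketch below uses only Python's standard csv module and assumes that the first line of the file is a header; the column names depend on your model and are not assumed here. The helper names read_predictions and column_values are hypothetical.

```python
import csv


def read_predictions(path):
    """Return the rows of a predictions CSV as a list of dictionaries.

    Keys come from the header line of the file (assumed to be present).
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def column_values(rows, name):
    """Extract one column as floats, skipping missing or empty cells."""
    return [float(row[name]) for row in rows if row[name] not in ("", None)]
```

A typical session would call rows = read_predictions("predictions.csv"), inspect rows[0].keys() to discover the available columns, and then pass one of those names to column_values.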