Simulation Datasets
Simulation data used by DNADNA consists of two components, an SNP matrix and a positions array; one of each for each simulation.
The SNP matrix is a 2-D array, typically arranged such that the number of rows is the number of individuals sampled in the simulation, and the number of columns is the number of SNPs. The matrix contains boolean data, with a 1 when a variation is present, and a 0 otherwise.
The positions array has length equal to the number of columns in the associated SNP matrix. That is, it is the positions within a genome where the SNPs occur. The positions array may come in one of a number of different normalizations, such as absolute positions within a genome, or relative positions (distance to the previous SNP). They may also be whole numbers, or re-normalized to a floating point value in the range [0.0, 1.0). The format of position in your simulated data can be specified thanks to the properties
distance:True/False
andnormalized:True/False
in the dataset config file.
Note
Individual simulations are represented internally by the
SNPSample
class, and the SNP matrix and position
array are returned as PyTorch Tensors
by the
snp
and
pos
attributes respectively.
The DNADNA Dataset Format
DNADNA trains its models on large collections of simulation data, which is assumed to be organized on disk in the “DNADNA Format”. Although most of the internal API is agnostic to how simulation data is located and loaded, the high-level user interface currently assumes this format by default. Support for using arbitrary data sources via plugins will be available in a future version.
The DNADNA Format assumes that each SNP matrix and associated positions
array are stored together in a single NPZ file, either compressed or
uncompressed, under the 'SNP'
and 'POS'
keys respectively. The SNP
matrix is a 2-D array of 8-bit unsigned integer type, and the positions
array may consist of integers or single- or double-precision floats.
Each simulation has associated with it one or more scenario parameters (see the scenario parameters table below) which are known parameters, specific to the simulation. This scenario parameters table will be used while training your neural net to provide target values for parameters in your training and validation sets.
Each simulation dataset includes one or more scenarios which represent different a specific combination of scenario parameters which were used for simulations in that scenario. And each scenario may consist of one or more replicates, which are multiple copies of simulations with the same parameters, but differently randomized.
To summarize, each simulation dataset consists of:
One or more scenarios containing one or more simulation replicates, with one NPZ file per-replicate, per-scenario
A table of scenario parameters with one row for each scenario
A simulation config file containing additional metadata about the simulation dataset, and configuration on how to read it.
Filesystem layout
Although the DNADNA Format allows some customization as to how simulation files are organized on disk, by default it assumes a layout that looks like:
_ simulations/ # root directory of all simulation datasets; arbitrary name
\_ model-A/
\_ model-A_params.csv # the scenario parameters table
|_ model-A_simulation_config.yml # the simulation config file
|_ scenario_000/
\_ model-A_000_00.npz # scenario 0 replicate 0
...
|_ model-A_000_NN.npz
|_ scenario_001/
\_ model-A_001_00.npz # scenario 1 replicate 0
...
|_ model-A_001_NN.npz
...
|_ scenario_NNN/
\_ model-A_NNN_00.npz
...
|_ model-A_NNN_NN.npz
Note
Scenarios and replicates are always assumed to be numbered starting from zero, up to the number of scenarios/replicates minus one, for standard C/Python-style indexing.
Simulation and replicate numbers in filenames are typically zero-padded with the largest number of zeros to make the filesize consistent across the dataset, but this is optional.
The scenario params table
The scenario parameters table, or just “scenario params” for short lists the simulation parameters of each scenario in the dataset. It is currently assumed to be a CSV file containing at a minimum three columns (including a header container containing the column names):
scenario_idx
- this is the scenario number to which these parameters apply (integer)<parameter>
- at least one parameter, the name of which may be anything exceptscenario_idx
orn_replicates
, and is particular to the dataset (float or integer)n_replicates
- for each scenario, the number of replicates in that scenario; this may or may not be the same for all scenarios (integer)
Any number of additional columns are allowed, each of which are assumed to be a different simulation parameter. Example:
scenario_idx,mutation_rate,recombination_rate,event_time,recent_size,event_size,segment_length,n_samples,n_replicates
0,1e-08,1e-08,0.3865300203456797,-0.497464473948751,-0.5251379654844626,2000000,50,100
1,1e-08,1e-08,0.19344551118300793,0.16419897912977574,0.22637196776073554,2000000,50,100
...
The dataset config file
The dataset configuration, just “dataset config” for short contains some required additional metadata about the dataset, as well as optional parameters for customizing how the dataset is formatted, or additional metadata specific to the simulator that output the simulation. It can also be referred to as the “simulation config” in the latter case.
The dataset config is also always validated against a schema which both documents the format, and ensures the correctness of config files.
Here is an example, mostly minimal simulation config file:
1data_root: .
2dataset_name: generic
3scenario_params_path: scenario_params.csv
4data_source:
5 format: dnadna
6 filename_format: "scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz"
The data_root
property gives the root directory for all files in the
dataset. This is typically just the relative directory .
, which means
the same directory as the simulation config file itself, which is the
default layout for the DNADNA format.
You can also see that the data_source.filename_format
gives a template
for .npz
filenames which matches with the default filesystem
layout of the DNADNA Format. This property may
be omitted from the config file if the default format is used, but this
demonstrates that the filename format allows some customization.
Alternative dataset formats
As previously mentioned, although the DNADNA Format is used by default, alternative dataset formats are supported at the API level, and may be supported at the high-level via config files in the future.
In the meantime, if you have msprime or SLiM
scripts for your simulations, you can easily save the tree sequence object
into DNADNA format npz files by adding the following lines into your
scripts:
# If tree_sequence was simulated for multiple replicate of a given scenario
# scenario corresponds to a number indexing one set of parameter values
# cf scenario params table
for replicate, ts in enumerate(tree_sequence):
snps = ts.genotype_matrix().T.astype(np.uint8)
pos = np.round(ts.tables.sites.position).astype(np.int)
# Default layout (can be changed)
filename = f"scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz"
np.savez_compressed(os.path.join(outdir, filename), SNP=snps, POS=pos)
Note that the default layout for the DNADNA format can be changed to suit your wishes, e.g. you could change to:
filename = f"{dataset_name}/scen_{scenario}_arbitrary_text/rep_{replicate}/{scenario}_{replicate}.npz"
In which case you will update filename_format
in the
dataset config file:
data_source:
# string template for per-replicate simulation files in Python
# string template format; the following template variables may be
# used: 'name', the same as the name property used in this config
# file; 'scenario', the scenario number, and 'replicate', the
# replicate number of the scenario (if there are multiple
# replicates); path separators may also be used in the template to
# form a directory structure
filename_format: {dataset_name}/scen_{scenario}_arbitrary_text/rep_{replicate}/{scenario}_{replicate}.npz
before running:
$ dnadna init --dataset-config={dataset_name}/{dataset_name}_dataset_config.yml {model_name}
where {dataset_name}/{dataset_name}_dataset_config.yml
is the name you
picked for the config file.
You can check our notebook
for an illustration of a simple constant demographic scenario in msprime
saved as DNADNA format.
If you want to run several simulations or share your simulation code with other DNADNA users, it may be useful to adapt your existing simulation code to a Simulator plugin for DNADNA.
Note
msprime
outputs positions in bp, i.e. not normalized between 0 and 1, and
chromosomes are not circular (contrary to some bacterial simulators), thus
in the dataset config file you
will indicate:
position_format:
distance: false
normalized: false
circular: false