Using and Implementing Simulators
For your convenience, DNADNA defines a simple, standardized interface for implementing simulators which output data in the standard DNADNA format.
Use of this feature is entirely optional, but may make it easier to implement new simulations, or adapt existing simulation codes to DNADNA. It could also be used, rather than generating new simulations, to load existing simulation data and adapt it to the DNADNA format.
The dnadna simulation
sub-command can be used from the command-line to
initialize and run simulations:
dnadna simulation init <name> [<simulator>] [<root-dir>]
– initializes a new simulation by creating an output directory for the simulation in the current working directory with the same name as the simulation, and generating an example config file for the simulation.You can also specify which simulator to use. By default there is one built-in simulator–a simple simulator named
OneEventSimulator
which uses msprime which generates a model with one population size change event.Additional simulator implementations can be provided via plugins. Below we will give a brief tutorial on how to implement a similar simulator.
dnadna simulation run <config-file>
– after making any desired edits to the config file generated bydnadna simulation init
(or writing a config file by hand), run the simulation specified in the config file.
Simulation config files
As previously noted in the section on Simulation Datasets, a simulation config
file is just a dataset config file
which may contain additional options/sections specific to the Simulator
plugin, for specifying parameters of the simulation.
For example when running:
$ dnadna simulation init my_simulation one_event
Writing sample simulation config to my_simulation/my_simulation_simulation_config.yml ...
if we output the generated config file, we’ll see some options that appear
in dataset config files like dataset_name
, it also has a
simulator_name
option, as well as several other options that are
specific to the OneEventSimulator
.
It also outputs a plugins
list showing what plugins need to be loaded to
run the simulation, in this case dnadna.examples.one_event
(although
this simulator comes built into the dnadna
package, it is implemented as
an example plugin).
Note, in the below example the ...
are just lines of the output that are
skipped for brevity:
$ cat my_simulation/my_simulation_simulation_config.yml
...
simulator_name: one_event
...
dataset_name: my_simulation
...
n_scenarios: 20000
n_samples: 50
n_replicates: 100
segment_length: 2000000.0
tmax: 100000
tmin: 2000
generation_time: 25
recombination_rate: 1.0e-08
mutation_rate: 1.0e-08
n_min: 3.6989700043360187
n_max: 4.698970004336019
...
plugins:
- dnadna.examples.one_event
Here the options n_scenarios
through n_max
are all options used by
OneEventSimulator
.
msprime Simulator Tutorial
This is a brief tutorial based on how to use msprime to generate simulations from a tree sequence.
This is based on a simplified version of the OneEventSimulator
example
simulator that comes with DNADNA. Clicking the link in the previous
sentence and finding the [source]
link will allow you to view its full
source code.
Implementing a Simulator
plugin for DNADNA essentially requires writing
just two functions:
One function which generates a Pandas
DataFrame
containing all the scenario parameters. This will be used by the simulator to write the scenario params table.One function which generates the simulated SNPs themselves, as
SNPSample
objects. Because simulation datasets can be quite large, this function should be implemented as a generator function. It should yield 3-tuples in the form(scenario_idx, replicate_idx, snp)
where the first two items are the scenario and replicate index of the sample, followed by theSNPSample
object itself.
Before diving in to writing the full plugin (which is implemented as a class) let us focus first just on writing these two functions.
First, start an empty Python file named my_simulator.py
and start it out
with some standard imports:
import msprime
import numpy as np
import pandas as pd
The generate_scenario_params
function
This function only requires Pandas (since it is required to return a
DataFrame
) and a little bit of NumPy used to generate columns of
the DataFrame, including a bit of randomization for each scenario.
The number of rows in our DataFrame will be the number of scenarios to
simulate (indexed by scenario_idx
) and each column will be the
simulation parameters for each scenario.
Our function will take as an argument a dict
containing some settings for
our simulation, for which we also provide some default values:
DEFAULT_SETTINGS = {
'n_scenarios': 3,
'Ne_min': 100,
'Ne_max': 1000,
'sample_size': 50,
'segment_length': 2e4,
'mutation_rate': 5e-7,
'recombination_rate': 1e-8,
'n_replicates': 6
}
def generate_scenario_params(settings=DEFAULT_SETTINGS):
# Create an index of scenarios just numbered starting from 0
# up to n_scenarios - 1
n_scenarios = settings['n_scenarios']
index = pd.Index(np.arange(n_scenarios), name='scenario_idx')
columns = ['Ne', 'sample_size', 'segment_length', 'mutation_rate',
'recombination_rate', 'n_replicates']
scenario_params = {}
# Randomize the population size Ne
scenario_params['Ne'] = np.random.uniform(
low=settings['Ne_min'],
high=settings['Ne_max'],
size=n_scenarios).astype(int)
# Fill the scenarios with the same segment_length, mutation and
# recombination rates, and n_replicates
for param in columns[1:]:
scenario_params[param] = settings[param]
# Wrap everything in a Pandas DataFrame
return pd.DataFrame(scenario_params, columns=columns, index=index)
We can test this in an interactive Python prompt like:
>>> from my_simulator import generate_scenario_params
>>> generate_scenario_params()
Ne sample_size segment_length mutation_rate recombination_rate n_replicates
scenario_idx
0 ... 50 20000.0 5.000000e-07 1.000000e-08 6
1 ... 50 20000.0 5.000000e-07 1.000000e-08 6
2 ... 50 20000.0 5.000000e-07 1.000000e-08 6
We can note that Ne
is randomized on subsequent runs.
The simulate_scenario
function
This is the meat of the simulator, and where we finally use msprime
. It
takes as an input a single scenario from the scenario_params
DataFrame
returned by the generate_scenario_params() function as a tuple returned
by itertuples
. Schematically, like:
for scenario in scenario_params.itertuples():
replicates = simulate_scenario(scenario)
This function will return every replicated simulation for the given scenario. Depending on the number of replicates, this could be quite large, so this function is best implemented as a generator function which iterates over all replicates and yields them one at a time.
The values yielded must be 3-tuples consisting of three values: the scenario
index, the replicate index, and a SNPSample
object representing the
simulated data. SNPSample
is a type in DNADNA used for representing a
simulated SNP matrix, along with its associated positions.
We will see in the example below how to use the
tskit.TreeSequence.genotype_matrix
method to convert the
tskit.TreeSequence
objects returned by msprime into SNP matrices in the
format expected by DNADNA.
Let’s see how this looks:
from dnadna.snp_sample import SNPSample
def simulate_scenario(scenario):
tree_sequences = msprime.simulate(
Ne=int(scenario.Ne),
sample_size=int(scenario.sample_size),
length=int(scenario.segment_length),
recombination_rate=scenario.recombination_rate,
mutation_rate=scenario.mutation_rate,
num_replicates=int(scenario.n_replicates))
for rep_idx, rep in enumerate(tree_sequences):
positions = rep.tables.asdict()['sites']['position']
# Normalize positions to the range [0.0, 1.0)
positions /= rep.sequence_length
# Transpose the genotype_matrix() so that the rows represent
# individuals and the columns represent SNP positions (the format
# expected by DNADNA) and convert to uint8 bytes
snps = rep.genotype_matrix().T.astype(np.uint8)
samp = SNPSample(snp=snps, pos=positions,
pos_format={'normalized': True})
yield (scenario.Index, rep_idx, samp)
Testing the code
We can test the code written so far in an interactive Python prompt (or a small script).
First generate the scenario parameters as before:
>>> from my_simulator import generate_scenario_params, simulate_scenario
>>> scenario_params = generate_scenario_params()
Let’s then test it for a single scenario. We use
itertuples
to return a single scenario:
>>> scenario = next(scenario_params.itertuples())
then pass it to simulate_scenario()
:
>>> replicates = simulate_scenario(scenario)
. Now replicates
is an iterator of (scenario_idx, replicate_idx,
sample)
tuples. We can look at the first one like:
>>> scenario_idx, replicate_idx, sample = next(replicates)
>>> print(f'scenario_idx: {scenario_idx}, replicate_idx: {replicate_idx}')
scenario_idx: 0, replicate_idx: 0
>>> print(sample)
SNPSample(
snp=tensor([[0, 0, 0, ..., 0, 0, 1],
[1, 0, 1, ..., 0, 1, 0],
[0, 0, 0, ..., 1, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 1],
[0, 0, 0, ..., 0, 0, 1],
[0, 0, 0, ..., 1, 0, 0]], dtype=torch.uint8),
pos=tensor([0.0020, 0.0213, 0.0259, 0.0273, 0.0332, 0.0349, 0.0399, 0.0415, 0.0538,
...
0.9591, 0.9666, 0.9681, 0.9810, 0.9833, 0.9840], dtype=torch.float64),
pos_format={'normalized': True}
))
Alternatively we can use loops, and use SNPSample.to_npz
to save our
samples as NPZ files in The DNADNA Dataset Format:
>>> scenario_params = generate_scenario_params()
>>> for scenario in scenario_params.iteritems():
... for scenario_idx, replicate_idx, sample in simulate_scenario(scenario):
... filename = f'scenario_{scenario_idx}_{replicate_idx}.npz'
... sample.to_npz(filename)
Tying it together into a Simulator plugin
Creating a simulator plugin for DNADNA requires subclassing the
dnadna.simulator.Simulator
class and providing at a minimum the two
functions implemented above, but as methods of the class.
It is also a good idea to provide default settings for the simulator (as we
did above) which in this case go into the config_default
attribute of
the class.
We can write the class directly into my_simulator.py
in such a way that
it simply wraps the code we’ve already written:
from dnadna.simulator import Simulator
class MySimulator(Simulator):
"""Basic msprime tree sequence simulator from the DNADNA tutorial."""
config_default = DEFAULT_SETTINGS
def generate_scenario_params(self):
return generate_scenario_params(self.config)
def simulate_scenario(self, scenario, verbose=False):
return simulate_scenario(scenario)
Note
Note that we did not write any code that actually saves the simulation
files, as we did when we tested the code above. That’s because that
task is automatically handled by the machinery built into the
Simulator
base class. When run with dnadna simulation run
, this
handles automatically generating simulation files in the
The DNADNA Dataset Format.
To use our plugin with pass the --plugin=my_simulator.py
argument to
dnadna
at the command-line (at least for the simulation init
command). But first we can also see that our plugin loads by running:
$ dnadna --plugin=my_simulator.py simulation --help-simulators
my_simulator: <class '<run_path>.MySimulator'>
Basic msprime tree sequence simulator from the DNADNA tutorial.
Note
The unusual class name <class '<run_path>.MySimulator>
is an
artifact of how the runpy
module is used to load plugins from files.
We can run dnadna simulation init
, but again with the plugin loaded, and
for the second positional argument we pass the simulator name
(my_simulator
):
$ dnadna --plugin=my_simulator.py simulation init example my_simulator
Writing sample simulation config to example/example_simulation_config.yml ...
Edit the config file as needed, then run this simulation with the command:
dnadna simulation run example/example_simulation_config.yml
Note that when running dnadna simulation run
it is not necessary any
longer to pass the --plugin
argument, since the use of the plugin is
written into the config file.
So we can adjust the settings in example/example_simulation_config.yml
to our hearts’ delight and run our simulation.
Going further
There are some additional bits of information you can add to your
Simulator
class which can be useful if you are
frequently using a simulation generated by a custom simulator:
preprocessing_config_default
: Provide default values for some or all of a preprocessing config file. Whendnadna init --simulation-config=
is passed your simulation config file it will use the values from this dict to generate the default preprocessing config file. This can be useful, for example, to generate a meaningfullearned_params
section for parameters of your simulation. See again theOneEventSimulator
class for an example.training_config_default
: Likepreprocessing_config_default
but for default values to put into training config files for training models on this simulation.config_schema
: Provide a schema in JSON Schema format (see ref:configuration-schemas
) for configuring your simulator. This is useful for sharing it with other users, as it can help document what settings your custom simulator accepts in the config file.