Using and Implementing Simulators

For your convenience, DNADNA defines a simple, standardized interface for implementing simulators which output data in the standard DNADNA format.

Use of this feature is entirely optional, but may make it easier to implement new simulations, or adapt existing simulation codes to DNADNA. It could also be used, rather than generating new simulations, to load existing simulation data and adapt it to the DNADNA format.

The dnadna simulation sub-command can be used from the command-line to initialize and run simulations:

  • dnadna simulation init <name> [<simulator>] [<root-dir>] – initializes a new simulation by creating an output directory for the simulation in the current working directory with the same name as the simulation, and generating an example config file for the simulation.

    You can also specify which simulator to use. By default there is one built-in simulator–a simple simulator named OneEventSimulator which uses msprime which generates a model with one population size change event.

    Additional simulator implementations can be provided via plugins. Below we will give a brief tutorial on how to implement a similar simulator.

  • dnadna simulation run <config-file> – after making any desired edits to the config file generated by dnadna simulation init (or writing a config file by hand), run the simulation specified in the config file.

Simulation config files

As previously noted in the section on Simulation Datasets, a simulation config file is just a dataset config file which may contain additional options/sections specific to the Simulator plugin, for specifying parameters of the simulation.

For example when running:

$ dnadna simulation init my_simulation one_event
Writing sample simulation config to my_simulation/my_simulation_simulation_config.yml ...

if we output the generated config file, we’ll see some options that appear in dataset config files like dataset_name, it also has a simulator_name option, as well as several other options that are specific to the OneEventSimulator.

It also outputs a plugins list showing what plugins need to be loaded to run the simulation, in this case dnadna.examples.one_event (although this simulator comes built into the dnadna package, it is implemented as an example plugin).

Note, in the below example the ... are just lines of the output that are skipped for brevity:

$ cat my_simulation/my_simulation_simulation_config.yml
...
simulator_name: one_event
...
dataset_name: my_simulation
...
n_scenarios: 20000
n_samples: 50
n_replicates: 100
segment_length: 2000000.0
tmax: 100000
tmin: 2000
generation_time: 25
recombination_rate: 1.0e-08
mutation_rate: 1.0e-08
n_min: 3.6989700043360187
n_max: 4.698970004336019
...
plugins:
- dnadna.examples.one_event

Here the options n_scenarios through n_max are all options used by OneEventSimulator.

msprime Simulator Tutorial

This is a brief tutorial based on how to use msprime to generate simulations from a tree sequence.

This is based on a simplified version of the OneEventSimulator example simulator that comes with DNADNA. Clicking the link in the previous sentence and finding the [source] link will allow you to view its full source code.

Implementing a Simulator plugin for DNADNA essentially requires writing just two functions:

  • One function which generates a Pandas DataFrame containing all the scenario parameters. This will be used by the simulator to write the scenario params table.

  • One function which generates the simulated SNPs themselves, as SNPSample objects. Because simulation datasets can be quite large, this function should be implemented as a generator function. It should yield 3-tuples in the form (scenario_idx, replicate_idx, snp) where the first two items are the scenario and replicate index of the sample, followed by the SNPSample object itself.

Before diving in to writing the full plugin (which is implemented as a class) let us focus first just on writing these two functions.

First, start an empty Python file named my_simulator.py and start it out with some standard imports:

import msprime
import numpy as np
import pandas as pd

The generate_scenario_params function

This function only requires Pandas (since it is required to return a DataFrame) and a little bit of NumPy used to generate columns of the DataFrame, including a bit of randomization for each scenario.

The number of rows in our DataFrame will be the number of scenarios to simulate (indexed by scenario_idx) and each column will be the simulation parameters for each scenario.

Our function will take as an argument a dict containing some settings for our simulation, for which we also provide some default values:

DEFAULT_SETTINGS = {
    'n_scenarios': 3,
    'Ne_min': 100,
    'Ne_max': 1000,
    'sample_size': 50,
    'segment_length': 2e4,
    'mutation_rate': 5e-7,
    'recombination_rate': 1e-8,
    'n_replicates': 6
}

def generate_scenario_params(settings=DEFAULT_SETTINGS):
    # Create an index of scenarios just numbered starting from 0
    # up to n_scenarios - 1
    n_scenarios = settings['n_scenarios']
    index = pd.Index(np.arange(n_scenarios), name='scenario_idx')

    columns = ['Ne', 'sample_size', 'segment_length', 'mutation_rate',
               'recombination_rate', 'n_replicates']

    scenario_params = {}

    # Randomize the population size Ne
    scenario_params['Ne'] = np.random.uniform(
        low=settings['Ne_min'],
        high=settings['Ne_max'],
        size=n_scenarios).astype(int)

    # Fill the scenarios with the same segment_length, mutation and
    # recombination rates, and n_replicates
    for param in columns[1:]:
        scenario_params[param] = settings[param]

    # Wrap everything in a Pandas DataFrame
    return pd.DataFrame(scenario_params, columns=columns, index=index)

We can test this in an interactive Python prompt like:

>>> from my_simulator import generate_scenario_params
>>> generate_scenario_params()
               Ne  sample_size  segment_length  mutation_rate  recombination_rate  n_replicates
scenario_idx
0             ...           50         20000.0   5.000000e-07        1.000000e-08             6
1             ...           50         20000.0   5.000000e-07        1.000000e-08             6
2             ...           50         20000.0   5.000000e-07        1.000000e-08             6

We can note that Ne is randomized on subsequent runs.

The simulate_scenario function

This is the meat of the simulator, and where we finally use msprime. It takes as an input a single scenario from the scenario_params DataFrame returned by the generate_scenario_params() function as a tuple returned by itertuples. Schematically, like:

for scenario in scenario_params.itertuples():
    replicates = simulate_scenario(scenario)

This function will return every replicated simulation for the given scenario. Depending on the number of replicates, this could be quite large, so this function is best implemented as a generator function which iterates over all replicates and yields them one at a time.

The values yielded must be 3-tuples consisting of three values: the scenario index, the replicate index, and a SNPSample object representing the simulated data. SNPSample is a type in DNADNA used for representing a simulated SNP matrix, along with its associated positions.

We will see in the example below how to use the tskit.TreeSequence.genotype_matrix method to convert the tskit.TreeSequence objects returned by msprime into SNP matrices in the format expected by DNADNA.

Let’s see how this looks:

from dnadna.snp_sample import SNPSample

def simulate_scenario(scenario):
    tree_sequences = msprime.simulate(
        Ne=int(scenario.Ne),
        sample_size=int(scenario.sample_size),
        length=int(scenario.segment_length),
        recombination_rate=scenario.recombination_rate,
        mutation_rate=scenario.mutation_rate,
        num_replicates=int(scenario.n_replicates))

    for rep_idx, rep in enumerate(tree_sequences):
        positions = rep.tables.asdict()['sites']['position']

        # Normalize positions to the range [0.0, 1.0)
        positions /= rep.sequence_length

        # Transpose the genotype_matrix() so that the rows represent
        # individuals and the columns represent SNP positions (the format
        # expected by DNADNA) and convert to uint8 bytes
        snps = rep.genotype_matrix().T.astype(np.uint8)
        samp = SNPSample(snp=snps, pos=positions,
                         pos_format={'normalized': True})
        yield (scenario.Index, rep_idx, samp)

Testing the code

We can test the code written so far in an interactive Python prompt (or a small script).

First generate the scenario parameters as before:

>>> from my_simulator import generate_scenario_params, simulate_scenario
>>> scenario_params = generate_scenario_params()

Let’s then test it for a single scenario. We use itertuples to return a single scenario:

>>> scenario = next(scenario_params.itertuples())

then pass it to simulate_scenario():

>>> replicates = simulate_scenario(scenario)

. Now replicates is an iterator of (scenario_idx, replicate_idx, sample) tuples. We can look at the first one like:

>>> scenario_idx, replicate_idx, sample = next(replicates)
>>> print(f'scenario_idx: {scenario_idx}, replicate_idx: {replicate_idx}')
scenario_idx: 0, replicate_idx: 0
>>> print(sample)
SNPSample(
    snp=tensor([[0, 0, 0,  ..., 0, 0, 1],
                [1, 0, 1,  ..., 0, 1, 0],
                [0, 0, 0,  ..., 1, 0, 0],
                ...,
                [0, 0, 0,  ..., 0, 0, 1],
                [0, 0, 0,  ..., 0, 0, 1],
                [0, 0, 0,  ..., 1, 0, 0]], dtype=torch.uint8),
    pos=tensor([0.0020, 0.0213, 0.0259, 0.0273, 0.0332, 0.0349, 0.0399, 0.0415, 0.0538,
                ...
                0.9591, 0.9666, 0.9681, 0.9810, 0.9833, 0.9840], dtype=torch.float64),
    pos_format={'normalized': True}
))

Alternatively we can use loops, and use SNPSample.to_npz to save our samples as NPZ files in The DNADNA Dataset Format:

>>> scenario_params = generate_scenario_params()
>>> for scenario in scenario_params.iteritems():
...     for scenario_idx, replicate_idx, sample in simulate_scenario(scenario):
...         filename = f'scenario_{scenario_idx}_{replicate_idx}.npz'
...         sample.to_npz(filename)

Tying it together into a Simulator plugin

Creating a simulator plugin for DNADNA requires subclassing the dnadna.simulator.Simulator class and providing at a minimum the two functions implemented above, but as methods of the class.

It is also a good idea to provide default settings for the simulator (as we did above) which in this case go into the config_default attribute of the class.

We can write the class directly into my_simulator.py in such a way that it simply wraps the code we’ve already written:

from dnadna.simulator import Simulator


class MySimulator(Simulator):
    """Basic msprime tree sequence simulator from the DNADNA tutorial."""

    config_default = DEFAULT_SETTINGS

    def generate_scenario_params(self):
        return generate_scenario_params(self.config)

    def simulate_scenario(self, scenario, verbose=False):
        return simulate_scenario(scenario)

Note

Note that we did not write any code that actually saves the simulation files, as we did when we tested the code above. That’s because that task is automatically handled by the machinery built into the Simulator base class. When run with dnadna simulation run, this handles automatically generating simulation files in the The DNADNA Dataset Format.

To use our plugin with pass the --plugin=my_simulator.py argument to dnadna at the command-line (at least for the simulation init command). But first we can also see that our plugin loads by running:

$ dnadna --plugin=my_simulator.py simulation --help-simulators
my_simulator: <class '<run_path>.MySimulator'>
    Basic msprime tree sequence simulator from the DNADNA tutorial.

Note

The unusual class name <class '<run_path>.MySimulator> is an artifact of how the runpy module is used to load plugins from files.

We can run dnadna simulation init, but again with the plugin loaded, and for the second positional argument we pass the simulator name (my_simulator):

$ dnadna --plugin=my_simulator.py simulation init example my_simulator
Writing sample simulation config to example/example_simulation_config.yml ...
Edit the config file as needed, then run this simulation with the command:

    dnadna simulation run example/example_simulation_config.yml

Note that when running dnadna simulation run it is not necessary any longer to pass the --plugin argument, since the use of the plugin is written into the config file.

So we can adjust the settings in example/example_simulation_config.yml to our hearts’ delight and run our simulation.

Going further

There are some additional bits of information you can add to your Simulator class which can be useful if you are frequently using a simulation generated by a custom simulator:

  • preprocessing_config_default: Provide default values for some or all of a preprocessing config file. When dnadna init --simulation-config= is passed your simulation config file it will use the values from this dict to generate the default preprocessing config file. This can be useful, for example, to generate a meaningful learned_params section for parameters of your simulation. See again the OneEventSimulator class for an example.

  • training_config_default: Like preprocessing_config_default but for default values to put into training config files for training models on this simulation.

  • config_schema: Provide a schema in JSON Schema format (see ref:configuration-schemas) for configuring your simulator. This is useful for sharing it with other users, as it can help document what settings your custom simulator accepts in the config file.