dnadna.simulator

Base implementation for Simulator classes.

This can be used to implement new population genetics simulators, or to wrap existing simulators in a common interface used by DNADNA, as well as to adapt existing simulation data (without recomputing it) to the DNADNA interface.

Classes

Simulator([config, validate])

Base class for implementing simulators for use with the DNADNA training system.

class dnadna.simulator.Simulator(config={}, validate=True, **kwargs)[source]

Bases: dnadna.utils.config.ConfigMixIn, dnadna.utils.plugins.Pluggable

Base class for implementing simulators for use with the DNADNA training system.

None of the other components in DNADNA depends directly on this interface, so its primary purpose is as a convenience and as an example of how to implement a basic SNP data simulator.

It can be used as the base class for a novel simulator (see the dnadna.examples.one_event module for a concrete example), or to wrap existing simulation code under a common API. It can also be used as a converter that reads existing simulation data and translates it to the format expected by DNADNA’s other tools.

classmethod class_from_config(config)[source]

Load the appropriate Simulator subclass determined from the config and optionally validate the config against a Simulator-specific config schema.

config_default = Config({'data_root': '.', 'dataset_name': 'generic', 'scenario_params_path': 'scenario_params.csv', 'data_source': {'format': 'dnadna', 'filename_format': 'scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz', 'keys': ['SNP', 'POS']}, 'ignore_missing': False, 'cache_validation_set': False})

A dict containing a default sample configuration for this simulator; the sample configuration need not be fully functional, but it can be used to initialize a template config file for the simulation.

config_filename_format = '{dataset_name}_simulation_config.yml'

Format string for the default config filename; it is passed the dataset_name from the config as a template variable.
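As a rough illustration of the template variables named in the two format strings above, the sketch below expands them with Python's built-in str.format. The scenario and replicate values are made up, and DNADNA's own handling of these fields (for example any zero padding) may differ; this only shows which variables each template expects.

```python
# Templates copied from the class attributes documented above.
config_filename_format = "{dataset_name}_simulation_config.yml"
data_filename_format = "scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz"

# Illustrative values only; DNADNA may pad or otherwise format these fields.
config_name = config_filename_format.format(dataset_name="generic")
data_path = data_filename_format.format(
    dataset_name="generic", scenario=3, replicate=0
)

print(config_name)
print(data_path)
```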

config_schema = 'simulation'

The name of the schema or a dict containing a schema against which to validate the configuration for this simulator. Custom simulators can override or extend the base ‘simulation’ schema.

abstract generate_scenario_params()[source]

Generate and return the pandas.DataFrame containing scenario parameters for this simulation.

The scenario parameters table (or scenario params for short) is a pandas.DataFrame containing at a minimum: a pandas.Index named 'scenario_idx' which gives an integer label to each scenario in the table, and a column named 'n_replicates' giving the number of replicates for each scenario. A “replicate” is a copy of a scenario, generated using the same scenario parameters, but containing potentially different data (i.e. with different randomization). If a simulation does not use scenario replicates, the value in this column can simply be set to 1 for each scenario.

Beyond this, the scenario parameters may contain any number of columns giving the known values of parameters that were used to generate the simulation, such as “mutation rate” or “recombination rate”, among many others. See dnadna.examples.one_event for an example.

Its only argument is self, so all information needed to generate the scenario params should be provided to the simulator via its __init__ method, especially via the simulator config file.

This method may either implement the task of generating scenario parameters for the simulation; or if this is a wrapper class for an existing simulation, it may simply return the parameters of that simulation, possibly reorganized into the correct format.
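A minimal sketch of a table with the required shape is given here, assuming only pandas. The helper name make_scenario_params and the mutation_rate column are hypothetical and chosen for illustration; a real simulator would return such a table from its own generate_scenario_params method.

```python
import pandas as pd


def make_scenario_params(n_scenarios=3, n_replicates=2):
    """Build a minimal scenario params table (hypothetical helper)."""
    params = pd.DataFrame({
        # Required column: number of replicates for each scenario.
        "n_replicates": [n_replicates] * n_scenarios,
        # Optional extra column with a made-up parameter value per scenario.
        "mutation_rate": [1e-8] * n_scenarios,
    })
    # Required: the index must be named 'scenario_idx'.
    params.index.name = "scenario_idx"
    return params


params = make_scenario_params()
print(params)
```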

classmethod get_schema()[source]

Returns the schema extensions for simulator plugins.

Simulation configs are validated against the base simulation schema plus any simulator-specific schemas provided by Simulator plugins.

load_scenario_params(scenario_id=0, n_scenarios=None, load_existing=False, save=False)[source]

Returns a pandas.DataFrame containing the simulation parameters table, either by reading it from a file (currently only CSV supported) given by self.scenario_params_path, or by generating it by calling Simulator.generate_scenario_params.

Keyword Arguments
  • scenario_id (int) – (optional) – The ID number of the first scenario to return (default: 0, which together with the default n_scenarios returns all scenarios).

  • n_scenarios (int) – (optional) – The number of scenarios to return, starting from scenario_id; by default returns all scenarios in the scenario params table starting from scenario_id.

  • load_existing (bool) – (optional) – If the scenario params file given by the config already exists, load the existing one instead of regenerating it.

  • save (bool) – (optional) – If the scenario params file does not already exist at the path given by self.scenario_params_path, it is generated by calling Simulator.generate_scenario_params. If save=True the generated table is also saved to that path. Otherwise it is not saved, and the pandas.DataFrame is simply returned.
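The interaction of these keyword arguments can be sketched with a standalone function. This is not DNADNA's implementation: load_or_generate and its generate callback are hypothetical names, and the slicing assumes that scenario IDs are consecutive integers starting at 0.

```python
import os
import tempfile

import pandas as pd


def load_or_generate(path, generate, scenario_id=0, n_scenarios=None,
                     load_existing=False, save=False):
    """Load scenario params from CSV if allowed, otherwise generate them."""
    if load_existing and os.path.exists(path):
        params = pd.read_csv(path, index_col="scenario_idx")
    else:
        params = generate()
        if save:
            params.to_csv(path)
    # Assumption: scenario IDs are 0..N-1, so positions equal IDs.
    stop = None if n_scenarios is None else scenario_id + n_scenarios
    return params.iloc[scenario_id:stop]


def generate():
    # Made-up table with four scenarios and one replicate each.
    table = pd.DataFrame({"n_replicates": [1, 1, 1, 1]})
    table.index.name = "scenario_idx"
    return table


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "scenario_params.csv")
    everything = load_or_generate(csv_path, generate, save=True)
    subset = load_or_generate(csv_path, generate, scenario_id=1,
                              n_scenarios=2, load_existing=True)

print(subset)
```

The second call reads the file written by the first and returns only scenarios 1 and 2.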

abstract property name

The name of the simulator, used primarily to select this simulator from the command-line. If multiple simulators with the same name are loaded simultaneously, only the last-loaded will be used (and a warning is issued).

preprocessing_config_default = Config({})

Simulators may extend the default preprocessing configuration file with additional settings optimized for the simulator.

abstract simulate_scenario(scenario, verbose=False)[source]

Simulate a single scenario, given as a named tuple from the simulation params table as returned by pandas.DataFrame.itertuples. Returns an iterator over replicates in the scenario.

Each item returned from this method should be a 3-tuple in the format (scenario_idx, rep_idx, SNPSample(snp=SNPs, pos=positions)), where scenario_idx is the index into the scenario params table for the parameters that were used to produce this simulation; rep_idx is the replicate index (it can simply be 0 if the simulation does not generate multiple replicates per scenario); and the final element is an SNPSample instance containing the SNP matrix and positions array generated by the simulator.

This method is called from Simulator.simulate_scenarios, which loops over all rows in the simulation params and calls this method for each scenario, possibly in parallel if parallelization is enabled. For finer control over the simulation flow, Simulator.simulate_scenarios may also be overridden by a subclass.
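The shape of such an iterator might look like the generator sketched here. It uses only the standard library: FakeSNPSample is a hypothetical stand-in for DNADNA's SNPSample, the Scenario tuple imitates a row from pandas.DataFrame.itertuples (whose index field is named Index by default), and the n_samples and n_sites columns are invented for the example. Plain lists stand in for the SNP matrix and positions array.

```python
import random
from collections import namedtuple

# Hypothetical stand-ins; the real SNPSample class is provided by DNADNA.
FakeSNPSample = namedtuple("FakeSNPSample", ["snp", "pos"])
Scenario = namedtuple(
    "Scenario", ["Index", "n_replicates", "n_samples", "n_sites"]
)


def simulate_scenario(scenario, verbose=False):
    """Yield one (scenario_idx, rep_idx, sample) tuple per replicate."""
    for rep_idx in range(scenario.n_replicates):
        # A different seed per replicate gives different random data.
        rng = random.Random(scenario.Index * 1000 + rep_idx)
        snp = [[rng.randint(0, 1) for _ in range(scenario.n_sites)]
               for _ in range(scenario.n_samples)]
        pos = sorted(rng.random() for _ in range(scenario.n_sites))
        if verbose:
            print(f"scenario {scenario.Index}, replicate {rep_idx}")
        yield scenario.Index, rep_idx, FakeSNPSample(snp=snp, pos=pos)


row = Scenario(Index=0, n_replicates=2, n_samples=4, n_sites=5)
results = list(simulate_scenario(row))
```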

simulate_scenarios(scenario_params, n_cpus=1, verbose=False)[source]

Return an iterator over simulated SNPs given a scenario params table (see Simulator.load_scenario_params for the format of this table).

This method should iterate over all scenarios in the simulation (possibly generating the simulation as well, or reading it from an existing simulation), passing each one to Simulator.simulate_scenario.

Each item returned from the iterator should be a 3-tuple in the format (scenario_idx, rep_idx, SNPSample(snp=SNPs, pos=positions)), where scenario_idx is the index into the scenario params table for the parameters that were used to produce this simulation; rep_idx is the replicate index (it can simply be 0 if the simulation does not generate multiple replicates per scenario); and the final element is an SNPSample instance containing the SNP matrix and positions array generated by the simulator.

Parameters

scenario_params (pandas.DataFrame) – The scenario params table for the scenarios to simulate.

Keyword Arguments

n_cpus (int) – (optional) – If 1, scenarios are simulated in serial; for n_cpus > 1 a process pool of size n_cpus is used. If n_cpus = 0 or None, use the default number of CPUs used by multiprocessing.pool.Pool.
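The n_cpus behaviour described above could be approximated as in the following sketch, which is not DNADNA's implementation. The names run_scenarios, _run_one and toy_simulate are hypothetical, and the toy simulator takes a plain (scenario_idx, n_replicates) pair instead of a row from the scenario params table.

```python
from functools import partial
from multiprocessing import Pool


def _run_one(simulate, scenario):
    # Materialize the generator so results can be sent back from a worker.
    return list(simulate(scenario))


def run_scenarios(scenarios, simulate, n_cpus=1):
    """Yield simulation results serially or from a process pool."""
    if n_cpus == 1:
        for scenario in scenarios:
            yield from simulate(scenario)
    else:
        # n_cpus of 0 or None lets Pool pick its default number of workers.
        with Pool(processes=n_cpus or None) as pool:
            worker = partial(_run_one, simulate)
            for results in pool.imap(worker, scenarios):
                yield from results


def toy_simulate(scenario):
    # Hypothetical simulator: `scenario` is a (scenario_idx, n_replicates) pair.
    scenario_idx, n_replicates = scenario
    for rep_idx in range(n_replicates):
        yield scenario_idx, rep_idx, None


if __name__ == "__main__":
    for item in run_scenarios([(0, 2), (1, 1)], toy_simulate, n_cpus=1):
        print(item)
```

For the pool branch, the simulate function must be picklable (for example defined at module level), and on platforms that spawn worker processes the calling code should sit under an `if __name__ == "__main__":` guard as shown.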

training_config_default = Config({})

Simulators may extend the default training configuration file with additional settings optimized for the simulator.