dnadna.simulator
Base implementation for Simulator classes.
This can be used to implement new population genetics simulators, or to wrap existing simulators in a common interface used by DNADNA, as well as to adapt existing simulation data (without recomputing it) to the DNADNA interface.
Classes
Simulator – Base class for implementing simulators for use with the DNADNA training system.
- class dnadna.simulator.Simulator(config={}, validate=True, **kwargs)[source]
Bases: dnadna.utils.config.ConfigMixIn, dnadna.utils.plugins.Pluggable
Base class for implementing simulators for use with the DNADNA training system.
None of the other components in DNADNA depend directly on this interface, so its primary purpose is as a convenience and as an example of how to implement a basic SNP data simulator.
It can be used as the base class for a novel simulator (see the dnadna.examples.one_event module for a concrete example), or to wrap existing simulation code under a common API. It can also be used as a converter that reads existing simulation data and translates it to the format expected by DNADNA’s other tools.
- classmethod class_from_config(config)[source]
Load the appropriate Simulator subclass determined from the config and optionally validate the config against a Simulator-specific config schema.
- config_default = Config({'data_root': '.', 'dataset_name': 'generic', 'scenario_params_path': 'scenario_params.csv', 'data_source': {'format': 'dnadna', 'filename_format': 'scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz', 'keys': ['SNP', 'POS']}, 'ignore_missing': False, 'cache_validation_set': False})
A dict containing a default sample configuration for this simulator. The sample configuration need not be fully functional, but it can be used to initialize a template config file for the simulation.
- config_filename_format = '{dataset_name}_simulation_config.yml'
Format string for the default config filename; it is passed the dataset_name from the config as a template variable.
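As a rough illustration of how such format strings expand, the sketch below applies plain str.format to the two templates that appear in the documented defaults. The helper names config_filename and data_filename are hypothetical, and whether DNADNA applies any extra processing (for example zero-padding of scenario or replicate numbers) is not stated here, so treat the data filename in particular as an assumption.

```python
# Format strings copied from the documented defaults on this page.
CONFIG_FILENAME_FORMAT = "{dataset_name}_simulation_config.yml"
DATA_FILENAME_FORMAT = "scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz"


def config_filename(dataset_name):
    """Hypothetical helper: expand the config filename template."""
    return CONFIG_FILENAME_FORMAT.format(dataset_name=dataset_name)


def data_filename(dataset_name, scenario, replicate):
    """Hypothetical helper: expand the data filename template.

    Assumption: plain str.format substitution, with no padding of the
    scenario or replicate numbers.
    """
    return DATA_FILENAME_FORMAT.format(
        dataset_name=dataset_name, scenario=scenario, replicate=replicate
    )


print(config_filename("generic"))
print(data_filename("generic", 3, 0))
```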
- config_schema = 'simulation'
The name of the schema, or a dict containing a schema, against which to validate the configuration for this simulator. Custom simulators can override or extend the base ‘simulation’ schema.
- abstract generate_scenario_params()[source]
Generate and return the pandas.DataFrame containing scenario parameters for this simulation.
The scenario parameters table (or scenario params for short) is a pandas.DataFrame containing at a minimum: a pandas.Index named 'scenario_idx', which gives an integer label to each scenario in the table, and a column named 'n_replicates' giving the number of replicates for each scenario. A “replicate” is a copy of a scenario, generated using the same scenario parameters, but containing potentially different data (i.e. with different randomization). If a simulation does not use scenario replicates, the value in this column can simply be set to 1 for each scenario.
Beyond this, the scenario parameters may contain any number of columns giving the known values of parameters that were used to generate the simulation, such as “mutation rate” or “recombination rate”, among many others. See dnadna.examples.one_event for an example.
The method’s only argument is self, so all information needed to generate the scenario params should be provided to the simulator via its __init__ method, especially via the simulator config file.
This method may either generate the scenario parameters for the simulation itself or, if this is a wrapper class for an existing simulation, simply return the parameters of that simulation, possibly reorganized into the correct format.
- classmethod get_schema()[source]
Returns the schema extensions for simulator plugins.
Simulation configs are validated against the base simulation schema plus any simulator-specific schemas provided by Simulator plugins.
- load_scenario_params(scenario_id=0, n_scenarios=None, load_existing=False, save=False)[source]
Returns a pandas.DataFrame containing the simulation parameters table, either by reading it from a file (currently only CSV is supported) given by self.scenario_params_path, or by generating it by calling Simulator.generate_scenario_params.
- Keyword Arguments
scenario_id (int) – (optional) – The ID number of the initial scenario to return (default: 0, i.e. return all scenarios when n_scenarios is not given).
n_scenarios (int) – (optional) – The number of scenarios to return, starting from scenario_id; by default returns all scenarios in the scenario params table starting from scenario_id.
load_existing (bool) – (optional) – If the scenario params file given by the config already exists, load the existing one instead of regenerating it.
save (bool) – (optional) – If the scenario params file does not already exist at the path given by self.scenario_params_path, it is generated by calling Simulator.generate_scenario_params. If save=True the generated file is also saved to that path. Otherwise it is not saved, and the pandas.DataFrame is simply returned.
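The load-or-generate behaviour described above can be sketched as a stand-alone function. This is not DNADNA’s implementation: the function below is a hypothetical analogue that takes the path and a generator callable explicitly, and it assumes that the table is regenerated whenever load_existing is False, that scenario_id selects by index label, and that n_scenarios limits the number of rows returned.

```python
import os
import tempfile

import pandas as pd


def load_scenario_params(path, generate, scenario_id=0, n_scenarios=None,
                         load_existing=False, save=False):
    """Hypothetical stand-alone analogue of the documented method."""
    if load_existing and os.path.exists(path):
        # Reuse the CSV file, restoring the named index.
        params = pd.read_csv(path, index_col="scenario_idx")
    else:
        # Assumption: regenerate whenever the file is not being reused.
        params = generate()
        if save:
            params.to_csv(path)
    # Assumption: select by index label, then cap the number of rows.
    params = params.loc[scenario_id:]
    if n_scenarios is not None:
        params = params.iloc[:n_scenarios]
    return params


def generate():
    """Toy generator: five scenarios with one replicate each."""
    return pd.DataFrame(
        {"n_replicates": [1] * 5},
        index=pd.Index(range(5), name="scenario_idx"),
    )


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "scenario_params.csv")
    first = load_scenario_params(csv_path, generate, save=True)
    subset = load_scenario_params(csv_path, generate, scenario_id=2,
                                  n_scenarios=2, load_existing=True)

print(list(subset.index))
```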
- abstract property name
The name of the simulator, used primarily to select this simulator from the command-line. If multiple simulators with the same name are loaded simultaneously, only the last-loaded will be used (and a warning is issued).
- preprocessing_config_default = Config({})
Simulators may extend the default preprocessing configuration file with additional settings optimized for the simulator.
- abstract simulate_scenario(scenario, verbose=False)[source]
Simulate a single scenario, given as a named tuple from the simulation params table as returned by pandas.DataFrame.itertuples. Returns an iterator over replicates in the scenario.
The items returned from this method should be 3-tuples in the format (scenario_idx, rep_idx, SNPSample(snp=SNPs, pos=positions)), where scenario_idx is the index into the scenario params table for the parameters that were used to produce this simulation; rep_idx is the replicate index when multiple replicates are generated for each scenario (otherwise it can just be 0); and the final element is an SNPSample instance containing the SNP matrix and positions array generated by the simulator.
This method is called from Simulator.simulate_scenarios, which loops over all rows in the simulation params and calls this method for each scenario, possibly in parallel if parallelization is enabled. For finer control over simulation flow, Simulator.simulate_scenarios may also be overridden by a subclass.
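The shape of the values this method is expected to yield can be illustrated with a toy generator. Everything here is a stand-in: the SNPSample and Scenario named tuples below are local substitutes for DNADNA’s own classes and for a row from itertuples, the n_samples and n_snps fields and the seed argument are hypothetical, and the random 0/1 matrix is not a population genetics model.

```python
import random
from collections import namedtuple

# Local stand-ins (assumptions), not the DNADNA classes themselves.
SNPSample = namedtuple("SNPSample", ["snp", "pos"])
Scenario = namedtuple(
    "Scenario", ["scenario_idx", "n_replicates", "n_samples", "n_snps"]
)


def simulate_scenario(scenario, seed=0):
    """Yield (scenario_idx, rep_idx, SNPSample) for each replicate."""
    rng = random.Random(seed + scenario.scenario_idx)
    for rep_idx in range(scenario.n_replicates):
        # Toy data: a random binary matrix and sorted positions in [0, 1).
        snp = [
            [rng.randint(0, 1) for _ in range(scenario.n_snps)]
            for _ in range(scenario.n_samples)
        ]
        pos = sorted(rng.random() for _ in range(scenario.n_snps))
        yield scenario.scenario_idx, rep_idx, SNPSample(snp=snp, pos=pos)


scenario = Scenario(scenario_idx=0, n_replicates=2, n_samples=4, n_snps=5)
results = list(simulate_scenario(scenario))
for scenario_idx, rep_idx, sample in results:
    print(scenario_idx, rep_idx, len(sample.snp), len(sample.pos))
```

Writing the method as a generator keeps only one replicate in memory at a time, which matches the documented return type of an iterator over replicates.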
- simulate_scenarios(scenario_params, n_cpus=1, verbose=False)[source]
Return an iterator over simulated SNPs given a scenario params table (see Simulator.load_scenario_params for the format of this table).
This method should iterate over all scenarios in the simulation (possibly generating the simulation as well, or reading it from an existing simulation), passing each one to Simulator.simulate_scenario.
The items returned from the iterator should be 3-tuples in the format (scenario_idx, rep_idx, SNPSample(snp=SNPs, pos=positions)), where scenario_idx is the index into the scenario params table for the parameters that were used to produce this simulation; rep_idx is the replicate index when multiple replicates are generated for each scenario (otherwise it can just be 0); and the final element is an SNPSample instance containing the SNP matrix and positions array generated by the simulator.
- Parameters
scenario_params (pandas.DataFrame) – The scenario params table for the scenarios to simulate.
- Keyword Arguments
n_cpus (int) – (optional) – If 1, scenarios are simulated in serial; for n_cpus > 1 a process pool of size n_cpus is used. If n_cpus = 0 or None, use the default number of CPUs used by multiprocessing.pool.Pool.
- training_config_default = Config({})
Simulators may extend the default training configuration file with additional settings optimized for the simulator.