dnadna.data_preprocessing

Implements pre-processing of simulation data in preparation for a training run, including training config initialization.

Classes

DataPreprocessor(config[, dataset, …])

Exceptions

ScenarioValidationError

Exception raised if the simulation cannot be used with the given training parameters.

class dnadna.data_preprocessing.DataPreprocessor(config, dataset=None, learned_params=None, validate=True)[source]

Bases: dnadna.utils.config.ConfigMixIn

check_scenario(scenario_idx, scenario)[source]

Validate an individual scenario’s simulations against the training configuration, for example the minimum number of SNPs required.

check_scenarios()[source]

Perform validation checks against all scenarios in the dataset, optionally using multiple processes.

Returns a generator yielding (keep, scenario_idx, n_replicates) tuples, where keep is True/False depending on whether or not the scenario passed validation and will be used for training, scenario_idx is the index of the scenario checked, and n_replicates the number of valid simulation replicates found within that scenario.
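As an illustration of how tuples of this shape might be consumed, the sketch below uses a hypothetical stand-in generator, `fake_check_scenarios`, rather than calling dnadna itself. The helper `summarize` and the sample values are invented for demonstration only.

```python
def fake_check_scenarios():
    """Hypothetical stand-in yielding (keep, scenario_idx, n_replicates)."""
    yield (True, 0, 10)
    yield (False, 1, 0)
    yield (True, 2, 7)


def summarize(results):
    """Collect the kept scenario indices and the total valid replicates."""
    kept = []
    total_replicates = 0
    for keep, scenario_idx, n_replicates in results:
        if keep:
            kept.append(scenario_idx)
            total_replicates += n_replicates
    return kept, total_replicates


kept, total = summarize(fake_check_scenarios())
```

Because the results arrive from a generator, they can be processed one scenario at a time without holding every result in memory.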

static check_snp_sample(scenario_idx, replicate_idx, snp, min_snp=None, min_indiv=None)[source]

Check that a single SNPSample conforms to the pre-processing requirements.
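A minimal sketch of the kind of threshold check this describes, assuming a sample can be viewed as a list of rows with one row per individual and one column per SNP. The function name `sample_meets_minimums`, the row-based representation, and the boolean return value are all assumptions; the real `SNPSample` class and the method's behaviour on failure are not documented here.

```python
def sample_meets_minimums(rows, min_snp=None, min_indiv=None):
    """Return True if the sample satisfies both optional thresholds.

    A threshold of None disables the corresponding check.
    """
    n_indiv = len(rows)
    n_snp = len(rows[0]) if rows else 0
    if min_snp is not None and n_snp < min_snp:
        return False
    if min_indiv is not None and n_indiv < min_indiv:
        return False
    return True


# Hypothetical sample: 20 individuals, 350 SNPs each.
sample = [[0] * 350 for _ in range(20)]
```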

preprocess_scenario_params(run_id=None, progress_bar=False)[source]

Returns a copy of the simulation’s original scenario params table suitable for the given training parameters.

Also returns a copy of the original training configuration with some post-processed training parameters inserted.

All scenarios in the simulation data are checked against the training parameters and unsuitable data is removed from the training set. This part can be the most time-consuming depending on the size of the data set, so an optional progress bar can be displayed during this operation.

Regression parameter values are also standardized (centered on their mean and scaled by their standard deviation), and any parameters specified for log-transformation are log-transformed.
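The two transformations mentioned above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not dnadna's implementation: it assumes a natural logarithm, assumes the log-transform is applied before standardization, and uses the population standard deviation. The function names are hypothetical.

```python
import math
import statistics


def log_transform(values):
    """Apply the natural logarithm to each (strictly positive) value."""
    return [math.log(v) for v in values]


def standardize(values):
    """Center on the mean and scale by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population, not sample, std
    return [(v - mean) / std for v in values]


# Hypothetical parameter column spanning several orders of magnitude.
raw = [100.0, 1000.0, 10000.0]
z = standardize(log_transform(raw))
```

After both steps the values have zero mean and unit standard deviation, which keeps regression targets of very different magnitudes on a comparable scale.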

validate_config(config)[source]

Additional validation of the preprocessing config file.

exception dnadna.data_preprocessing.ScenarioValidationError[source]

Bases: Exception

Exception raised if the simulation cannot be used with the given training parameters.