dnadna.data_preprocessing
Implements pre-processing of simulation data in preparation for a training run, including training config initialization.
Classes
Exceptions
Exception raised if the simulation cannot be used with the given training parameters.
- class dnadna.data_preprocessing.DataPreprocessor(config, dataset=None, learned_params=None, validate=True)
Bases: dnadna.utils.config.ConfigMixIn
- check_scenario(scenario_idx, scenario)
Validate an individual scenario’s simulations against the training configuration, for example the minimum number of SNPs.
- check_scenarios()
Perform validation checks against all scenarios in the dataset, optionally using multiple processes.
Returns a generator yielding (keep, scenario_idx, n_replicates) tuples, where keep is True/False depending on whether the scenario passed validation and will be used for training, scenario_idx is the index of the scenario checked, and n_replicates is the number of valid simulation replicates found within that scenario.
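As an illustration of how the yielded tuples might be consumed, the sketch below splits them into kept and dropped scenarios. It does not call dnadna: summarize_checks is a hypothetical helper, and fake_results is a made-up stand-in for the generator that check_scenarios() returns.

```python
def summarize_checks(results):
    """Split (keep, scenario_idx, n_replicates) tuples into kept and dropped scenarios.

    `results` is any iterable of such tuples; here it stands in for the
    generator described above (hypothetical helper, not part of dnadna).
    """
    kept = {}      # scenario_idx -> number of valid replicates
    dropped = []   # indices of scenarios that failed validation
    for keep, scenario_idx, n_replicates in results:
        if keep:
            kept[scenario_idx] = n_replicates
        else:
            dropped.append(scenario_idx)
    return kept, dropped


# Made-up values, for illustration only.
fake_results = iter([(True, 0, 10), (False, 1, 0), (True, 2, 7)])
kept, dropped = summarize_checks(fake_results)
print(f"kept {len(kept)} scenarios, dropped {len(dropped)}")
```

Because the result is a generator, it can only be iterated once; collect it into a list first if the tuples are needed again.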
- static check_snp_sample(scenario_idx, replicate_idx, snp, min_snp=None, min_indiv=None)
Check that a single SNPSample conforms to the pre-processing requirements.
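The kind of threshold check implied by the min_snp and min_indiv arguments could look like the sketch below. This is not the dnadna implementation: meets_minimums is a hypothetical function, the sample is modelled as a plain list of rows (one row per individual, one column per SNP, which is an assumption), and it returns a boolean, whereas how the real method reports a failure is not described here.

```python
def meets_minimums(snp_matrix, min_snp=None, min_indiv=None):
    """Return True if the matrix has at least min_snp columns and min_indiv rows.

    Assumption: rows are individuals and columns are SNPs. A threshold of
    None disables the corresponding check.
    """
    n_indiv = len(snp_matrix)
    n_snp = len(snp_matrix[0]) if snp_matrix else 0
    if min_snp is not None and n_snp < min_snp:
        return False
    if min_indiv is not None and n_indiv < min_indiv:
        return False
    return True


# Two individuals, three SNPs (made-up data).
sample = [[0, 1, 0],
          [1, 1, 0]]
ok = meets_minimums(sample, min_snp=3, min_indiv=2)
too_few_snps = meets_minimums(sample, min_snp=4)
```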
- preprocess_scenario_params(run_id=None, progress_bar=False)
Returns a copy of the simulation’s original scenario params table suitable for the given training parameters.
Also returns a copy of the original training configuration with some post-processed training parameters inserted.
All scenarios in the simulation data are checked against the training parameters and unsuitable data is removed from the training set. This part can be the most time-consuming depending on the size of the data set, so an optional progress bar can be displayed during this operation.
Regression parameter values are also normalized around their mean and standard deviation, and specified parameters are log-transformed.
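The normalization step can be pictured with the following sketch, written with the standard library only. It makes several assumptions that the text above does not confirm: the log transform is applied before standardization, the natural logarithm is used, and the sample standard deviation is used. standardize is a hypothetical helper, not a dnadna function.

```python
import math
import statistics


def standardize(values, log_transform=False):
    """Return (z_scores, mean, std) for a list of parameter values.

    Assumptions: the optional natural-log transform happens first, and the
    sample standard deviation (n - 1 denominator) is used.
    """
    if log_transform:
        values = [math.log(v) for v in values]
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    z_scores = [(v - mean) / std for v in values]
    return z_scores, mean, std


# Made-up parameter values for three scenarios.
z, mean, std = standardize([1.0, 2.0, 3.0])
z_log, mean_log, std_log = standardize([1.0, math.e, math.e ** 2], log_transform=True)
```

Returning the mean and standard deviation alongside the scaled values makes it possible to map predictions back to the original scale later.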