dnadna.datasets
Utilities for loading data from datasets, including classes for reading different dataset formats. Datasets are collections of SNP files for multiple scenarios, possibly with multiple replicates per scenario:
The NpzSNPSource class reads a dataset of multiple parameter scenarios with (possibly) multiple replicates per scenario, stored in NPZ files in a particular filesystem layout known as the DNADNA Format. This is the default dataset format understood by DNADNA.
The DictSNPSource class reads a JSON-based dataset format which is less efficient in terms of both storage compactness and parsing/serializing, but allows plain-text storage of SNP data. Currently it is used primarily in testing.
The DNATrainingDataset class and its simpler base class DNADataset are implementations of a PyTorch Dataset used for loading SNP data (in the form of SNPSamples) along with the associated scenario parameters, for both training and validation sets during model training. This works independently of the dataset format: the format is implemented as an SNPSource such as the two listed above, which is an abstract interface for arbitrary dataset formats.
Classes

- DNADataset: Simplified base class for DNADNA datasets which simply maps an integer index to an SNPSample instance from the simulation dataset.
- DNATrainingDataset: Dataset used in model training; can perform additional transformations on the data and keeps separate training and validation sets.
- DatasetTransformationMixIn: Partially implemented Dataset which accepts parameters for transforming the SNP data returned from the data source.
- DictSNPSource: SNP source that reads from a JSON-like data structure consisting of a dict with (simulation, replicate) pairs for keys.
- FileListSNPSource: SNP source that returns scenarios from a fixed list of arbitrary files.
- NpzSNPSource: SNP source that reads simulation data as SNPSamples stored on disk in DNADNA’s native “dnadna” format.
- SNPSource: A “SNPSource” is a class for loading SNPSample objects from some data source.

Exceptions

- MissingSNPSample: Exception raised when a specified sample is not found in an SNP source.
- class dnadna.datasets.DNADataset(config={}, validate=True, source=None, scenario_params=None, scenario_set=None, cached_set=None)[source]
Bases: Generic[torch.utils.data.dataset.T_co]
Simplified base class for DNADNA datasets which simply maps an integer index to an SNPSample instance from the simulation dataset.
This has two modes of operation. In the first, a scenario_params table is given as a pandas.DataFrame in the format described for the DNADNA Format. In this case, all the scenarios and replicates described in that table are returned (where they exist), and for each item in the dataset a (scenario_idx, replicate_idx, snp_sample, scenario_params) tuple is returned.
In the second mode of operation, scenario_params is not given, and the data sources are simply looped over directly. In this case a 4-tuple of (scenario_idx, replicate_idx, snp_sample, None) is returned for each item.
The DNATrainingDataset is the more complete implementation, which can perform additional transformations on the data when used in model training, and which keeps separate training and validation sets.
Given a scenario_set=<scenario_idx> argument, only the data in a single scenario are returned; this may also be a list/set of scenario indices to consider.
- property cached_set
Indices whose samples should be cached in memory.
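The two item formats described above can be illustrated with a small stand-in sketch (plain Python with invented placeholder data, not the actual DNADataset implementation):

```python
# Stand-in data source: SNP samples keyed by (scenario_idx, replicate_idx).
# Plain strings stand in for SNPSample objects here.
source = {(0, 0): 'snp_0_0', (0, 1): 'snp_0_1', (1, 0): 'snp_1_0'}

# Mode 1: a scenario_params table is given; each item carries its scenario's
# parameters (a plain dict stands in for the pandas.DataFrame row).
scenario_params = {0: {'mutation_rate': 1e-8}, 1: {'mutation_rate': 2e-8}}
with_params = [
    (s, r, sample, scenario_params[s])
    for (s, r), sample in sorted(source.items())
]

# Mode 2: no scenario_params; the source is looped over directly and the
# last element of each 4-tuple is None.
without_params = [
    (s, r, sample, None)
    for (s, r), sample in sorted(source.items())
]
```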
- classmethod from_config_file(filename, *args, validate=True, source=None, scenario_params=None, scenario_set=None, **kwargs)[source]
Load the Config from a file. Additional kwargs are passed to from_file.
The additional keyword arguments are passed to the dict serializer, and the config is validated against the dataset schema.
- get(index, ignore_missing=None)[source]
Same as DNATrainingDataset.__getitem__ but adds additional optional arguments.
- Parameters
index – index of the sample to get from the dataset
- Keyword Arguments
ignore_missing (bool, optional) – Whether or not to raise an error if the sample file is missing or can’t be loaded for another reason. By default this defers to the ignore_missing option in the dataset configuration, but this allows overriding the config file.
- class dnadna.datasets.DNATrainingDataset(config={}, validate=True, source=None, scenario_params=None, transforms=None, learned_params=None)[source]
Bases: Generic[torch.utils.data.dataset.T_co]
- classmethod from_config_file(filename, validate=True, source=None, scenario_params=None, transforms=None, learned_params=None, **kwargs)[source]
Load the Config from a file. Additional kwargs are passed to from_file.
The additional keyword arguments are passed to the dict serializer, and the config is validated against the training schema.
- class dnadna.datasets.DatasetTransformationMixIn(config, transforms=None, param_set=None, **kwargs)[source]
Bases: Generic[torch.utils.data.dataset.T_co]
Partially implemented Dataset which accepts parameters for transforming the SNP data returned from the data source.
- Parameters
transforms (list) – list giving transform names or transform descriptions (a transform name plus its parameters) as specified in the dataset_transforms property in the training config file; see also the training config schema. May also contain instances of Transform.
param_set (ParamSet) – ParamSet object representing all the details of the parameters to learn in training, including the values of those parameters for the training and validation sets (the pre-processed scenario params); information about the parameters can be used by some transforms.
Additional positional and keyword arguments are passed to super().__init__() so that this can be used as a mix-in with arbitrary DNADataset subclasses.
- static collate_batch(batch)[source]
Specifies how multiple scenario samples are collated into batches.
Each batch element is a single element as returned by DNATrainingDataset.__getitem__: (scenario_idx, replicate_idx, snp_sample, target).
Input samples and targets are collated into batches “vertically”, so that the size of the first dimension represents the number of items in a batch.
Examples
>>> import torch
>>> from dnadna.datasets import DNATrainingDataset
>>> from dnadna.snp_sample import SNPSample
>>> fake_snps = [torch.rand(3, 3 + i) for i in range(5)]
>>> fake_snps = [SNPSample(s[1:], s[0]) for s in fake_snps]
>>> fake_params = [torch.rand(4, dtype=torch.float64) for _ in range(5)]
>>> fake_batch = list(zip(range(5), [0] * 5, fake_snps, fake_params))
>>> collated_batch = DNATrainingDataset.collate_batch(fake_batch)
>>> scenario_idxs, inputs, targets = collated_batch
>>> bool((torch.arange(5) == scenario_idxs).all())
True
>>> inputs.shape  # last dim should be num SNPs in largest fake SNP
torch.Size([5, 3, 7])
>>> bool((inputs[0,:3,:3] == fake_snps[0].tensor).all())
True
>>> bool((inputs[0,3:,3:] == -1).all())
True
>>> bool((inputs[-1] == fake_snps[-1].tensor).all())
True
>>> targets.shape
torch.Size([5, 4])
>>> [bool((fake_params[bat].float() == targets[bat]).all())
...  for bat in range(targets.shape[0])]
[True, True, True, True, True]
- get(index, ignore_missing=None)[source]
Same as DNATrainingDataset.__getitem__ but adds additional optional arguments.
- Parameters
index – index of the sample to get from the dataset
- Keyword Arguments
ignore_missing (bool, optional) – Whether or not to raise an error if the sample file is missing or can’t be loaded for another reason. By default this defers to the ignore_missing option in the dataset configuration, but this allows overriding the config file.
- property test_set
Set of indices to use for testing.
- property training_set
Set of indices to use for training.
- property transforms
The composed set of transforms to apply to the dataset.
Either a dnadna.transforms.Compose instance, or a dict mapping dataset splits (“training”, “validation”, “test”) to their corresponding Compose of transforms.
- property validation_set
Set of indices to use for validation.
- class dnadna.datasets.DictSNPSource(scenarios, position_format=None, filename=None, lazy=True)[source]
Bases: dnadna.datasets.SNPSource
SNP source that reads from a JSON-like data structure consisting of a dict with (simulation, replicate) pairs for keys, and SNPSamples in JSON-compatible format for values (see to_dict).
Currently used just by the test suite, but may be useful in other contexts as well (e.g. serialization of simulations).
- Parameters
scenarios (dict) – dict with (simulation, replicate) tuple keys, and values in the format output by to_dict; the values may also be SNPSample instances (useful for testing).
- Keyword Arguments
position_format (dict, optional) – Position format dict corresponding to the pos_format argument to SNPSample (currently all samples in the dataset are assumed to have the same position formats).
filename (str, optional) – If the scenarios dict was read from a file (e.g. a JSON or YAML file) this can be set to the filename; this is used just as a convenience when reporting errors.
lazy (bool, optional) – By default data is lazy-loaded, so that it is not converted from the dict format until needed. Use lazy=False to ensure that the data is immediately converted.
Examples
>>> from dnadna.datasets import DictSNPSource
>>> from dnadna.snp_sample import SNPSample
>>> sample = SNPSample([[0, 1], [1, 0]], [0.1, 0.2])
>>> source = DictSNPSource({(0, 0): sample.to_dict()},
...                        filename='scenario_0_0.json')
>>> source.scenarios
{(0, 0): {'SNP': ['01', '10'], 'POS': [0.1, 0.2]}}
>>> (0, 0) in source
True
>>> source[0, 0]
SNPSample(
    snp=tensor([[0, 1],
                [1, 0]], dtype=torch.uint8),
    pos=tensor([0.1000, 0.2000], dtype=torch.float64),
    pos_format={'normalized': True},
    path='scenario_0_0.json'
)
If the requested sample doesn’t exist in the dataset a MissingSNPSample exception is raised:
>>> (0, 1) in source
False
>>> source[0, 1]
Traceback (most recent call last):
...
dnadna.datasets.MissingSNPSample: could not load scenario 0 replicate 1 from "scenario_0_0.json": KeyError((0, 1))
- class dnadna.datasets.FileListSNPSource(filenames)[source]
Bases: object
SNP source that returns scenarios from a fixed list of arbitrary files.
Because the concepts of “scenarios” and “replicates” are not necessarily applicable to an arbitrary list of files, each file is considered a single scenario of one replicate (e.g. source[3, 0] returns the contents of the fourth file in the list).
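A hypothetical stdlib-only sketch of this index-to-file mapping (a stand-in with invented names, not the actual FileListSNPSource code):

```python
# Stand-in for the fixed list of files passed to the source.
filenames = ['a.npz', 'b.npz', 'c.npz', 'd.npz']

def lookup(scenario, replicate):
    """Map a (scenario, replicate) pair to a filename.

    Each file is treated as a single scenario with exactly one replicate,
    so only replicate index 0 is valid.
    """
    if replicate != 0:
        raise KeyError((scenario, replicate))
    return filenames[scenario]
```

For example, `lookup(3, 0)` returns the fourth filename, mirroring the `source[3, 0]` behavior described above.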
- exception dnadna.datasets.MissingSNPSample(scenario, replicate, path, reason=None)[source]
Bases: Exception
Exception raised when a specified sample is not found in an SNP source.
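For illustration, a hedged sketch of how such an exception might compose its message, matching the message format shown in the doctests elsewhere on this page (the constructor signature is from the class above; the body is an assumption, not the real implementation):

```python
class MissingSampleSketch(Exception):
    """Hypothetical stand-in for dnadna.datasets.MissingSNPSample."""

    def __init__(self, scenario, replicate, path, reason=None):
        self.scenario = scenario
        self.replicate = replicate
        self.path = path
        self.reason = reason
        # Message format copied from the doctest tracebacks above.
        super().__init__(
            f'could not load scenario {scenario} replicate {replicate} '
            f'from "{path}": {reason!r}')
```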
- class dnadna.datasets.NpzSNPSource(root_dir, dataset_name, filename_format=None, keys=('SNP', 'POS'), position_format=None, lazy=True)[source]
Bases: dnadna.datasets.SNPSource
SNP source that reads simulation data as SNPSamples stored on disk in DNADNA’s native “dnadna” format.
Each simulation is stored in a NumPy NPZ file containing two arrays, by default keyed by 'SNP' for the SNP matrix and 'POS' for the positions array.
There is one .npz file for each replicate of each scenario, laid out in a filesystem format. The exact layout and filename can be specified by the filename_format argument to this class’s constructor, but the default layout is as specified in NpzSNPSource.DEFAULT_NPZ_FILENAME_FORMAT, which is also the documented format assumed by the “dnadna” format.
- Parameters
root_dir (str, pathlib.Path) – The root directory of the DNADNA dataset. All filenames generated from the filename_format are appended to this directory.
dataset_name (str) – The name of the dataset; the same as that specified in the simulation config for this dataset.
- Keyword Arguments
filename_format (str, optional) – A string in Python format string syntax specifying the format for filenames of individual simulations in this dataset. The format string can contain 3 replacement fields: {dataset_name}, which is filled in with the model name given by the dataset_name parameter above; {scenario}, which is filled with the scenario index; and {replicate}, which is filled with the replicate index. If the scenario and replicate indices are zero-padded in the filenames, the amount of zero-padding may be explicitly specified by writing the format string like {scenario:05} (for scenario indices zero-padded to a width of 5). However, if no zero-padding is specified in the format string, the appropriate amount of zero-padding is automatically guessed from the filenames actually present in the dataset. Therefore the default filename_format, NpzSNPSource.DEFAULT_NPZ_FILENAME_FORMAT, can be used regardless of the amount of zero-padding used in a given dataset.
keys (tuple, optional) – A 2-tuple of (snp_key, pos_key) giving the keywords for the SNP matrix and the position array in the NPZ file. The default ('SNP', 'POS') is the default for the “dnadna” format, but different names may be specified for these arrays.
position_format (dict, optional) – The format of the position arrays in the dataset (currently all samples in the dataset are assumed to have the same position formats). Corresponds to the pos_format argument to SNPSample.
lazy (bool, optional) – By default data is lazy-loaded, so that it is not read from disk until needed. Use lazy=False to ensure that the data is immediately loaded into memory.
Examples
>>> import numpy as np
>>> from dnadna.datasets import NpzSNPSource
>>> from dnadna.snp_sample import SNPSample
>>> tmp = getfixture('tmp_path')  # pytest-specific
Make a few random SNP and position arrays:
>>> dataset = {}
>>> filename_format = 'my_model_{scenario:03}_{replicate:03}.npz'
>>> for scenario_idx, replicate_idx in zip(range(2), range(2)):
...     snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
...     pos = np.sort(np.random.random(10))
...     sample = SNPSample(snp, pos)
...     filename = tmp / filename_format.format(
...         scenario=scenario_idx, replicate=replicate_idx)
...     sample.to_npz(filename)
...     dataset[(scenario_idx, replicate_idx)] = sample
Instantiate the NpzSNPSource and load a couple of samples:
>>> source = NpzSNPSource(tmp, 'my_model', filename_format=filename_format)
>>> source[0, 0]
SNPSample(
    snp=tensor([[...],
                ...
                [...]], dtype=torch.uint8),
    pos=tensor([...], dtype=torch.float64),
    pos_format={'normalized': True},
    path=...Path('...my_model_000_000.npz')
)
>>> source[0, 0] == dataset[0, 0]
True
>>> source[1, 1] == dataset[1, 1]
True
>>> source[2, 0]
Traceback (most recent call last):
...
dnadna.datasets.MissingSNPSample: could not load scenario 2 replicate 0 from "...my_model_002_000.npz": FileNotFoundError(2, 'No file matching or similar to')
- DEFAULT_NPZ_FILENAME_FORMAT = 'scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz'
Default format string for filenames relative to the root_dir of an NpzSNPSource.
This is the default filesystem layout for the DNADNA format. Each scenario has its own directory named scenario_<scenario_idx>, where the scenario_idx is typically zero-padded the correct amount for the total number of scenarios in the dataset.
Each simulation file in a scenario has the filename <model-name>_<scenario_idx>_<replicate_idx>.npz, where both scenario_idx and replicate_idx are again zero-padded an appropriate amount.
In a simulation config with the option {"data_source": {"format": "dnadna"}}, this default filename format can be overridden with the {"data_source": {"filename_format": "..."}} option.
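Since the layout is driven by ordinary Python format strings, the expansion can be checked with str.format alone (stdlib sketch; the dataset name and indices are invented example values):

```python
# The default layout, as in DEFAULT_NPZ_FILENAME_FORMAT.
fmt = 'scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz'
path = fmt.format(dataset_name='my_model', scenario=4, replicate=1)
# -> 'scenario_4/my_model_4_1.npz'

# Zero-padding, when present in a dataset's filenames, can be requested
# explicitly with a format spec such as '{scenario:03}':
padded = ('scenario_{scenario:03}/'
          '{dataset_name}_{scenario:03}_{replicate:03}.npz')
path_padded = padded.format(dataset_name='my_model', scenario=4, replicate=1)
# -> 'scenario_004/my_model_004_001.npz'
```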
- classmethod from_config(config, validate=True)[source]
Instantiate an NpzSNPSource from a simulation Config matching the simulation schema.
- class dnadna.datasets.SNPSource[source]
Bases: dnadna.utils.plugins.Pluggable
A “SNPSource” is a class for loading SNPSample objects from some data source. Subclasses of this class represent different data formats from which samples can be loaded.
This is in a way “lower-level” than DNADataset. DNADataset is an abstraction that loads SNPSamples from a data source, possibly performs some transforms on them, and returns them. From the point of view of DNADataset, the actual on-disk format from which the samples are read is abstracted out to SNPSource.
In fact it may not even be an “on-disk” format; for example, one could implement an SNPSource plugin that loads samples from an S3 bucket.
The “main” implementation of SNPSource is NpzSNPSource, which loads samples organized on disk in the “dnadna” format. The other built-in implementations include:
FileListSNPSource – a simple format that simply reads a list of SNPSamples from a list of filenames; this is used primarily by the dnadna predict command for reading in a list of files on which to make predictions.
DictSNPSource – used primarily for testing, it can read samples from a JSON-compatible dict format; see its documentation for more details.
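A minimal hypothetical stand-in for the mapping-style interface an SNPSource exposes (invented class, stdlib only; the real base class is a Pluggable subclass with additional machinery):

```python
class InMemorySNPSource:
    """Stand-in source that loads samples keyed by (scenario, replicate)."""

    def __init__(self, data):
        # A dict of {(scenario, replicate): sample} stands in for a real
        # storage backend (disk files, an S3 bucket, etc.).
        self._data = data

    def __contains__(self, key):
        # Supports the `(scenario, replicate) in source` checks shown in
        # the DictSNPSource/NpzSNPSource examples.
        return key in self._data

    def __getitem__(self, key):
        # Supports `source[scenario, replicate]` lookup.
        scenario, replicate = key
        try:
            return self._data[scenario, replicate]
        except KeyError:
            raise LookupError(f'could not load scenario {scenario} '
                              f'replicate {replicate}')
```

A real subclass would also implement from_config (see below) so that it can be instantiated from a dataset configuration and used as a configurable plugin.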
- classmethod from_config(config, validate=True)[source]
Instantiate an SNPSource from a dataset Config matching the dataset schema.
Although configuration specific to a given SNPSource subclass may have its own format-specific schema, these are still passed the full dataset config, which may contain additional properties (such as data_root) that might be useful to a given format.
Subclasses should implement this method in order to specify how to instantiate the class from a config file; otherwise it cannot be used as a configurable plugin.
- classmethod from_config_file(filename, validate=True, **kwargs)[source]
Like from_config but given a filename instead of a Config object.
The additional keyword arguments are passed to the dict serializer, and the config is validated against the dataset schema.