dnadna.datasets

Utilities for loading data from data sets, including support for different data set formats:

  • Classes for reading different dataset formats. Datasets are collections of SNP files for multiple scenarios, possibly with multiple replicates per scenario:

    • The NpzSNPSource class reads a data set of multiple parameter scenarios with (possibly) multiple replicates per scenario, stored in NPZ files in a particular filesystem layout, known as the DNADNA Format. This is the default data set format understood by DNADNA.

    • The DictSNPSource class reads a JSON-based dataset format which is less efficient in both storage compactness and parsing/serialization speed, but allows plain-text storage of SNP data. Currently it is used primarily in testing.

  • The DNATrainingDataset class and its simpler base class DNADataset are implementations of a PyTorch Dataset used for loading SNP data (in the form of SNPSamples) along with the associated scenario parameters, for both training sets and validation sets during model training. This works independently of the dataset format: the format itself is implemented as an SNPSource (such as the two listed above), which is the abstract interface for arbitrary dataset formats.

Classes

DNADataset([config, validate, source, …])

Simplified base class for DNADNA datasets which simply maps an integer index to an SNPSample instance from the simulation dataset.

DNATrainingDataset([config, validate, …])

Dataset used during model training; extends DNADataset with support for data transformations and separate training and validation sets.

DatasetTransformationMixIn(config[, …])

Partially implemented Dataset which accepts parameters for transforming the SNP data returned from the data source.

DictSNPSource(scenarios[, position_format, …])

SNP source that reads from a JSON-like data structure consisting of a dict with (simulation, replicate) pairs for keys, and SNPSamples in JSON-compatible format for values (see to_dict).

FileListSNPSource(filenames)

SNP source that returns scenarios from a fixed list of arbitrary files.

NpzSNPSource(root_dir, dataset_name[, …])

SNP source that reads simulation data as SNPSamples stored on disk in DNADNA’s native “dnadna” format.

SNPSource()

An “SNPSource” is a class for loading SNPSample objects from some data source.

Exceptions

MissingSNPSample(scenario, replicate, path)

Exception raised when a specified sample is not found in an SNP source.

class dnadna.datasets.DNADataset(config={}, validate=True, source=None, scenario_params=None, scenario_set=None, cached_set=None)[source]

Bases: Generic[torch.utils.data.dataset.T_co]

Simplified base class for DNADNA datasets which simply maps an integer index to an SNPSample instance from the simulation dataset.

This has two modes of operation: One where a scenario_params table is given as a pandas.DataFrame in the format described for the DNADNA Format. In this case, all the scenarios and replicates described in that table are returned (where they exist), and for each item in the dataset a (scenario_idx, replicate_idx, snp_sample, scenario_params) tuple is returned.

In the second mode of operation, scenario_params is not given, and the data sources are simply looped over directly. In this case a 4-tuple of (scenario_idx, replicate_idx, snp_sample, None) is returned for each item.

The DNATrainingDataset is the more complete implementation which can perform additional transformations on the data when used in model training, and which keeps separate training and validation sets.

Given a scenario_set=<scenario_idx> argument, only the data in a single scenario are returned; this may also be a list/set of scenario indices to consider.
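As an illustration of the second mode of operation described above, the flat-index-to-tuple mapping can be sketched in plain Python. This is a hypothetical stand-in for the real class (which reads SNPSamples from an SNPSource), shown only to make the 4-tuple layout concrete:

```python
# Simplified sketch (an assumption, not DNADNA's actual implementation) of
# how a DNADataset-style class maps a flat integer index onto a
# (scenario_idx, replicate_idx, snp_sample, scenario_params) tuple.
class ToyDataset:
    def __init__(self, source):
        # source: dict keyed by (scenario_idx, replicate_idx) pairs
        self.keys = sorted(source)
        self.source = source

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        scenario_idx, replicate_idx = self.keys[index]
        sample = self.source[scenario_idx, replicate_idx]
        # Second mode of operation: no scenario_params table was given,
        # so the last element of the 4-tuple is None.
        return (scenario_idx, replicate_idx, sample, None)

data = ToyDataset({(0, 0): 'sample00', (0, 1): 'sample01', (1, 0): 'sample10'})
print(data[2])  # -> (1, 0, 'sample10', None)
```

In the first mode of operation the final None would instead be the row of scenario parameters for that scenario.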

property cached_set

Indices whose samples should be cached in memory.

classmethod from_config_file(filename, *args, validate=True, source=None, scenario_params=None, scenario_set=None, **kwargs)[source]

Load the Config from a file.

Additional keyword arguments are passed to from_file and on to the dict serializer, and the config is validated against the dataset schema.

get(index, ignore_missing=None)[source]

Same as DNATrainingDataset.__getitem__ but adds additional optional arguments.

Parameters

index (int) – Index of the sample to get from the dataset.

Keyword Arguments

ignore_missing (bool) – (optional) – Whether to ignore (instead of raising an error for) a sample file that is missing or can’t be loaded for another reason. By default this defers to the ignore_missing option in the dataset configuration, but passing it here overrides the config file.

class dnadna.datasets.DNATrainingDataset(config={}, validate=True, source=None, scenario_params=None, transforms=None, learned_params=None)[source]

Bases: Generic[torch.utils.data.dataset.T_co]

classmethod from_config_file(filename, validate=True, source=None, scenario_params=None, transforms=None, learned_params=None, **kwargs)[source]

Load the Config from a file.

Additional keyword arguments are passed to from_file and on to the dict serializer, and the config is validated against the training schema.

class dnadna.datasets.DatasetTransformationMixIn(config, transforms=None, param_set=None, **kwargs)[source]

Bases: Generic[torch.utils.data.dataset.T_co]

Partially implemented Dataset which accepts parameters for transforming the SNP data returned from the data source.

Parameters
  • transforms (list) – list giving transform names or transform descriptions (a transform name plus its parameters) as specified in the dataset_transforms property of the training config file; see also the training schema documentation. May also contain instances of Transform.

  • param_set (ParamSet) – ParamSet object representing all the details of the parameters to learn in training, including the values of those parameters for the training and validation sets (the pre-processed scenario params); information about the parameters can be used by some transforms.

  • Additional positional and keyword arguments are passed to super().__init__() so that this class can be used as a mix-in with arbitrary DNADataset subclasses.

static collate_batch(batch)[source]

Specifies how multiple scenario samples are collated into batches.

Each batch element is a single element as returned by DNATrainingDataset.__getitem__: (scenario_idx, replicate_idx, snp_sample, target).

Input samples and targets are collated into batches “vertically”, so that the size of the first dimension represents the number of items in a batch.

Examples

>>> import torch
>>> from dnadna.datasets import DNATrainingDataset
>>> from dnadna.snp_sample import SNPSample
>>> fake_snps = [torch.rand(3, 3 + i) for i in range(5)]
>>> fake_snps = [SNPSample(s[1:], s[0]) for s in fake_snps]
>>> fake_params = [torch.rand(4, dtype=torch.float64) for _ in range(5)]
>>> fake_batch = list(zip(range(5), [0] * 5, fake_snps, fake_params))
>>> collated_batch = DNATrainingDataset.collate_batch(fake_batch)
>>> scenario_idxs, inputs, targets = collated_batch
>>> bool((torch.arange(5) == scenario_idxs).all())
True
>>> inputs.shape  # last dim should be num SNPs in largest fake SNP
torch.Size([5, 3, 7])
>>> bool((inputs[0,:3,:3] == fake_snps[0].tensor).all())
True
>>> bool((inputs[0,3:,3:] == -1).all())
True
>>> bool((inputs[-1] == fake_snps[-1].tensor).all())
True
>>> targets.shape
torch.Size([5, 4])
>>> [bool((fake_params[bat].float() == targets[bat]).all())
...  for bat in range(targets.shape[0])]
[True, True, True, True, True]

get(index, ignore_missing=None)[source]

Same as DNATrainingDataset.__getitem__ but adds additional optional arguments.

Parameters

index (int) – Index of the sample to get from the dataset.

Keyword Arguments

ignore_missing (bool) – (optional) – Whether to ignore (instead of raising an error for) a sample file that is missing or can’t be loaded for another reason. By default this defers to the ignore_missing option in the dataset configuration, but passing it here overrides the config file.

property test_set

Set of indices to use for testing.

property training_set

Set of indices to use for training.

property transforms

The composed set of transforms to apply to the dataset.

Either dnadna.transforms.Compose or a dict mapping dataset splits (“training”, “validation”, “test”) to their corresponding Compose of transforms.

property validation_set

Set of indices to use for validation.

class dnadna.datasets.DictSNPSource(scenarios, position_format=None, filename=None, lazy=True)[source]

Bases: dnadna.datasets.SNPSource

SNP source that reads from a JSON-like data structure consisting of a dict with (scenario, replicate) pairs for keys, and SNPSamples in JSON-compatible format for values (see to_dict).

Currently used just by the test suite, but may be useful in other contexts as well (e.g. serialization of simulations).

Parameters

scenarios (dict) – dict with (scenario, replicate) tuple keys, and values in the format output by to_dict; the values may also be SNPSample instances (useful for testing).

Keyword Arguments
  • position_format (dict) – (optional) – Position format dict corresponding to the pos_format argument to SNPSample (currently all samples in the dataset are assumed to have the same position formats).

  • filename (str) – (optional) – If the scenarios dict was read from a file (e.g. a JSON or YAML file) this can be set to the filename; this is used just as a convenience when reporting errors.

  • lazy (bool) – (optional) – By default data is lazy-loaded, so that it is not converted from the dict format until needed. Use lazy=False to ensure that the data is immediately converted.

Examples

>>> from dnadna.datasets import DictSNPSource
>>> from dnadna.snp_sample import SNPSample
>>> sample = SNPSample([[0, 1], [1, 0]], [0.1, 0.2])
>>> source = DictSNPSource({(0, 0): sample.to_dict()},
...                        filename='scenario_0_0.json')
>>> source.scenarios
{(0, 0): {'SNP': ['01', '10'], 'POS': [0.1, 0.2]}}
>>> (0, 0) in source
True
>>> source[0, 0]
SNPSample(
    snp=tensor([[0, 1],
                [1, 0]], dtype=torch.uint8),
    pos=tensor([0.1000, 0.2000], dtype=torch.float64),
    pos_format={'normalized': True},
    path='scenario_0_0.json'
)

If the requested sample doesn’t exist in the dataset a MissingSNPSample exception is raised:

>>> (0, 1) in source
False
>>> source[0, 1]
Traceback (most recent call last):
...
dnadna.datasets.MissingSNPSample: could not load scenario 0 replicate 1
from "scenario_0_0.json": KeyError((0, 1))

class dnadna.datasets.FileListSNPSource(filenames)[source]

Bases: object

SNP source that returns scenarios from a fixed list of arbitrary files.

Because the concepts of “scenarios” and “replicates” are not necessarily applicable to an arbitrary list of files, each file is considered a single scenario with one replicate (e.g. source[3, 0] returns the contents of the fourth file in the list).
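The (scenario, replicate) indexing convention described above can be sketched in plain Python. This is a hypothetical minimal re-implementation of just the mapping, not DNADNA's actual FileListSNPSource code:

```python
# Sketch (an assumption) of the FileListSNPSource indexing convention:
# file i in the list is treated as scenario i with a single replicate 0.
class FileListMapping:
    def __init__(self, filenames):
        self.filenames = list(filenames)

    def __getitem__(self, key):
        scenario, replicate = key
        if replicate != 0:
            # Each "scenario" has exactly one replicate in this scheme.
            raise KeyError(key)
        return self.filenames[scenario]

files = FileListMapping(['a.npz', 'b.npz', 'c.npz', 'd.npz'])
print(files[3, 0])  # -> 'd.npz', the fourth file in the list
```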

exception dnadna.datasets.MissingSNPSample(scenario, replicate, path, reason=None)[source]

Bases: Exception

Exception raised when a specified sample is not found in an SNP source.

class dnadna.datasets.NpzSNPSource(root_dir, dataset_name, filename_format=None, keys=('SNP', 'POS'), position_format=None, lazy=True)[source]

Bases: dnadna.datasets.SNPSource

SNP source that reads simulation data as SNPSamples stored on disk in DNADNA’s native “dnadna” format.

Each simulation is stored in a NumPy NPZ file containing two arrays, by default keyed by 'SNP' for the SNP matrix, and 'POS' for the positions array.

There is one .npz file for each replicate of each scenario, laid out in a filesystem format. The exact layout and filename can be specified by the filename_format argument to this class’s constructor, but the default layout is as specified in NpzSNPSource.DEFAULT_NPZ_FILENAME_FORMAT, which is also the documented format assumed by the “dnadna” format.

Parameters
  • root_dir (str, pathlib.Path) – The root directory of the DNADNA dataset. All filenames generated from the filename_format are appended to this directory.

  • dataset_name (str) – The name of the dataset; the same as that specified in the simulation config for this dataset.

Keyword Arguments
  • filename_format (str) – (optional) – A string in Python format string syntax specifying the format of the filenames of individual simulations in this dataset. The format string can contain 3 replacement fields: {dataset_name}, which is filled in with the name given by the dataset_name parameter above; {scenario}, which is filled in with the scenario index; and {replicate}, which is filled in with the replicate index. If the scenario and replicate indices are zero-padded in the filenames, the amount of zero-padding may be specified explicitly by writing the format string like {scenario:05} (for scenario indices zero-padded to 5 digits). However, if no zero-padding is specified in the format string, the appropriate amount of zero-padding is guessed automatically from the filenames actually present in the dataset. Therefore the default filename_format, NpzSNPSource.DEFAULT_NPZ_FILENAME_FORMAT, can be used regardless of the amount of zero-padding used in a given dataset.

  • keys (tuple) – (optional) – A 2-tuple of (snp_key, pos_key) giving the keywords for the SNP matrix and the position array in the NPZ file. The default ('SNP', 'POS') is the default for the “dnadna” format, but different names may be specified for these arrays.

  • position_format (dict) – (optional) – The format of the position arrays in the dataset (currently all samples in the dataset are assumed to have the same position formats). Corresponds to the pos_format argument to SNPSample.

  • lazy (bool) – (optional) – By default data is lazy-loaded, so that it is not read from disk until needed. Use lazy=False to ensure that the data is immediately loaded into memory.

Examples

>>> import numpy as np
>>> from dnadna.datasets import NpzSNPSource
>>> from dnadna.snp_sample import SNPSample
>>> tmp = getfixture('tmp_path')  # pytest-specific

Make a few random SNP and position arrays:

>>> dataset = {}
>>> filename_format = 'my_model_{scenario:03}_{replicate:03}.npz'
>>> for scenario_idx, replicate_idx in zip(range(2), range(2)):
...     snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
...     pos = np.sort(np.random.random(10))
...     sample = SNPSample(snp, pos)
...     filename = tmp / filename_format.format(
...         scenario=scenario_idx, replicate=replicate_idx)
...     sample.to_npz(filename)
...     dataset[(scenario_idx, replicate_idx)] = sample

Instantiate the NpzSNPSource and load a couple samples:

>>> source = NpzSNPSource(tmp, 'my_model', filename_format=filename_format)
>>> source[0, 0]
SNPSample(
    snp=tensor([[...],
                ...
                [...]], dtype=torch.uint8),
    pos=tensor([...], dtype=torch.float64),
    pos_format={'normalized': True},
    path=...Path('...my_model_000_000.npz')
)
>>> source[0, 0] == dataset[0, 0]
True
>>> source[1, 1] == dataset[1, 1]
True
>>> source[2, 0]
Traceback (most recent call last):
...
dnadna.datasets.MissingSNPSample: could not load scenario 2 replicate 0
from "...my_model_002_000.npz": FileNotFoundError(2, 'No file matching or
similar to')

DEFAULT_NPZ_FILENAME_FORMAT = 'scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz'

Default format string for filenames relative to the root_dir of an NpzSNPSource.

This is the default filesystem layout for the DNADNA format. Each scenario has its own directory named scenario_<scenario_idx> where the scenario_idx is typically zero-padded the correct amount for the total number of scenarios in the dataset.

Each simulation file in a scenario has the filename <model-name>_<scenario_idx>_<replicate_idx>.npz where both scenario_idx and replicate_idx are again zero-padded an appropriate amount.

In a simulation config with the option {"data_source": {"format": "dnadna"}}, this default filename format can be overridden with the {"data_source": {"filename_format": "..."}} option.
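To make the layout concrete, here is a small sketch of how the default format string expands into a path. The helper function and the padding width of 3 are illustrative assumptions; DNADNA itself guesses the padding from the files actually present in the dataset:

```python
# Sketch of expanding DEFAULT_NPZ_FILENAME_FORMAT into a relative path.
# npz_path and the pad=3 default are hypothetical, for illustration only.
DEFAULT_FMT = 'scenario_{scenario}/{dataset_name}_{scenario}_{replicate}.npz'

def npz_path(dataset_name, scenario, replicate, pad=3):
    # Zero-pad the indices before substituting them into the template.
    scenario_str = str(scenario).zfill(pad)
    replicate_str = str(replicate).zfill(pad)
    return DEFAULT_FMT.format(dataset_name=dataset_name,
                              scenario=scenario_str,
                              replicate=replicate_str)

print(npz_path('my_model', 4, 17))
# -> 'scenario_004/my_model_004_017.npz'
```

The resulting path is relative to the root_dir of the NpzSNPSource, as described above.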

classmethod from_config(config, validate=True)[source]

Instantiate an NpzSNPSource from a simulation Config matching the simulation schema.

class dnadna.datasets.SNPSource[source]

Bases: dnadna.utils.plugins.Pluggable

An “SNPSource” is a class for loading SNPSample objects from some data source.

Subclasses of this class represent different data formats from which samples can be loaded.

This is in a way “lower-level” than DNADataset. DNADataset is an abstraction that loads SNPSamples from a data source, possibly performs some transforms on them, and returns them. From the point of view of DNADataset the actual on-disk format from which the samples are read is abstracted out to SNPSource.

In fact it need not even be an “on-disk” format; for example, one could implement an SNPSource plugin that loads samples from an S3 bucket.

The “main” implementation of SNPSource is NpzSNPSource which loads samples organized on disk in the “dnadna” format. The other built-in implementations include:

  • FileListSNPSource – a simple format that reads SNPSamples from a fixed list of filenames; it is used primarily by the dnadna predict command for reading in a list of files on which to make predictions.

  • DictSNPSource – used primarily for testing, it can read samples from a JSON-compatible dict format; see its documentation for more details.
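Since SNPSource is described above as the abstract interface behind the dataset formats, a minimal sketch of what an implementation might look like may help. The indexing and membership behavior mirrors the documented usage (source[scenario, replicate] and (scenario, replicate) in source); everything else here, including the class name, is a hypothetical stand-alone illustration rather than a real dnadna plugin (real plugins subclass dnadna.datasets.SNPSource):

```python
# Hypothetical in-memory SNPSource-style class, sketched against the
# documented access pattern: indexing by (scenario, replicate) pairs and
# membership tests.  Not DNADNA's actual plugin machinery.
class InMemorySNPSource:
    def __init__(self, samples):
        # samples: dict keyed by (scenario, replicate) pairs
        self._samples = dict(samples)

    def __contains__(self, key):
        return key in self._samples

    def __getitem__(self, key):
        scenario, replicate = key
        try:
            return self._samples[scenario, replicate]
        except KeyError:
            # A real implementation would raise MissingSNPSample here.
            raise LookupError(
                f'could not load scenario {scenario} replicate {replicate}')

src = InMemorySNPSource({(0, 0): 'snp-matrix'})
print((0, 0) in src, src[0, 0])
```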

classmethod from_config(config, validate=True)[source]

Instantiate an SNPSource from a dataset Config matching the dataset schema.

Although configuration specific to a given SNPSource subclass may have its own format-specific schema, these are still passed the full dataset config, which may contain additional properties (such as data_root) that might be useful to a given format.

Subclasses should implement this method in order to specify how to instantiate it from a config file; otherwise it cannot be used as a configurable plugin.

classmethod from_config_file(filename, validate=True, **kwargs)[source]

Like from_config but given a filename instead of a Config object.

The additional keyword arguments are passed to the dict serializer, and the config is validated against the dataset schema.