dnadna.transforms

Data transforms that can be applied during training.

Functions

keep_polymorphic_only(sample)

Remove sites that are no longer polymorphic in sample.

Classes

Compose(transforms)

Pseudo-transform that composes multiple transforms by applying them in order one after the other.

Crop([max_snp, max_indiv, keep_polymorphic_only])

Crop the SNP matrix and position array to a maximum size.

ReformatPosition([distance, normalized, …])

Changes the format of the input position array.

Rotate()

Given a sequence, return a random rotation of it along the SNP axis.

SnpFormat([format])

This transform specifies in what format the SNP matrix and position arrays are combined to form the input to the network.

Subsample(size[, keep_polymorphic_only])

Subsample a SNP matrix of size (n, k), with n individuals and k SNPs, and return a matrix of size (m, l), with m < n individuals and l <= k SNPs, since columns that are no longer polymorphic are not kept.

Transform()

Dataset transform.

ValidateSnp([uniform_shape])

A special transform that does not actually modify the data, but merely performs certain verifications on it.

Exceptions

InvalidSNPSample(msg[, sample])

Exception raised when a sample doesn’t meet the minimum requirements for the dataset.

TransformException(transform)

Exception raised when applying a Transform to an input.

class dnadna.transforms.Compose(transforms)

Bases: object

Pseudo-transform that composes multiple transforms by applying them in order one after the other.
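The chaining behaviour described above can be sketched without dnadna itself. ComposeSketch, add_one and double below are hypothetical stand-ins, not part of the library; the only thing taken from this page is that a transform is called with a tuple whose first element is the sample.

```python
class ComposeSketch:
    """Hypothetical stand-in: apply each transform in order."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, data):
        # Feed the output of each transform into the next one.
        for transform in self.transforms:
            data = transform(data)
        return data


def add_one(data):
    # Toy transform: the "sample" here is just a list of numbers.
    sample, learned_params, scenario_params = data
    return ([x + 1 for x in sample], learned_params, scenario_params)


def double(data):
    sample, learned_params, scenario_params = data
    return ([x * 2 for x in sample], learned_params, scenario_params)


pipeline = ComposeSketch([add_one, double])
result = pipeline(([1, 2, 3], None, None))
print(result[0])  # [4, 6, 8]
```

Order matters: swapping the two toy transforms doubles first and then adds one, which gives a different result.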

class dnadna.transforms.Crop(max_snp=None, max_indiv=None, keep_polymorphic_only=True)

Bases: dnadna.transforms.Transform

Crop the SNP matrix and position array to a maximum size.

Parameters

keep_polymorphic_only (bool) – If True, SNPs that are no longer polymorphic after cropping are removed.

Keyword Arguments
  • max_snp (int) – (optional) – crop the number of SNPs to at most max_snp

  • max_indiv (int) – (optional) – crop the number of individuals to at most max_indiv
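As a rough illustration of what cropping to a maximum size means, the sketch below works on plain Python lists rather than an SNPSample. The function name crop_sketch is hypothetical, and it assumes the leading rows and columns are the ones kept, which this page does not state.

```python
def crop_sketch(snp, pos, max_snp=None, max_indiv=None):
    """Keep at most max_indiv rows and max_snp columns (None means no limit)."""
    # Slicing with None as the upper bound keeps everything.
    rows = snp[:max_indiv]
    rows = [row[:max_snp] for row in rows]
    return rows, pos[:max_snp]


snp = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
pos = [5, 460, 900, 952]

cropped_snp, cropped_pos = crop_sketch(snp, pos, max_snp=3, max_indiv=2)
print(cropped_snp)  # [[0, 1, 0], [1, 1, 0]]
print(cropped_pos)  # [5, 460, 900]
```

In this toy input the second and third columns are no longer polymorphic once the last individual is dropped; this is the situation that keep_polymorphic_only=True is meant to handle.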

exception dnadna.transforms.InvalidSNPSample(msg, sample=None)

Bases: Exception

Exception raised when a sample doesn’t meet the minimum requirements for the dataset.

Used by ValidateSnp.

class dnadna.transforms.ReformatPosition(distance=None, normalized=None, circular=None, chromosome_size=None, initial_position=None)

Bases: dnadna.transforms.Transform

Changes the format of the input position array.

It can convert between normalized and unnormalized positions, and between distance and absolute position formats.

When initializing this transform, specify only the parameters for the properties you want to convert; anything left unspecified keeps its current format.

Warning

This transform should be applied before any other transform (e.g. Rotate) that can change the order of the positions, since it assumes that positions are all in increasing order.

Keyword Arguments
  • distance (bool) – (optional) – If True, convert positions to distances; if False, convert distances to positions; if left unspecified the current position format is kept.

  • normalized (bool) – (optional) – Whether SNP positions/distances are divided by the chromosome size. If True, unnormalized positions are converted to normalized positions; if False, normalized positions are converted to unnormalized positions; if left unspecified the current normalization is kept. The chromosome_size argument is also required when changing the normalization, unless the chromosome_size is already specified on the inputs.

  • chromosome_size (int) – (optional) – Length of the chromosome; required when transforming from normalized to unnormalized positions. If left unspecified, but the input SNPSample has a chromosome_size in its pos_format, then that value will be used.

  • circular (bool) – (optional) – Whether the chromosome should be treated as circular when performing the transformation. If left unspecified, the input’s circularity is kept.

  • initial_position (int or float) – (optional) – A position to use as the initial position when converting from circular positions.
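The arithmetic behind the distance and normalized conversions can be sketched in plain Python. The helper names below are hypothetical and are not part of dnadna; the numbers are the same as in the examples that follow, with chromosome_size = 1000.

```python
def positions_to_distances(positions):
    """First entry is the first position; the rest are gaps between neighbours."""
    distances = [positions[0]]
    for prev, cur in zip(positions, positions[1:]):
        distances.append(cur - prev)
    return distances


def distances_to_positions(distances):
    """Inverse of positions_to_distances: a running sum of the distances."""
    positions, total = [], 0
    for d in distances:
        total += d
        positions.append(total)
    return positions


def normalize(values, chromosome_size):
    """Divide positions or distances by the chromosome size."""
    return [v / chromosome_size for v in values]


positions = [5, 460, 900, 952]
distances = positions_to_distances(positions)
print(distances)                          # [5, 455, 440, 52]
print(distances_to_positions(distances))  # [5, 460, 900, 952]
print(normalize(positions, 1000))         # [0.005, 0.46, 0.9, 0.952]
```

The running-sum inverse only recovers the original positions when they were in increasing order to begin with, which is why the ordering warning above matters.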

Examples

>>> from dnadna.snp_sample import SNPSample
>>> from dnadna.transforms import ReformatPosition
>>> import numpy as np

Initial example with unnormalized absolute positions and chromosome_size = 1000:

>>> sample = SNPSample(np.eye(4), [5, 460, 900, 952],
...                    pos_format={'normalized': False, 'distance': False,
...                                'chromosome_size': 1000})
>>> xf = ReformatPosition(normalized=True)
>>> xf((sample, None, None))[0]
SNPSample(
    snp=tensor(...),
    pos=tensor([0.0050, 0.4600, 0.9000, 0.9520], dtype=torch.float64),
    pos_format={'normalized': True, 'distance': False,
                'chromosome_size': 1000}
)
>>> xf = ReformatPosition(distance=True)
>>> xf((sample, None, None))[0]
SNPSample(
    snp=tensor(...),
    pos=tensor([  5, 455, 440,  52]),
    pos_format={'normalized': False, 'distance': True,
                'chromosome_size': 1000}
)
>>> xf = ReformatPosition(distance=True, normalized=True)
>>> dist_norm = xf((sample, None, None))[0]
>>> dist_norm
SNPSample(
    snp=tensor(...),
    pos=tensor([0.0050, 0.4550, 0.4400, 0.0520], dtype=torch.float64),
    pos_format={'normalized': True, 'distance': True,
                'chromosome_size': 1000}
)

Convert from normalized distances back to unnormalized positions:

>>> xf = ReformatPosition(distance=False, normalized=False)
>>> xf((dist_norm, None, None))[0]
SNPSample(
    snp=tensor(...),
    pos=tensor([  5, 460, 900, 952]),
    pos_format={'normalized': False, 'distance': False,
                'chromosome_size': 1000}
)

Convert from normalized linear distances to circular distances:

>>> xf = ReformatPosition(circular=True, initial_position=0.005)
>>> xf((dist_norm, None, None))[0]
SNPSample(
    snp=tensor(...),
    pos=tensor([0.0530, 0.4550, 0.4400, 0.0520], dtype=torch.float64),
    pos_format={'normalized': True, 'distance': True,
                'chromosome_size': 1000, 'circular': True,
                'initial_position': 0.005}
)

Convert from positions to circular distances:

>>> xf = ReformatPosition(distance=True, circular=True)
>>> xf((sample, None, None))[0]
SNPSample(
    snp=tensor(...),
    pos=tensor([  53, 455, 440,  52]),
    pos_format={'normalized': False, 'distance': True,
                'chromosome_size': 1000, 'circular': True,
                'initial_position': 5}
)
>>> xf = ReformatPosition(distance=True, normalized=True, circular=True)
>>> circ_norm = xf((sample, None, None))[0]
>>> circ_norm
SNPSample(
    snp=tensor(...),
    pos=tensor([0.0530, 0.4550, 0.4400, 0.0520], dtype=torch.float64),
    pos_format={'normalized': True, 'distance': True,
                'chromosome_size': 1000, 'circular': True,
                'initial_position': 0.005}
)
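The circular outputs above differ from the linear ones only in the first entry. A minimal sketch that is consistent with those numbers, using hypothetical helper names and assuming that the first circular distance is measured from the last SNP by wrapping around the end of the chromosome:

```python
def to_circular_distances(positions, chromosome_size):
    """Return (circular distances, initial position) for sorted positions."""
    initial_position = positions[0]
    # Distance from the last SNP to the first one, wrapping around the end.
    distances = [(positions[0] - positions[-1]) % chromosome_size]
    for prev, cur in zip(positions, positions[1:]):
        distances.append(cur - prev)
    return distances, initial_position


def from_circular_distances(distances, initial_position):
    """Back to linear distances: the first entry becomes the initial position."""
    return [initial_position] + list(distances[1:])


circ, start = to_circular_distances([5, 460, 900, 952], 1000)
print(circ, start)                           # [53, 455, 440, 52] 5
print(from_circular_distances(circ, start))  # [5, 455, 440, 52]
```

The wrap-around term is 5 + (1000 - 952) = 53, which is why initial_position has to be stored in pos_format: without it the original first position cannot be recovered from the circular distances.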

Convert circular distances back to non-circular distances:

>>> xf = ReformatPosition(circular=False)
>>> xf((circ_norm, None, None))[0]
SNPSample(
    snp=tensor(...),
    pos=tensor([0.0050, 0.4550, 0.4400, 0.0520], dtype=torch.float64),
    pos_format={'normalized': True, 'distance': True,
                'chromosome_size': 1000, 'circular': False,
                'initial_position': 0.005}
)

class dnadna.transforms.Rotate

Bases: dnadna.transforms.Transform

Given a sequence, return a random rotation of it along the SNP axis.

Args:

None
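A rotation along the SNP axis can be pictured as a cyclic shift of the columns by a random offset. The sketch below uses plain lists and a hypothetical function name; shifting the position array together with the SNP matrix is an assumption, suggested by the ReformatPosition warning that Rotate can change the order of positions.

```python
import random


def rotate_sketch(snp, pos, rng=random):
    """Cyclically shift the SNP columns (and positions) by a random offset."""
    n_snp = len(pos)
    if n_snp == 0:
        return snp, pos, 0
    shift = rng.randrange(n_snp)  # offset in [0, n_snp)
    rotated_snp = [row[shift:] + row[:shift] for row in snp]
    rotated_pos = pos[shift:] + pos[:shift]
    return rotated_snp, rotated_pos, shift


snp = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]
pos = [5, 460, 900, 952]

rotated_snp, rotated_pos, shift = rotate_sketch(snp, pos, random.Random(0))
print(shift, rotated_pos)
```

After a non-zero shift the positions are no longer in increasing order, so position reformatting has to happen before a rotation of this kind.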

class dnadna.transforms.SnpFormat(format='concat')

Bases: dnadna.transforms.Transform

This transform specifies in what format the SNP matrix and position arrays are combined to form the input to the network.

Currently this can be one of:

  • concat: the position array and the SNP matrix are concatenated vertically with the position array becoming the first row of the tensor (this is the default, even if this transform is not used explicitly).

  • product: the SNP matrix is multiplied by the position array, so that each active site has the value of its position, rather than just 1.
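The two formats can be illustrated on plain lists. The function names are hypothetical; the library itself works on tensors, so this only shows the shape and values of the result.

```python
def combine_concat(snp, pos):
    """Position array becomes the first row, followed by the SNP rows."""
    return [list(pos)] + [list(row) for row in snp]


def combine_product(snp, pos):
    """Each active site (1) takes the value of its position; 0 stays 0."""
    return [[allele * p for allele, p in zip(row, pos)] for row in snp]


snp = [
    [0, 1, 1],
    [1, 0, 1],
]
pos = [0.1, 0.5, 0.9]

print(combine_concat(snp, pos))
# [[0.1, 0.5, 0.9], [0, 1, 1], [1, 0, 1]]
print(combine_product(snp, pos))
# [[0.0, 0.5, 0.9], [0.1, 0.0, 0.9]]
```

With concat the result has one more row than the SNP matrix; with product it keeps the shape of the SNP matrix.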

class dnadna.transforms.Subsample(size, keep_polymorphic_only=True)

Bases: dnadna.transforms.Transform

Subsample a SNP matrix of size (n, k), with n individuals and k SNPs, and return a matrix of size (m, l), with m < n individuals and l <= k SNPs, since columns that are no longer polymorphic are not kept.

Parameters
  • size (int, tuple, list) – Number of individuals to keep. If a tuple or list is given, the number of individuals is drawn at random from the range defined by its values.

  • keep_polymorphic_only (bool) – If True, SNPs that are no longer polymorphic after subsampling are removed.
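The (n, k) to (m, l) reduction can be sketched as follows. Everything here is hypothetical: the function name, the use of plain lists, the assumption that a tuple size is an inclusive range, and the assumption that positions are absolute rather than distances.

```python
import random


def subsample_sketch(snp, pos, size, keep_polymorphic_only=True, rng=random):
    """Keep a random subset of individuals, then drop monomorphic columns."""
    if isinstance(size, (tuple, list)):
        # Assumption: both ends of the range are inclusive.
        size = rng.randint(size[0], size[1])
    kept_rows = sorted(rng.sample(range(len(snp)), size))
    rows = [snp[i] for i in kept_rows]
    if keep_polymorphic_only:
        # A column is polymorphic if it still contains more than one allele.
        keep = [j for j in range(len(pos))
                if len({row[j] for row in rows}) > 1]
        rows = [[row[j] for j in keep] for row in rows]
        pos = [pos[j] for j in keep]
    return rows, pos


snp = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]
pos = [5, 460, 900, 952]

sub_snp, sub_pos = subsample_sketch(snp, pos, 2, rng=random.Random(1))
print(len(sub_snp), len(sub_pos))
```

The number of rows always equals the requested size, while the number of columns depends on which individuals were drawn.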

class dnadna.transforms.Transform

Bases: dnadna.utils.plugins.Pluggable

Dataset transform.

When loading SNPSamples from the dataset, these transforms are applied to the samples to modify either the position or SNP matrix arrays, or both.

To implement a transform you must provide its __call__ method, which takes as input a tuple consisting of the SNPSample being loaded from the dataset, the parameters being trained as a LearnedParams, and the parameter values associated with the sample’s scenario, as loaded from the Pandas DataFrame.
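The shape of such a __call__ method can be sketched without importing dnadna. SampleSketch and FlipAlleles below are hypothetical stand-ins; a real transform would subclass dnadna.transforms.Transform and receive an SNPSample. Returning a tuple of the same three elements is an assumption, based on the ReformatPosition examples indexing the result with [0].

```python
from collections import namedtuple

# Hypothetical stand-in for SNPSample, just enough for this sketch.
SampleSketch = namedtuple("SampleSketch", ["snp", "pos"])


class FlipAlleles:
    """Hypothetical transform: swap 0 and 1 in the SNP matrix."""

    def __call__(self, data):
        # Unpack the three-element input tuple described above.
        sample, learned_params, scenario_params = data
        flipped = [[1 - allele for allele in row] for row in sample.snp]
        # Positions are left untouched; only the SNP matrix changes.
        new_sample = SampleSketch(flipped, sample.pos)
        return (new_sample, learned_params, scenario_params)


sample = SampleSketch(snp=[[0, 1], [1, 1]], pos=[5, 460])
out = FlipAlleles()((sample, None, None))
print(out[0].snp)  # [[1, 0], [0, 0]]
```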

classmethod get_schema()

Provide a schema for validating a single transform in a list of transforms in the config file (see the training config schema for example usage).

exception dnadna.transforms.TransformException(transform)

Bases: Exception

Exception raised when applying a Transform to an input.

Parameters

transform (dnadna.transforms.Transform) – The transform that caused the exception.

class dnadna.transforms.ValidateSnp(uniform_shape=True)

Bases: dnadna.transforms.Transform

A special transform that does not actually modify the data, but merely performs certain verifications on it.

If verification fails the data sample will be excluded from batches returned by the data loader.

Currently there is only one verification supported, which is to verify that all SNP samples have the same shape (same number of SNPs and individuals).

This can be combined e.g. with Crop to first crop the SNP samples to a maximum size, then verify that they are of a consistent shape with previous samples in the dataset.

Keyword Arguments

uniform_shape (bool) – (optional) – Check whether all SNP samples in the dataset have the same shape (same number of SNPs and individuals).
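The uniform-shape check and the exclusion of failing samples can be sketched as below. All names are hypothetical stand-ins (InvalidSampleSketch plays the role of InvalidSNPSample), and comparing each sample against the first shape seen is an assumption about how "consistent with previous samples" is decided.

```python
class InvalidSampleSketch(Exception):
    """Hypothetical stand-in for InvalidSNPSample."""


class UniformShapeCheck:
    """Pass samples through unchanged, but reject shape mismatches."""

    def __init__(self):
        self.expected_shape = None

    def __call__(self, snp):
        shape = (len(snp), len(snp[0]) if snp else 0)
        if self.expected_shape is None:
            # The first sample seen defines the reference shape.
            self.expected_shape = shape
        elif shape != self.expected_shape:
            raise InvalidSampleSketch(
                f"expected shape {self.expected_shape}, got {shape}")
        return snp


def filter_valid(samples, check):
    """Mimic a loader that skips samples failing verification."""
    kept = []
    for snp in samples:
        try:
            kept.append(check(snp))
        except InvalidSampleSketch:
            continue
    return kept


samples = [
    [[0, 1], [1, 0]],
    [[1, 1], [0, 1]],
    [[0, 1, 1], [1, 0, 0]],  # one SNP too many
]
valid = filter_valid(samples, UniformShapeCheck())
print(len(valid))  # 2
```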

dnadna.transforms.keep_polymorphic_only(sample)

Remove sites that are no longer polymorphic in sample.

Both the SNP matrix and position vector are filtered. If position is encoded as distance, distances between SNPs are adjusted.

Parameters

sample (SNPSample) – sample to filter
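The filtering and the distance adjustment can be illustrated on plain lists. The function below is a hypothetical sketch, not the library implementation; it assumes that when positions are encoded as distances, the distance of each removed site is added to the next kept site so that absolute locations are preserved.

```python
def keep_polymorphic_sketch(snp, pos, distance=False):
    """Drop monomorphic columns from snp and the matching entries of pos."""
    n_sites = len(pos)
    # A site is polymorphic if more than one allele appears in its column.
    keep = {j for j in range(n_sites) if len({row[j] for row in snp}) > 1}
    filtered_snp = [[row[j] for j in range(n_sites) if j in keep]
                    for row in snp]
    if not distance:
        return filtered_snp, [pos[j] for j in range(n_sites) if j in keep]
    # Distance encoding: carry the gaps of removed sites over to the
    # next kept site.
    filtered_pos, gap = [], 0
    for j in range(n_sites):
        gap += pos[j]
        if j in keep:
            filtered_pos.append(gap)
            gap = 0
    return filtered_snp, filtered_pos


snp = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]

print(keep_polymorphic_sketch(snp, [5, 460, 900, 952]))
# ([[0, 1], [1, 0]], [5, 952])
print(keep_polymorphic_sketch(snp, [5, 455, 440, 52], distance=True))
# ([[0, 1], [1, 0]], [5, 947])
```

In the distance case the second kept site ends up with 455 + 440 + 52 = 947, so the running sum 5 + 947 = 952 still matches its absolute position.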