dnadna.snp_sample

Implements the SNPSample class, a generic container for DNADNA’s SNP data, consisting of the SNP matrix itself and the SNP positions array. It includes built-in methods for reading SNP data from different file formats, as well as writing it out to different formats.

  • Classes for reading and writing SNPSample objects to/from different file formats and data representations. These are generally not used directly, but rather through the SNPSample.to_<format> and SNPSample.from_<format> methods on the SNPSample class itself. The available formats can be listed as follows:

    >>> from dnadna.snp_sample import SNPSample
    >>> SNPSample.converter_formats
    ['dict', 'npz']
    

    Additional converters can be registered simply by defining subclasses of SNPConverter (make sure the modules that define those subclasses are actually imported).

Classes

DictSNPConverter(data[, keys])

Converts SNPSamples to/from a JSON-compatible dict format.

NpzSNPConverter(filename[, keys])

Serialize SNPSamples to/from NPZ files.

SNPConverter()

Base class for converters between SNPSample and other objects representing SNPs.

SNPLoader()

Base class for SNP loaders.

SNPSample([snp, pos, pos_format, …])

Class representing a single SNP sample from a population.

SNPSerializer()

Base class for SNPSample serializers.

class dnadna.snp_sample.DictSNPConverter(data, keys=('SNP', 'POS'))[source]

Bases: dnadna.snp_sample.SNPConverter, dnadna.snp_sample.SNPLoader

Converts SNPSamples to/from a JSON-compatible dict format.

Also acts as an SNPLoader for lazy-loading when DictSNPConverter.from_dict is passed lazy=True (the default).

See DictSNPConverter.convert_to for a description of the data format.

classmethod convert_from(data, keys=('SNP', 'POS'), pos_format=None, path=None, lazy=True)

Convert a JSON-compatible data structure to an SNPSample.

See DictSNPConverter.convert_to for a description of the data format.

Examples

>>> from dnadna.snp_sample import SNPSample
>>> import numpy as np

Random SNP and position arrays:

>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> sample = SNPSample(snp, pos)
>>> sample2 = SNPSample.from_dict(sample.to_dict())
>>> sample == sample2
True
convert_to(keys=('SNP', 'POS'))

Convert the SNPSample to a JSON-compatible representation.

This format is similar to the NPZ format in that the SNP matrix and position arrays are output to properties given by the keys argument, which defaults to ('SNP', 'POS').

The position array is written as a JSON array of floats. The SNP matrix is written in a compact representation consisting of an array of SNPs, with each SNP represented as a string of 1s and 0s.

Examples

>>> import numpy as np
>>> from dnadna.snp_sample import SNPSample
>>> snp = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
>>> pos = np.array([0.1, 0.2, 0.3], dtype=np.float64)
>>> sample = SNPSample(snp, pos)
>>> sample.to_dict()
{'SNP': ['101', '010', '110'], 'POS': [0.1, 0.2, 0.3]}
classmethod from_dict(data, keys=('SNP', 'POS'), pos_format=None, path=None, lazy=True)[source]

Convert a JSON-compatible data structure to an SNPSample.

See DictSNPConverter.convert_to for a description of the data format.

Examples

>>> from dnadna.snp_sample import SNPSample
>>> import numpy as np

Random SNP and position arrays:

>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> sample = SNPSample(snp, pos)
>>> sample2 = SNPSample.from_dict(sample.to_dict())
>>> sample == sample2
True
get_data()[source]

Returns the SNP matrix and position array of an SNP sample as a tuple of torch.Tensor.

Must be implemented by subclasses.

to_dict(keys=('SNP', 'POS'))[source]

Convert the SNPSample to a JSON-compatible representation.

This format is similar to the NPZ format in that the SNP matrix and position arrays are output to properties given by the keys argument, which defaults to ('SNP', 'POS').

The position array is written as a JSON array of floats. The SNP matrix is written in a compact representation consisting of an array of SNPs, with each SNP represented as a string of 1s and 0s.

Examples

>>> import numpy as np
>>> from dnadna.snp_sample import SNPSample
>>> snp = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
>>> pos = np.array([0.1, 0.2, 0.3], dtype=np.float64)
>>> sample = SNPSample(snp, pos)
>>> sample.to_dict()
{'SNP': ['101', '010', '110'], 'POS': [0.1, 0.2, 0.3]}
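
The compact encoding can be sketched without DNADNA at all. The following is a minimal illustration, assuming (as in the example above) that each row of the SNP matrix becomes one string. The helper names encode_rows and decode_rows are hypothetical and are not part of dnadna:

```python
import json


def encode_rows(snp):
    """Encode each row of a 0/1 matrix as a string of '0' and '1' characters."""
    return [''.join(str(int(value)) for value in row) for row in snp]


def decode_rows(rows):
    """Invert encode_rows: turn each string back into a list of ints."""
    return [[int(char) for char in row] for row in rows]


snp = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
pos = [0.1, 0.2, 0.3]

# Mirror the default keys=('SNP', 'POS') layout described above
data = {'SNP': encode_rows(snp), 'POS': pos}

# The structure only holds strings and floats, so it survives a JSON round trip
restored = json.loads(json.dumps(data))
```

Storing one short string per row keeps the JSON far smaller than a nested array of integers would be, which is presumably the motivation for this representation.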
class dnadna.snp_sample.NpzSNPConverter(filename, keys=('SNP', 'POS'))[source]

Bases: dnadna.snp_sample.SNPSerializer, dnadna.snp_sample.SNPConverter, dnadna.snp_sample.SNPLoader

Serialize SNPSamples to/from NPZ files.

Provides SNPSample.to/from_npz methods.

Also acts as an SNPLoader for lazy-loading when NpzSNPConverter.from_npz is passed lazy=True (the default).

classmethod convert_from(filename, keys=('SNP', 'POS'), pos_format=None, lazy=True)

Read a SNPSample from a NumPy NPZ file.

An NPZ file can contain multiple arrays, each keyed by an array name. For SNP samples it is assumed that a given NPZ file contains at least an SNP matrix array and a position array. The argument keys (default ('SNP', 'POS')) should be a 2-tuple giving the array names to look for in the NPZ file for the SNP matrix and the positions respectively.

Examples

>>> import numpy as np
>>> from dnadna.snp_sample import SNPSample
>>> tmp = getfixture('tmp_path')  # pytest-specific

Random SNP and position arrays:

>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> np.savez(tmp / 'test.npz', SNP=snp, POS=pos)
>>> sample = SNPSample.from_npz(tmp / 'test.npz')
>>> (sample.snp.numpy() == snp).all()
True
>>> (sample.pos.numpy() == pos).all()
True
convert_to(filename, keys=('SNP', 'POS'), compressed=True)

Write a SNPSample to a NumPy NPZ file.

An NPZ file can contain multiple arrays, each keyed by an array name. See also NpzSNPConverter.load for the converse. The keys=('SNP', 'POS') argument can be overridden to save with different names for the SNP and position arrays.

If compressed=True (default) the NPZ archive is written with zip compression.

Examples

>>> import numpy as np
>>> from dnadna.snp_sample import SNPSample
>>> tmp = getfixture('tmp_path')  # pytest-specific

Random SNP and position arrays:

>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> sample = SNPSample(snp, pos)
>>> sample.to_npz(tmp / 'test.npz')
>>> sample == SNPSample.from_npz(tmp / 'test.npz')
True
classmethod from_npz(filename, keys=('SNP', 'POS'), pos_format=None, lazy=True)[source]

Read a SNPSample from a NumPy NPZ file.

An NPZ file can contain multiple arrays, each keyed by an array name. For SNP samples it is assumed that a given NPZ file contains at least an SNP matrix array and a position array. The argument keys (default ('SNP', 'POS')) should be a 2-tuple giving the array names to look for in the NPZ file for the SNP matrix and the positions respectively.

Examples

>>> import numpy as np
>>> from dnadna.snp_sample import SNPSample
>>> tmp = getfixture('tmp_path')  # pytest-specific

Random SNP and position arrays:

>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> np.savez(tmp / 'test.npz', SNP=snp, POS=pos)
>>> sample = SNPSample.from_npz(tmp / 'test.npz')
>>> (sample.snp.numpy() == snp).all()
True
>>> (sample.pos.numpy() == pos).all()
True
get_data()[source]

Returns the SNP matrix and position array of an SNP sample as a tuple of torch.Tensor.

Must be implemented by subclasses.

get_shape()[source]

For NPZ files it is possible to get the array shapes by reading the metadata without extracting the entire array.

It should be sufficient to find just the metadata for the SNP matrix.

Examples

>>> from dnadna.snp_sample import SNPSample, NpzSNPConverter
>>> import io
>>> out = io.BytesIO()
>>> snp = SNPSample([[1, 0], [0, 1], [1, 1]], [2, 3])
>>> snp.to_npz(out)
>>> out.seek(0)
0
>>> conv = NpzSNPConverter(out)
>>> conv.get_shape()
(3, 2)
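
To illustrate why the shape is available cheaply: an NPZ file is a zip archive whose members are .npy files, and each .npy file begins with a small header that records the array shape. The sketch below reads only that header using the standard library. It is not DNADNA's implementation; the function names read_npy_shape, npz_shape and the toy archive builder are hypothetical:

```python
import ast
import io
import struct
import zipfile

NPY_MAGIC = b'\x93NUMPY'


def read_npy_shape(fp):
    """Read only the header of an .npy stream and return its 'shape' entry."""
    if fp.read(6) != NPY_MAGIC:
        raise ValueError('not an NPY stream')
    major, _minor = fp.read(2)
    # Version 1.0 stores the header length in 2 bytes, later versions in 4
    if major == 1:
        (header_len,) = struct.unpack('<H', fp.read(2))
    else:
        (header_len,) = struct.unpack('<I', fp.read(4))
    # The header is the text of a Python dict literal
    header = ast.literal_eval(fp.read(header_len).decode('latin1').strip())
    return header['shape']


def npz_shape(file, key='SNP'):
    """Return the shape of the array stored under `key` without reading its data."""
    with zipfile.ZipFile(file) as archive:
        with archive.open(key + '.npy') as member:
            return read_npy_shape(member)


def build_toy_npz():
    """Hand-build a tiny archive holding a 3x2 uint8 array of zeros.

    The alignment padding that NumPy normally adds to the header is omitted.
    """
    header = b"{'descr': '|u1', 'fortran_order': False, 'shape': (3, 2), }\n"
    payload = (NPY_MAGIC + bytes([1, 0]) + struct.pack('<H', len(header))
               + header + bytes(6))
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w') as archive:
        archive.writestr('SNP.npy', payload)
    buf.seek(0)
    return buf


shape = npz_shape(build_toy_npz())
```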
classmethod load(filename_or_obj, keys=('SNP', 'POS'), pos_format=None, lazy=True)[source]

Implements the GenericSerializer interface for loading data from an NPZ file.

classmethod save(obj, filename, keys=('SNP', 'POS'), compressed=True)[source]

Implements the GenericSerializer interface for saving data to an NPZ file.

to_npz(filename, keys=('SNP', 'POS'), compressed=True)[source]

Write a SNPSample to a NumPy NPZ file.

An NPZ file can contain multiple arrays, each keyed by an array name. See also NpzSNPConverter.load for the converse. The keys=('SNP', 'POS') argument can be overridden to save with different names for the SNP and position arrays.

If compressed=True (default) the NPZ archive is written with zip compression.

Examples

>>> import numpy as np
>>> from dnadna.snp_sample import SNPSample
>>> tmp = getfixture('tmp_path')  # pytest-specific

Random SNP and position arrays:

>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> sample = SNPSample(snp, pos)
>>> sample.to_npz(tmp / 'test.npz')
>>> sample == SNPSample.from_npz(tmp / 'test.npz')
True
class dnadna.snp_sample.SNPConverter[source]

Bases: object

Base class for converters between SNPSample and other objects representing SNPs.

Similar interface to GenericSerializer except the inputs and outputs need not be files. In the case of SNPSerializer they are files, but see DictSNPConverter for a counter-example.

abstract classmethod convert_from(obj, *args, **kwargs)[source]

Convert the given object to an SNPSample.

abstract convert_to(*args, **kwargs)[source]

Convert the given SNPSample to the desired output type.

Note

The way these classes are used is such that they are never instantiated, but are instead containers for methods on the SNPSample class itself (see _SNPSampleMeta in the source code).

This is because when the .convert_to() method is called, self is not an instance of an SNPConverter, but rather it is an instance of SNPSample.
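
One way such a "container of methods" arrangement can work is sketched below. This is only an illustration of the idea, not the actual _SNPSampleMeta code; the names Converter, SampleMeta, Sample and CsvConverter are all hypothetical:

```python
import types


class Converter:
    """Hypothetical base class: subclasses are registered by format name."""
    registry = {}
    format = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.format:
            Converter.registry[cls.format] = cls


class SampleMeta(type):
    def __getattr__(cls, name):
        # Class-level lookup, e.g. Sample.from_csv
        if name.startswith('from_'):
            conv = Converter.registry.get(name[len('from_'):])
            if conv is not None:
                return conv.convert_from
        raise AttributeError(name)


class Sample(metaclass=SampleMeta):
    def __init__(self, values):
        self.values = list(values)

    def __getattr__(self, name):
        # Instance-level lookup, e.g. sample.to_csv
        if name.startswith('to_'):
            conv = Converter.registry.get(name[len('to_'):])
            if conv is not None:
                # Bind the converter's plain function to this Sample, so that
                # `self` inside convert_to is the Sample, not a Converter
                return types.MethodType(conv.convert_to, self)
        raise AttributeError(name)


class CsvConverter(Converter):
    format = 'csv'

    @classmethod
    def convert_from(cls, text):
        return Sample(int(value) for value in text.split(','))

    def convert_to(self):
        # Here `self` is a Sample instance
        return ','.join(str(value) for value in self.values)


sample = Sample.from_csv('1,0,1')
text = sample.to_csv()
```

Note that CsvConverter is never instantiated: it only supplies the functions that end up being called through Sample.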

property converters

List of all converter classes.

This list is cached to speed up later lookups, but building it relies on recursively evaluating __subclasses__() of every converter class. The cache must therefore be invalidated each time a new subclass is defined (see SNPConverter.__init_subclass__).

Examples

Test that this invalidation actually occurs when defining a new subclass:

>>> from dnadna.snp_sample import SNPConverter
>>> SNPConverter.formats
['dict', 'npz']
>>> class MyConverter(SNPConverter):
...     # note: it's not strictly necessary to define the to/from
...     # methods
...     format = 'my_format'
>>> SNPConverter.formats
['dict', 'my_format', 'npz']

This mechanism provides no way to “unregister” a format, though in practice the need for that is rare. To clean up, we have to delete the subclass and then invalidate the cache manually, e.g. by calling __init_subclass__:

>>> n_subclasses = len(SNPConverter.__subclasses__())
>>> del MyConverter
>>> SNPConverter.__init_subclass__()

Note: It’s not enough just to del MyConverter. Apparently type.__subclasses__ can still hold on to weak references (possibly as a weakref.WeakSet?) so there is a risk of resurrecting the deleted class if we try to rebuild the cache. Run a few rounds of garbage collection to really make sure it’s gone:

>>> import gc
>>> while len(SNPConverter.__subclasses__()) > n_subclasses - 1:
...     _ = gc.collect()
>>> SNPConverter.formats
['dict', 'npz']
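
The caching and invalidation pattern described above can be sketched in isolation. This is a simplified stand-in, assuming a classmethod rather than the class-level properties used by SNPConverter; the class names are hypothetical:

```python
class Converter:
    """Hypothetical converter base class with a cached list of format names."""

    format = None          # concrete subclasses set a format name
    _formats_cache = None  # None means "needs rebuilding"

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # A new subclass may introduce a new format, so drop the cached list
        Converter._formats_cache = None

    @classmethod
    def _all_subclasses(cls):
        # Walk the subclass tree recursively
        for sub in cls.__subclasses__():
            yield sub
            yield from sub._all_subclasses()

    @classmethod
    def formats(cls):
        if Converter._formats_cache is None:
            names = {sub.format for sub in Converter._all_subclasses()
                     if sub.format is not None}
            Converter._formats_cache = sorted(names)
        return list(Converter._formats_cache)


class DictConverter(Converter):
    format = 'dict'


class NpzConverter(Converter):
    format = 'npz'


before = Converter.formats()


class MyConverter(Converter):
    format = 'my_format'


after = Converter.formats()
```

Without the reset in __init_subclass__, the second call would return the stale list and the new format would never appear.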
abstract property format

Name of the format this implements (which may be different from the filename extension(s)). This is used to generate to/from_<format> methods on SNPSample.

property formats

Returns just the format names of all registered non-abstract converters.

Examples

>>> from dnadna.snp_sample import SNPConverter
>>> SNPConverter.formats
['dict', 'npz']
class dnadna.snp_sample.SNPLoader[source]

Bases: object

Base class for SNP loaders.

A loader is used for lazy-loading of SNP data. While the SNPConverter classes convert SNPSample objects to/from different formats (e.g. different file formats), a loader simply provides methods for getting the SNP matrix and position array data on demand.

An SNPLoader must at minimum implement the SNPLoader.get_data method which returns a tuple of torch.Tensor objects for the SNP matrix and position arrays respectively.

It may optionally implement an SNPLoader.get_shape which returns a tuple (n_indiv, n_snp): the number of individuals and the number of SNPs in the sample. This can be used as an optimization to get the dimensions of a sample without loading the full data.

abstract get_data()[source]

Returns the SNP matrix and position array of an SNP sample as a tuple of torch.Tensor.

Must be implemented by subclasses.

get_shape()[source]

Returns the dimensions of an SNPSample as a tuple of (n_indiv, n_snp).

The default implementation simply calls SNPLoader.get_data and returns the dimensions of the tensors. However, this may be overridden by subclasses to provide a more efficient implementation, e.g. that does not require loading the full data if there is metadata available to provide this information.
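
The relationship between get_data and get_shape can be sketched as follows. This is a minimal illustration using nested lists in place of torch.Tensor objects; the classes Loader, InMemoryLoader and HeaderLoader are hypothetical and only mimic the interface described above:

```python
from abc import ABC, abstractmethod


class Loader(ABC):
    """Hypothetical loader interface: get_data is required, get_shape optional."""

    @abstractmethod
    def get_data(self):
        """Return a (snp_matrix, positions) tuple."""

    def get_shape(self):
        # Default: load everything and measure it
        snp, _pos = self.get_data()
        return (len(snp), len(snp[0]) if snp else 0)


class InMemoryLoader(Loader):
    def __init__(self, snp, pos):
        self._snp, self._pos = snp, pos
        self.loads = 0  # counts how often the full data was requested

    def get_data(self):
        self.loads += 1
        return self._snp, self._pos


class HeaderLoader(InMemoryLoader):
    """Pretends that the shape is available as cheap metadata."""

    def __init__(self, snp, pos):
        super().__init__(snp, pos)
        self._header = (len(snp), len(snp[0]))

    def get_shape(self):
        # Answer from metadata alone, without touching get_data
        return self._header


snp = [[1, 0], [0, 1], [1, 1]]
pos = [0.2, 0.3]

plain = InMemoryLoader(snp, pos)
fast = HeaderLoader(snp, pos)
plain_shape = plain.get_shape()
fast_shape = fast.get_shape()
```

Both loaders report the same (n_indiv, n_snp) shape, but only the default implementation had to load the data to find it.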

class dnadna.snp_sample.SNPSample(snp=None, pos=None, pos_format=None, tensor_format=None, path=None, copy=False, loader=None, validate=True)[source]

Bases: object

Class representing a single SNP sample from a population.

Consists of an array of shape (n, m) where n is the number of individuals in the sample and m is the number of SNPs, along with a 1-D array of shape (m,) of SNP positions in the nucleotide.

By default positions are assumed to be normalized to the range [0.0, 1.0] of absolute positions, but this can be changed with the pos_format argument (see below).

The SNP and pos arrays can be given in any type that can be easily converted to a torch.Tensor.

Keyword Arguments
  • snp (list, numpy.ndarray, torch.Tensor) – (optional) – The SNP matrix. Must be provided unless a loader is provided.

  • pos (list, numpy.ndarray, torch.Tensor) – (optional) – The positions array. Must be provided unless a loader is provided.

  • pos_format (dict) – (optional) – A dict specifying how the positions are formatted. It can currently contain up to 4 keys (see the position_format property in the dataset schema). If not specified, the default assumption is {'distance': False, 'circular': False, 'normalized': False}, though it will be inferred whether or not the positions are normalized if not otherwise specified.

  • path (object) – (optional) – The path from which this SNPSample was loaded. Typically this will be a filesystem path as a str or pathlib.Path, but it may be anything depending on how the SNPSample was loaded. This is included for informational purposes only.

  • copy (bool) – (optional) – If True the data underlying snp and pos arguments are always copied. If False (default) a copy will be avoided if possible, but may still be necessary (e.g. when converting a Python list to torch.Tensor, or when the dtype needs to be converted).

  • loader (SNPLoader) – (optional) – If provided, the snp and/or pos arguments may be omitted. A loader allows lazy-loading of SNP matrix data on-demand. See the documentation for SNPLoader.

  • validate (bool) – (optional) – Validate the formats of the SNP and position tensors. This can be disabled for efficiency if you are sure they are already in the correct format. When validate=False make sure also to supply a correct pos_format argument (default: True).

Examples

>>> from dnadna.snp_sample import SNPSample
>>> snp = [[1, 0, 0, 1], [0, 1, 1, 0]]
>>> pos = [0.2, 0.4, 0.6, 0.8]
>>> samp = SNPSample(snp, pos)
>>> samp.snp
tensor([[1, 0, 0, 1],
        [0, 1, 1, 0]])
>>> samp.pos
tensor([0.2000, 0.4000, 0.6000, 0.8000], dtype=torch.float64)

The SNP and position arrays can be combined into a single array in one of two formats. The first, .product, takes the product of the two arrays, with the position array multiplied along the individuals axis:

>>> samp.product
tensor([[0.2000, 0.0000, 0.0000, 0.8000],
        [0.0000, 0.4000, 0.6000, 0.0000]], dtype=torch.float64)

Or the two arrays can be simply concatenated into a (n + 1, m) array, with the first row containing the positions and the remaining rows containing the SNPs:

>>> samp.concat
tensor([[0.2000, 0.4000, 0.6000, 0.8000],
        [1.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 1.0000, 1.0000, 0.0000]], dtype=torch.float64)

Alternatively, .tensor returns one or the other depending on the value of the .tensor_format attribute:

>>> bool((samp.concat == samp.tensor).all())
True
>>> samp2 = SNPSample(samp.snp, samp.pos, tensor_format='product')
>>> bool((samp2.product == samp2.tensor).all())
True

The optional path element (None by default) can give a data source-specific path from which the sample was read (typically a filename):

>>> SNPSample(snp, pos, path='one_event/scenario_000/one_event_000_0.npz')
SNPSample(
    snp=tensor([[1, 0, 0, 1],
                [0, 1, 1, 0]]),
    pos=tensor([0.2000, 0.4000, 0.6000, 0.8000], dtype=torch.float64),
    pos_format={'normalized': True},
    path='one_event/scenario_000/one_event_000_0.npz'
)
property concat

The concatenation of the pos array with the snp array.

The result has the same dtype as the pos array.

property converter_formats

List the names of all converter formats available for SNPSample.

For each format in this list, there are associated SNPSample.to_<format> and SNPSample.from_<format> methods available (where the latter is a classmethod).

Examples

>>> from dnadna.snp_sample import SNPSample
>>> SNPSample.converter_formats
['dict', 'npz']
>>> SNPSample.from_dict
<bound method DictSNPConverter.from_dict of
<class 'dnadna.snp_sample.DictSNPConverter'>>
>>> snp = SNPSample([[1, 0], [0, 1]], [0, 1])
>>> snp.to_dict
<bound method DictSNPConverter.to_dict of SNPSample(...)>

As you can see in the above examples, the converter methods are actually defined on the DictSNPConverter class, but they are made available directly as methods on SNPSample.

See also dir of SNPSample for a list of methods:

>>> dir(SNPSample)
[...from_dict, from_npz, ..., to_dict, to_npz...]
copy()[source]

Creates a copy of this SNPSample, including copying the snp and pos tensors.

copy_with(snp=None, pos=None, pos_format=None, tensor_format=None, path=None, copy=False, validate=None)[source]

Creates a copy of this SNPSample instance with any of the fields replaced.

If copy=True the storage for the snp and pos tensors is also copied; otherwise the same storage is referenced in the new SNPSample.
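
The difference between sharing and copying storage can be sketched with a small stand-in class. This is an illustration only, with Python lists in place of tensors; the Sample dataclass is hypothetical and much simpler than SNPSample:

```python
import copy as copy_module
from dataclasses import dataclass, replace


@dataclass
class Sample:
    """Hypothetical stand-in for SNPSample; lists stand in for tensors."""
    snp: list
    pos: list
    path: object = None

    def copy_with(self, copy=False, **changes):
        # replace() builds a new object, reusing every field not overridden
        new = replace(self, **changes)
        if copy:
            # Give the new object its own storage
            new.snp = copy_module.deepcopy(new.snp)
            new.pos = copy_module.deepcopy(new.pos)
        return new


original = Sample([[1, 0], [0, 1]], [0.25, 0.75], path='a.npz')
shared = original.copy_with(path='b.npz')    # same storage, new path
independent = original.copy_with(copy=True)  # separate storage

original.snp[0][0] = 0  # mutate the original storage in place
```

After the in-place change, shared sees the new value because it references the same lists, while independent keeps the old one.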

classmethod from_file(filename_or_obj, **kwargs)[source]

Read an SNPSample from a file using one of the known SNPSerializer types. The serialization format will be determined by the filename.

In the case of file-like objects it must have a .name or .filename attribute in order to guess the format.

For a usage example, see SNPSample.to_file.
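
How a format might be guessed from a filename or a file-like object can be sketched as below. This is an assumption-laden illustration, not DNADNA's lookup code: the guess_format function and the KNOWN_EXTENSIONS table are hypothetical:

```python
import io
from pathlib import Path

# Hypothetical mapping from filename extension to format name
KNOWN_EXTENSIONS = {'.npz': 'npz'}


def guess_format(filename_or_obj):
    """Return a format name based on the extension of a path or file object."""
    if isinstance(filename_or_obj, (str, Path)):
        name = filename_or_obj
    else:
        # File-like objects must advertise a name to be identifiable
        name = (getattr(filename_or_obj, 'name', None)
                or getattr(filename_or_obj, 'filename', None))
        if name is None:
            raise ValueError('file object has no .name or .filename attribute')

    suffix = Path(str(name)).suffix.lower()
    if suffix not in KNOWN_EXTENSIONS:
        raise ValueError(f'unknown file extension: {suffix!r}')
    return KNOWN_EXTENSIONS[suffix]


named = io.BytesIO()
named.name = 'out.npz'

from_path = guess_format(Path('scenario_000') / 'sample.npz')
from_obj = guess_format(named)
```

An anonymous io.BytesIO() carries no name, so in this sketch it is rejected with a ValueError rather than guessed at.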

property full_pos_format

Return the user-provided pos_format merged with the default value.

property loader

The SNPLoader used for lazy-loading this SNPSample, if any.

property n_indiv

The number of individuals in the sample.

property n_snp

The number of SNPs in the sample.

property path

The path from which this SNPSample was loaded.

Typically this will be a filesystem path as a str or pathlib.Path, but it may be anything depending on how the SNPSample was loaded. This is included for informational purposes only.

property pos

The positions array.

property pos_format

A dict specifying how the positions are formatted.

It can currently contain up to 4 keys (see the position_format property in the dataset schema). If not specified, the default assumption is {'distance': False, 'circular': False, 'normalized': False}, though it will be inferred whether or not the positions are normalized if not otherwise specified.

property product

The product of the pos array with the snp array.

The result has the same dtype as the pos array.
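
The product and concat combinations can be sketched with plain Python lists in place of tensors. The functions below are hypothetical helpers written for illustration, not the SNPSample implementation:

```python
def product(snp, pos):
    """Multiply each row of the SNP matrix elementwise by the positions."""
    return [[value * p for value, p in zip(row, pos)] for row in snp]


def concat(snp, pos):
    """Stack the positions as the first row above the SNP rows.

    SNP values are converted to float so the result has one uniform type,
    mirroring the note that the result takes the dtype of the pos array.
    """
    return [list(pos)] + [[float(value) for value in row] for row in snp]


snp = [[1, 0, 0, 1], [0, 1, 1, 0]]
pos = [0.2, 0.4, 0.6, 0.8]

prod = product(snp, pos)
cat = concat(snp, pos)
```

For n individuals and m SNPs, product keeps the (n, m) shape while concat yields (n + 1, m).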

property shape[source]

The number of individuals and number of SNPs as a tuple (n_indiv, n_snp).

property snp

The SNP matrix.

property tensor

Either SNPSample.concat or SNPSample.product depending on the value of SNPSample.tensor_format.

property tensor_format

The default format for SNPSample.tensor on this SNPSample.

If 'concat', it is equivalent to SNPSample.concat, and if 'product' it is equivalent to SNPSample.product (default: 'concat').

to_file(filename_or_obj, **kwargs)[source]

Serialize the SNPSample to a file or file-like object.

The appropriate serializer will be determined by the filename, as in SNPSample.from_file.

Examples

>>> import io
>>> from dnadna.snp_sample import SNPSample
>>> out = io.BytesIO()

A filename ending with .npz indicates the NPZ-based DNADNA format:

>>> out.name = 'out.npz'
>>> snp = SNPSample([[0, 1], [0, 0]], [0.1, 0.2])
>>> snp.to_file(out)
>>> _ = out.seek(0)
>>> snp2 = SNPSample.from_file(out)
>>> snp == snp2
True
class dnadna.snp_sample.SNPSerializer[source]

Bases: dnadna.utils.serializers.GenericSerializer

Base class for SNPSample serializers.