dnadna.snp_sample
Implements the SNPSample class, a generic container for DNADNA’s SNP data, consisting of the SNP matrix itself and the SNP positions array. It includes built-in methods for reading SNP data from different file formats, as well as writing it out to different formats.
The module also contains classes for reading and writing SNPSample objects to/from different file formats and data representations. These are generally not used directly, but rather through the SNPSample.to_<format> and SNPSample.from_<format> methods on the SNPSample class itself. The available formats can be listed like:
>>> from dnadna.snp_sample import SNPSample
>>> SNPSample.converter_formats
['dict', 'npz']
- DictSNPConverter - converts SNPSample to/from a JSON-serializable dict-based format.
- NpzSNPConverter - serializes and deserializes an SNPSample to/from an NPZ file.
Additional converters can be registered simply by defining subclasses of SNPConverter (make sure the modules the classes are in are actually imported).
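Classes
DictSNPConverter | Converts SNPSamples to/from a JSON-compatible dict format.
NpzSNPConverter | Serialize SNPSamples to/from NPZ files.
SNPConverter | Base class for converters between SNPSample and other objects representing SNPs.
SNPLoader | Base class for SNP loaders.
SNPSample | Class representing a single SNP sample from a population.
SNPSerializer | Base class for SNPSample serializers.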
- class dnadna.snp_sample.DictSNPConverter(data, keys=('SNP', 'POS'))
Bases: dnadna.snp_sample.SNPConverter, dnadna.snp_sample.SNPLoader
Converts SNPSamples to/from a JSON-compatible dict format.
Also acts as an SNPLoader for lazy-loading when DictSNPConverter.from_dict is passed lazy=True (the default).
See DictSNPConverter.convert_to for a description of the data format.
- classmethod convert_from(data, keys=('SNP', 'POS'), pos_format=None, path=None, lazy=True)
Convert a JSON-compatible data structure to an SNPSample.
See DictSNPConverter.convert_to for a description of the data format.
Examples
>>> from dnadna.snp_sample import SNPSample
>>> import numpy as np
Random SNP and position arrays:
>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> sample = SNPSample(snp, pos)
>>> sample2 = SNPSample.from_dict(sample.to_dict())
>>> sample == sample2
True
- convert_to(keys=('SNP', 'POS'))
Convert the SNPSample to a JSON-compatible representation.
This format is similar to the NPZ format in that the SNP matrix and position arrays are output to properties given by the keys argument, which defaults to ('SNP', 'POS').
The position array is written as a JSON array of floats. The SNP matrix is written in a compact representation consisting of an array of strings, one per row of the SNP matrix (as in the example below), each made up of 1s and 0s.
Examples
>>> from dnadna.snp_sample import SNPSample
>>> import numpy as np
>>> snp = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
>>> pos = np.array([0.1, 0.2, 0.3], dtype=np.float64)
>>> sample = SNPSample(snp, pos)
>>> sample.to_dict()
{'SNP': ['101', '010', '110'], 'POS': [0.1, 0.2, 0.3]}
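The dict layout described above can be sketched in plain Python, without dnadna or torch. The helper names encode_rows, decode_rows and to_plain_dict are hypothetical and exist only for this illustration; the sketch assumes each string encodes one row of the matrix, as the example output suggests.

```python
import json


def encode_rows(matrix):
    """Encode each row of a 0/1 matrix as a string such as '101'."""
    return ["".join(str(int(v)) for v in row) for row in matrix]


def decode_rows(rows):
    """Invert encode_rows: turn strings of 0s and 1s back into lists of ints."""
    return [[int(ch) for ch in row] for row in rows]


def to_plain_dict(snp, pos, keys=("SNP", "POS")):
    """Build a JSON-serializable dict keyed by the names given in keys."""
    snp_key, pos_key = keys
    return {snp_key: encode_rows(snp), pos_key: [float(p) for p in pos]}


snp = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
pos = [0.1, 0.2, 0.3]
encoded = to_plain_dict(snp, pos)

# The structure survives a trip through the json module unchanged.
round_trip = json.loads(json.dumps(encoded))
```

Storing rows as strings keeps the JSON far smaller than a nested list of integers, which is presumably why a compact representation was chosen.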
- classmethod from_dict(data, keys=('SNP', 'POS'), pos_format=None, path=None, lazy=True)
Convert a JSON-compatible data structure to an SNPSample.
See DictSNPConverter.convert_to for a description of the data format.
Examples
>>> from dnadna.snp_sample import SNPSample
>>> import numpy as np
Random SNP and position arrays:
>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> sample = SNPSample(snp, pos)
>>> sample2 = SNPSample.from_dict(sample.to_dict())
>>> sample == sample2
True
- get_data()
Returns the SNP matrix and position array of an SNP sample as a tuple of torch.Tensor.
Must be implemented by subclasses.
- to_dict(keys=('SNP', 'POS'))
Convert the SNPSample to a JSON-compatible representation.
This format is similar to the NPZ format in that the SNP matrix and position arrays are output to properties given by the keys argument, which defaults to ('SNP', 'POS').
The position array is written as a JSON array of floats. The SNP matrix is written in a compact representation consisting of an array of strings, one per row of the SNP matrix (as in the example below), each made up of 1s and 0s.
Examples
>>> from dnadna.snp_sample import SNPSample
>>> import numpy as np
>>> snp = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
>>> pos = np.array([0.1, 0.2, 0.3], dtype=np.float64)
>>> sample = SNPSample(snp, pos)
>>> sample.to_dict()
{'SNP': ['101', '010', '110'], 'POS': [0.1, 0.2, 0.3]}
- class dnadna.snp_sample.NpzSNPConverter(filename, keys=('SNP', 'POS'))
Bases: dnadna.snp_sample.SNPSerializer, dnadna.snp_sample.SNPConverter, dnadna.snp_sample.SNPLoader
Serialize SNPSamples to/from NPZ files.
Provides the SNPSample.to_npz and SNPSample.from_npz methods.
Also acts as an SNPLoader for lazy-loading when NpzSNPConverter.from_npz is passed lazy=True (the default).
- classmethod convert_from(filename, keys=('SNP', 'POS'), pos_format=None, lazy=True)
Read an SNPSample from a NumPy NPZ file.
An NPZ file can contain multiple arrays, each keyed by an array name. For SNP samples it is assumed that a given NPZ file contains at least a SNP matrix array and a position array. The keys argument (default ('SNP', 'POS')) should be a 2-tuple giving the array names to look for in the file for the SNP matrix and the positions respectively.
Examples
>>> import numpy as np
>>> from dnadna.snp_sample import SNPSample
>>> tmp = getfixture('tmp_path')  # pytest-specific
Random SNP and position arrays:
>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> np.savez(tmp / 'test.npz', SNP=snp, POS=pos)
>>> sample = SNPSample.from_npz(tmp / 'test.npz')
>>> (sample.snp.numpy() == snp).all()
True
>>> (sample.pos.numpy() == pos).all()
True
- convert_to(filename, keys=('SNP', 'POS'), compressed=True)
Write an SNPSample to a NumPy NPZ file.
An NPZ file can contain multiple arrays, each keyed by an array name. See also NpzSNPConverter.load for the converse. The keys=('SNP', 'POS') argument can be overridden to save with different names for the SNP and position arrays.
If compressed=True (default) the NPZ archive is written with zip compression.
Examples
>>> import numpy as np
>>> from dnadna.snp_sample import SNPSample
>>> tmp = getfixture('tmp_path')  # pytest-specific
Random SNP and position arrays:
>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> sample = SNPSample(snp, pos)
>>> sample.to_npz(tmp / 'test.npz')
>>> sample == SNPSample.from_npz(tmp / 'test.npz')
True
- classmethod from_npz(filename, keys=('SNP', 'POS'), pos_format=None, lazy=True)
Read an SNPSample from a NumPy NPZ file.
An NPZ file can contain multiple arrays, each keyed by an array name. For SNP samples it is assumed that a given NPZ file contains at least a SNP matrix array and a position array. The keys argument (default ('SNP', 'POS')) should be a 2-tuple giving the array names to look for in the file for the SNP matrix and the positions respectively.
Examples
>>> import numpy as np
>>> from dnadna.snp_sample import SNPSample
>>> tmp = getfixture('tmp_path')  # pytest-specific
Random SNP and position arrays:
>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> np.savez(tmp / 'test.npz', SNP=snp, POS=pos)
>>> sample = SNPSample.from_npz(tmp / 'test.npz')
>>> (sample.snp.numpy() == snp).all()
True
>>> (sample.pos.numpy() == pos).all()
True
- get_data()
Returns the SNP matrix and position array of an SNP sample as a tuple of torch.Tensor.
Must be implemented by subclasses.
- get_shape()
For NPZ files it is possible to get the array shapes by reading the metadata without extracting the entire array.
It should be sufficient to find just the metadata for the SNP matrix.
Examples
>>> from dnadna.snp_sample import SNPSample, NpzSNPConverter
>>> import io
>>> out = io.BytesIO()
>>> snp = SNPSample([[1, 0], [0, 1], [1, 1]], [2, 3])
>>> snp.to_npz(out)
>>> out.seek(0)
0
>>> conv = NpzSNPConverter(out)
>>> conv.get_shape()
(3, 2)
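As a rough illustration of why the shape is available without loading the array: an NPZ archive is a zip file whose members are NPY files, and each NPY file starts with a small text header that records the shape. The sketch below reads only that header using the standard library. It is not dnadna's implementation; the function names are hypothetical, and the archive is built by hand (format version 1.0, an assumption made to avoid depending on NumPy here).

```python
import ast
import io
import struct
import zipfile

NPY_MAGIC = b"\x93NUMPY"


def npy_header_shape(fileobj):
    """Return the 'shape' entry of an NPY header without reading array data."""
    if fileobj.read(6) != NPY_MAGIC:
        raise ValueError("not an NPY stream")
    major, _minor = fileobj.read(2)
    # Format 1.x stores the header length as uint16, later versions as uint32.
    fmt, size = ("<H", 2) if major == 1 else ("<I", 4)
    (header_len,) = struct.unpack(fmt, fileobj.read(size))
    # The header is a Python dict literal; headers here are plain ASCII.
    header = ast.literal_eval(fileobj.read(header_len).decode("latin1").strip())
    return header["shape"]


def npz_member_shape(npz_file, key):
    """Look up '<key>.npy' inside the zip archive and read just its header."""
    with zipfile.ZipFile(npz_file) as archive:
        with archive.open(key + ".npy") as member:
            return npy_header_shape(member)


def fake_npy(shape, payload):
    """Hand-build a version 1.0 NPY byte string for demonstration purposes."""
    header = "{'descr': '|u1', 'fortran_order': False, 'shape': %r, }\n" % (shape,)
    header = header.encode("latin1")
    return NPY_MAGIC + bytes([1, 0]) + struct.pack("<H", len(header)) + header + payload


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("SNP.npy", fake_npy((3, 2), bytes(6)))
buf.seek(0)
shape = npz_member_shape(buf, "SNP")
```

Only the first few dozen bytes of the member are read, so the cost does not grow with the size of the SNP matrix.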
- classmethod load(filename_or_obj, keys=('SNP', 'POS'), pos_format=None, lazy=True)
Implements the GenericSerializer interface for loading data from an NPZ file.
- classmethod save(obj, filename, keys=('SNP', 'POS'), compressed=True)
Implements the GenericSerializer interface for saving data to an NPZ file.
- to_npz(filename, keys=('SNP', 'POS'), compressed=True)
Write an SNPSample to a NumPy NPZ file.
An NPZ file can contain multiple arrays, each keyed by an array name. See also NpzSNPConverter.load for the converse. The keys=('SNP', 'POS') argument can be overridden to save with different names for the SNP and position arrays.
If compressed=True (default) the NPZ archive is written with zip compression.
Examples
>>> import numpy as np
>>> from dnadna.snp_sample import SNPSample
>>> tmp = getfixture('tmp_path')  # pytest-specific
Random SNP and position arrays:
>>> snp = (np.random.random((10, 10)) >= 0.5).astype('uint8')
>>> pos = np.sort(np.random.random(10))
>>> sample = SNPSample(snp, pos)
>>> sample.to_npz(tmp / 'test.npz')
>>> sample == SNPSample.from_npz(tmp / 'test.npz')
True
- class dnadna.snp_sample.SNPConverter
Bases: object
Base class for converters between SNPSample and other objects representing SNPs.
Similar interface to GenericSerializer except the inputs and outputs need not be files. In the case of SNPSerializer they are files, but see DictSNPConverter for a counter-example.
- abstract classmethod convert_from(obj, *args, **kwargs)
Convert the given object to an SNPSample.
- abstract convert_to(*args, **kwargs)
Convert the given SNPSample to the desired output type.
Note
The way these classes are used is such that they are never instantiated, but are instead containers for methods on the SNPSample class itself (see _SNPSampleMeta in the source code).
This is because when the .convert_to() method is called, self is not an instance of an SNPConverter, but rather it is an instance of SNPSample.
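The note above can be made concrete with a small stand-alone sketch. It uses a plain setattr call rather than a metaclass such as _SNPSampleMeta, so it should be read as an illustration of the idea only; Sample, ListConverter and attach are hypothetical names.

```python
class Sample:
    """Hypothetical stand-in for SNPSample holding plain Python lists."""

    def __init__(self, snp, pos):
        self.snp = snp
        self.pos = pos


class ListConverter:
    """Hypothetical converter; used only as a namespace for its methods."""

    format = "list"

    def convert_to(self):
        # When called through Sample, self is a Sample, not a ListConverter.
        return [self.pos] + self.snp

    @classmethod
    def convert_from(cls, rows):
        # cls is ListConverter here, so the target class is named explicitly.
        return Sample(snp=rows[1:], pos=rows[0])


def attach(target, converter):
    """Expose converter methods on target as to_<format> / from_<format>."""
    fmt = converter.format
    setattr(target, "to_" + fmt, converter.convert_to)
    setattr(target, "from_" + fmt, converter.convert_from)


attach(Sample, ListConverter)
sample = Sample([[1, 0], [0, 1]], [0.25, 0.75])
rows = sample.to_list()
restored = Sample.from_list(rows)
```

Because convert_to is looked up on the class, it is an ordinary function, and attaching it to Sample turns it into a regular method there; convert_from stays bound to the converter class.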
- property converters
List of all converter classes.
This is cached to speed up future lookups, but it relies on recursively evaluating all its __subclasses__(). Therefore, whenever a new subclass is defined the cache has to be invalidated (see SNPConverter.__init_subclass__).
Examples
Test that this invalidation actually occurs when defining a new subclass:
>>> from dnadna.snp_sample import SNPConverter
>>> SNPConverter.formats
['dict', 'npz']
>>> class MyConverter(SNPConverter):
...     # note: it's not strictly necessary to define the to/from
...     # methods
...     format = 'my_format'
>>> SNPConverter.formats
['dict', 'my_format', 'npz']
There is no way to “unregister” formats under this mechanism, but in practice the need for that would be rare. We just have to delete the subclass and then manually perform the cache invalidation, e.g. by manually calling __init_subclass__, in order to clean up:
>>> n_subclasses = len(SNPConverter.__subclasses__())
>>> del MyConverter
>>> SNPConverter.__init_subclass__()
Note: It’s not enough just to del MyConverter. Apparently type.__subclasses__ can still hold on to weak references (possibly as a weakref.WeakSet?) so there is a risk of resurrecting the deleted class if we try to rebuild the cache. Run a few rounds of garbage collection to really make sure it’s gone:
>>> import gc
>>> while len(SNPConverter.__subclasses__()) > n_subclasses - 1:
...     _ = gc.collect()
>>> SNPConverter.formats
['dict', 'npz']
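The caching scheme described above can be sketched independently of dnadna. The Converter class below is a hypothetical stand-in and uses classmethods rather than properties for simplicity; only the __init_subclass__ hook and __subclasses__() are standard Python.

```python
class Converter:
    """Hypothetical base class; subclasses register by setting a format name."""

    format = None
    _cache = None  # cached list of concrete converter classes

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Any new subclass makes the cached list stale.
        Converter._cache = None

    @classmethod
    def converters(cls):
        if Converter._cache is None:
            found, stack = [], list(Converter.__subclasses__())
            while stack:
                sub = stack.pop()
                stack.extend(sub.__subclasses__())
                # Classes without a string format are treated as abstract.
                if isinstance(sub.format, str):
                    found.append(sub)
            Converter._cache = sorted(found, key=lambda c: c.format)
        return Converter._cache

    @classmethod
    def formats(cls):
        return [c.format for c in cls.converters()]


class DictConverter(Converter):
    format = "dict"


class NpzConverter(Converter):
    format = "npz"


before = Converter.formats()


class MyConverter(Converter):
    format = "my_format"


after = Converter.formats()
```

This sketch does not deduplicate classes reachable through more than one parent, which a fuller version would need to handle.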
- abstract property format
Name of the format this implements (which may be different from the filename extension(s)). This is used to generate the to_<format> and from_<format> methods on SNPSample.
- property formats
Returns just the format names of all registered non-abstract converters.
Examples
>>> from dnadna.snp_sample import SNPConverter
>>> SNPConverter.formats
['dict', 'npz']
- class dnadna.snp_sample.SNPLoader
Bases: object
Base class for SNP loaders.
A loader is used for lazy-loading of SNP data. While the SNPConverter classes convert SNPSample objects to/from different formats (e.g. different file formats), a loader simply provides methods for getting the SNP matrix and position array data on-demand.
An SNPLoader must at minimum implement the SNPLoader.get_data method, which returns a tuple of torch.Tensor objects for the SNP matrix and position arrays respectively.
It may optionally implement an SNPLoader.get_shape method, which returns a tuple (n_indiv, n_snp): the number of individuals and the number of SNPs in the sample. This can be used as an optimization to get the dimensions of a sample without loading the full data.
- abstract get_data()
Returns the SNP matrix and position array of an SNP sample as a tuple of torch.Tensor.
Must be implemented by subclasses.
- get_shape()
Returns the dimensions of an SNPSample as a tuple of (n_indiv, n_snp).
The default implementation simply calls SNPLoader.get_data and returns the dimensions of the tensors. However, this may be overridden by subclasses to provide a more efficient implementation, e.g. one that does not require loading the full data if there is metadata available to provide this information.
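The relationship between get_data and get_shape can be sketched with plain lists in place of tensors. Loader, CountingLoader and MetadataLoader are hypothetical classes written for this illustration and are not part of dnadna.

```python
from abc import ABC, abstractmethod


class Loader(ABC):
    """Hypothetical loader interface mirroring get_data / get_shape."""

    @abstractmethod
    def get_data(self):
        """Return (snp, pos); here plain lists stand in for tensors."""

    def get_shape(self):
        # Default: load everything, then measure it.
        snp, _pos = self.get_data()
        return (len(snp), len(snp[0]) if snp else 0)


class CountingLoader(Loader):
    """Loads from memory and counts how often the full data is requested."""

    def __init__(self, snp, pos):
        self._snp, self._pos = snp, pos
        self.loads = 0

    def get_data(self):
        self.loads += 1
        return self._snp, self._pos


class MetadataLoader(CountingLoader):
    """Overrides get_shape using a shape recorded up front."""

    def __init__(self, snp, pos):
        super().__init__(snp, pos)
        self._shape = (len(snp), len(snp[0]))

    def get_shape(self):
        return self._shape


snp, pos = [[1, 0, 1], [0, 1, 0]], [0.1, 0.5, 0.9]
slow, fast = CountingLoader(snp, pos), MetadataLoader(snp, pos)
shapes = (slow.get_shape(), fast.get_shape())
```

Both loaders report the same (n_indiv, n_snp) pair, but only the default implementation has to touch the data to do so.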
- class dnadna.snp_sample.SNPSample(snp=None, pos=None, pos_format=None, tensor_format=None, path=None, copy=False, loader=None, validate=True)
Bases: object
Class representing a single SNP sample from a population.
Consists of an array of shape (n, m) where n is the number of individuals in the sample and m is the number of SNPs, along with a 1-D array of shape (m,) of SNP positions in the nucleotide.
By default positions are assumed to be normalized to the range [0.0, 1.0] of absolute positions, but this can be changed with the pos_format argument (see below).
The SNP and pos arrays can be given in any type that can be easily converted to a torch.Tensor.
Keyword Arguments
- snp (list, numpy.ndarray, torch.Tensor) – (optional) – The SNP matrix. Must be provided unless a loader is provided.
- pos (list, numpy.ndarray, torch.Tensor) – (optional) – The positions array. Must be provided unless a loader is provided.
- pos_format (dict) – (optional) – A dict specifying how the positions are formatted. It can currently contain up to 4 keys (see the position_format property in the dataset schema). If not specified, the default assumption is {'distance': False, 'circular': False, 'normalized': False}, though it will be inferred whether or not the positions are normalized if not otherwise specified.
- path (object) – (optional) – The path from which this SNPSample was loaded. Typically this will be a filesystem path as a str or pathlib.Path, but it may be anything depending on how the SNPSample was loaded. This is included for informational purposes only.
- copy (bool) – (optional) – If True the data underlying the snp and pos arguments are always copied. If False (default) a copy will be avoided if possible, but may still be necessary (e.g. when converting a Python list to torch.Tensor, or when the dtype needs to be converted).
- loader (SNPLoader) – (optional) – If provided, the snp and/or pos arguments may be omitted. A loader allows lazy-loading of SNP matrix data on-demand. See the documentation for SNPLoader.
- validate (bool) – (optional) – Validate the formats of the SNP and position tensors (default: True). This can be disabled for efficiency if you are sure they are already in the correct format. When validate=False make sure also to supply a correct pos_format argument.
Examples
>>> from dnadna.snp_sample import SNPSample
>>> snp = [[1, 0, 0, 1], [0, 1, 1, 0]]
>>> pos = [0.2, 0.4, 0.6, 0.8]
>>> samp = SNPSample(snp, pos)
>>> samp.snp
tensor([[1, 0, 0, 1],
        [0, 1, 1, 0]])
>>> samp.pos
tensor([0.2000, 0.4000, 0.6000, 0.8000], dtype=torch.float64)
The SNP and position arrays can be combined into a single array in one of two formats. The first is .product, which takes the product of the two arrays, with the position array multiplied along the individuals axis:
>>> samp.product
tensor([[0.2000, 0.0000, 0.0000, 0.8000],
        [0.0000, 0.4000, 0.6000, 0.0000]], dtype=torch.float64)
Or the two arrays can be simply concatenated into a (n + 1, m) array, with the first row containing the positions and the remaining rows containing the SNPs:
>>> samp.concat
tensor([[0.2000, 0.4000, 0.6000, 0.8000],
        [1.0000, 0.0000, 0.0000, 1.0000],
        [0.0000, 1.0000, 1.0000, 0.0000]], dtype=torch.float64)
Or just .tensor returns one or the other depending on the value of the .tensor_format attribute:
>>> bool((samp.concat == samp.tensor).all())
True
>>> samp2 = SNPSample(samp.snp, samp.pos, tensor_format='product')
>>> bool((samp2.product == samp2.tensor).all())
True
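For readers without torch at hand, the two layouts can be reproduced with nested lists. The functions product, concat and combined below are hypothetical helpers written for this sketch; they mirror the outputs shown above but are not the library's code.

```python
def product(snp, pos):
    """Multiply every row of snp element-wise by the positions."""
    return [[float(v) * p for v, p in zip(row, pos)] for row in snp]


def concat(snp, pos):
    """Stack pos as the first row above the rows of snp, all as floats."""
    return [[float(p) for p in pos]] + [[float(v) for v in row] for row in snp]


def combined(snp, pos, tensor_format="concat"):
    """Pick one of the two layouts by name, as .tensor is described to do."""
    layouts = {"concat": concat, "product": product}
    return layouts[tensor_format](snp, pos)


snp = [[1, 0, 0, 1], [0, 1, 1, 0]]
pos = [0.2, 0.4, 0.6, 0.8]
as_product = combined(snp, pos, "product")
as_concat = combined(snp, pos)
```

The product layout keeps the (n, m) shape, while the concat layout has shape (n + 1, m) because the positions occupy the first row.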
The optional path argument (None by default) can give a data source-specific path from which the sample was read (typically a filename):
>>> SNPSample(snp, pos, path='one_event/scenario_000/one_event_000_0.npz')
SNPSample(
    snp=tensor([[1, 0, 0, 1],
                [0, 1, 1, 0]]),
    pos=tensor([0.2000, 0.4000, 0.6000, 0.8000], dtype=torch.float64),
    pos_format={'normalized': True},
    path='one_event/scenario_000/one_event_000_0.npz'
)
- property concat
The concatenation of the pos array with the snp array.
The result has the same dtype as the pos array.
- property converter_formats
List the names of all converter formats available for SNPSample.
For each format in this list, there are associated SNPSample.to_<format> and SNPSample.from_<format> methods available (where the latter is a classmethod).
Examples
>>> from dnadna.snp_sample import SNPSample
>>> SNPSample.converter_formats
['dict', 'npz']
>>> SNPSample.from_dict
<bound method DictSNPConverter.from_dict of <class 'dnadna.snp_sample.DictSNPConverter'>>
>>> snp = SNPSample([[1, 0], [0, 1]], [0, 1])
>>> snp.to_dict
<bound method DictSNPConverter.to_dict of SNPSample(
    snp=tensor([[1, 0, 1],
                [0, 1, 0],
                [0, 1, 0]], dtype=torch.uint8),
    pos=tensor([1, 2, 3], tensor_format='concat')
)>
As you can see in the above examples, the converter methods are actually defined on the DictSNPConverter class, but they are made available directly as methods on SNPSample.
See also dir of SNPSample for a list of methods:
>>> dir(SNPSample)
[...from_dict, from_npz, ..., to_dict, to_npz...]
- copy_with(snp=None, pos=None, pos_format=None, tensor_format=None, path=None, copy=False, validate=None)
Creates a copy of this SNPSample instance with any of the fields replaced.
If copy=True the storage for the snp and pos tensors is also copied; otherwise the same storage is referenced in the new SNPSample.
- classmethod from_file(filename_or_obj, **kwargs)
Read an SNPSample from a file using one of the known SNPSerializer types. The serialization format will be determined by the filename.
A file-like object must have a .name or .filename attribute in order for the format to be guessed.
For a usage example, see SNPSample.to_file.
- property full_pos_format
Return the user-provided pos_format merged with the default value.
- property n_indiv
The number of individuals in the sample.
- property n_snp
The number of SNPs in the sample.
- property path
The path from which this SNPSample was loaded.
Typically this will be a filesystem path as a str or pathlib.Path, but it may be anything depending on how the SNPSample was loaded. This is included for informational purposes only.
- property pos
The positions array.
- property pos_format
A dict specifying how the positions are formatted.
It can currently contain up to 4 keys (see the position_format property in the dataset schema). If not specified, the default assumption is {'distance': False, 'circular': False, 'normalized': False}, though it will be inferred whether or not the positions are normalized if not otherwise specified.
- property product
The product of the pos array with the snp array.
The result has the same dtype as the pos array.
- property snp
The SNP matrix.
- property tensor
Either SNPSample.concat or SNPSample.product depending on the value of SNPSample.tensor_format.
- property tensor_format
The default format for SNPSample.tensor on this SNPSample.
If 'concat', it is equivalent to SNPSample.concat, and if 'product' it is equivalent to SNPSample.product (default: 'concat').
- to_file(filename_or_obj, **kwargs)
Serialize the SNPSample to a file or file-like object.
The appropriate serializer will be determined by the filename, as in SNPSample.from_file.
Examples
>>> import io
>>> from dnadna.snp_sample import SNPSample
>>> out = io.BytesIO()
A filename ending with .npz indicates the NPZ-based DNADNA format:
>>> out.name = 'out.npz'
>>> snp = SNPSample([[0, 1], [0, 0]], [0.1, 0.2])
>>> snp.to_file(out)
>>> _ = out.seek(0)
>>> snp2 = SNPSample.from_file(out)
>>> snp == snp2
True
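Choosing a serializer from a filename, or from the name attribute of a file-like object, might look roughly like the sketch below. JsonSerializer, SERIALIZERS and guess_serializer are hypothetical and stand in for the SNPSerializer machinery; JSON is used only so that the example runs with the standard library.

```python
import io
import json
from pathlib import Path


class JsonSerializer:
    """Hypothetical serializer; JSON is used only to keep the sketch short."""

    extensions = (".json",)

    @staticmethod
    def save(obj, fileobj):
        fileobj.write(json.dumps(obj).encode("utf-8"))

    @staticmethod
    def load(fileobj):
        return json.loads(fileobj.read().decode("utf-8"))


SERIALIZERS = [JsonSerializer]


def guess_serializer(filename_or_obj):
    """Pick a serializer from a path, or from a file object's name attribute."""
    if isinstance(filename_or_obj, (str, Path)):
        name = filename_or_obj
    else:
        name = getattr(filename_or_obj, "name", None) or getattr(
            filename_or_obj, "filename", None)
    if name is None:
        raise ValueError("cannot guess the format without a file name")
    suffix = Path(str(name)).suffix.lower()
    for serializer in SERIALIZERS:
        if suffix in serializer.extensions:
            return serializer
    raise ValueError("no serializer registered for %r" % suffix)


out = io.BytesIO()
out.name = "out.json"  # in-memory buffers need a name to be recognized
data = {"SNP": ["01", "00"], "POS": [0.1, 0.2]}
guess_serializer(out).save(data, out)
out.seek(0)
loaded = guess_serializer(out).load(out)
```

An unnamed buffer raises ValueError in this sketch, which matches the requirement above that file-like objects carry a .name or .filename attribute.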
- class dnadna.snp_sample.SNPSerializer
Bases: dnadna.utils.serializers.GenericSerializer
Base class for SNPSample serializers.