Missing data¶
Supporting missing data¶
Consider the following alignment of 10 sequences, with 30 sites, of which 10 are polymorphic (those at indexes: 2, 5, 7, 8, 12, 13, 14, 18, 22, and 26):
GCACCATCATGACTTACGGTATTATCGTAT
.....T......A.A...........C...
.....T.GT...A.................
......................A...C...
......................A...C...
..T..T.GT.............A...C...
..T..T.......G................
..T..T........A...........C...
..T....GT.....A...C...A...C...
..T..T........A...C.......C...
Assumed it is saved a Fasta in align6A.fas
. Without missing data,
computing diversity is relatively straightforward:
>>> cs = egglib.stats.ComputeStats()
>>> cs.add_stats('lseff', 'nseff', 'S', 'sites')
>>> alnA = egglib.io.from_fasta('align6A.fas',egglib.alphabets.DNA)
>>> print(cs.process_align(alnA))
{'sites': [2, 5, 7, 8, 12, 13, 14, 18, 22, 26], 'lseff': 30, 'S': 10, 'nseff': 10.0}
Suppose now that the same data were obtained but with less than a perfect sequencing technology and that consequently some data are missing:
GCACCATCATGACTTACGGTATTATCGTAT
.....T......A.A...........C...
.....T.GT...A.................
......................A...C...
......................A...C...
..T..T.GW....K....S...A...C...
..T..T.......G............S...
..T..T........A...........S...
..T....GT.....A...C...A...C...
..T..T........A.??????????????
By default, all sites with at least one missing data will be excluded, causing a significant loss of data:
>>> alnB = egglib.io.from_fasta('align6B.fas',egglib.alphabets.DNA)
>>> print(cs.process_align(alnB))
{'sites': [2, 5, 7, 12, 14], 'lseff': 14, 'S': 5, 'nseff': 10.0}
Note that it is important to always refer to lseff
to know how many
sites have been actually used for the analysis (usually smaller than
ls
). In this case, only 14 of the 30 sites are
considered. In particular, the sites 8 and 13, which are polymorphic,
are excluded because the 6th sequence has ambiguity data. Furthermore,
all sites after index 15 are excluded because the last sequence (and
sometimes other ones) has missing data. Among those 14 sites included,
only 5 remain polymorphic.
EggLib has a specific option to allow considering sites even if they
contain missing data, up to a user-specified threshold. In
stats.ComputeStats.process_align()
, the threshold is expressed as a
proportion. Assume we want to allow for only one missing data (10%). We
can do it as shown below:
>>> print(cs.process_align(alnB, max_missing=0.1))
{'sites': [2, 5, 7, 8, 12, 13, 14, 22], 'lseff': 28, 'S': 8, 'nseff': 9.5}
This analysis has included 28 sites: all except two that have more than one
missing piece of data (sites at indexes 18 and 26, which appear to be both polymorphic).
The statistic nseff
gives the average number of samples in sites included in the
analysis. While max_missing was set to the default (0), all included sites
necessarily had have all samples, so the value was always equal to 10. Now,
included sites may have less than all samples available. For 14 of them all
samples are available, while the 14 others have one missing data, thereby the
average value of 9.5. Whenever possible, computations take this effect into
account and in all cases frequencies are expressed relatively to the effective
number of samples for a given site.
It is possible to be even more liberal and allow for more missing data. Here, we need to allow for at least 3 missing data to accept all sites (and detect all polymorphic sites):
>>> print(cs.process_align(alnB, max_missing=0.3))
{'sites': [2, 5, 7, 8, 12, 13, 14, 18, 22, 26], 'lseff': 30, 'S': 10, 'nseff': 9.366666666666667}
Eventually, all sites are processed and all 10 polymorphic sites are detected, while the average number of samples in used sites drops to 9.37.
The cost of increasing the value for max_missing is primarily that large values (if there are many missing data) lead to including a lot of sites with few exploitable samples for which all statistics will be poorly estimated, which will be in the end counter-productive.
Linkage disequilibrium statistics¶
Part of the statistics connected to linkage between distant sites can be
computed using the stats.ComputeStats
class. Other require separate
tools, also available in the stats module.
From ComputeStats
¶
A number of linkage-related statistics are available among the list of
statistics proposed by ComputeStats
. They include proper
linkage disequilibrium statistics such as Kelly et al.’s and Rozas
et al.’s statistics (such as Za
, ZZ
and ZnS
) and closely
related as Hudson’s Rmin
. We can also consider that statistics
based on haplotypes are somewhat connected (including K
, Hudson’s
Fst
, Fu’s Fs
, but also Ramos-Onsins and Rozas’s R2
or
Wall’s B
and Q
). Please refer to Diversity statistics for the
full list. It should be noted that all those statistics require that
the data are phased, meaning that the order of samples must match over
all provided sites (including the order of alleles within individuals
if the individuals are not haploid). This may seem obvious, but keep in
mind that you may compute haplotype or linkage disequilibrium
statistics over completely unrelated sites (with non-matching lists of
samples). Provided that the number of samples matches, EggLib will
compute all statistics you ask and it is up to you to decide whether
they are meaningful or not.
This being said, you can compute linkage disequilibrium and haplotype
statistics using the methods process_align()
and
process_sites()
of ComputeStats
. In both cases, it
will be assumed that the data are phased. When using an alignment,
computing this category of statistics is rather straighforward, but
note that it is also possible to process a list of Site
instances as in the example below:
>>> site1 = egglib.site_from_list('AAAAAAAACCCCCCCC', egglib.alphabets.DNA)
>>> site2 = egglib.site_from_list('GGGGGGGGGGGTTTTT', egglib.alphabets.DNA)
>>> site3 = egglib.site_from_list('CCCCCCAAAAAAAAAA', egglib.alphabets.DNA)
>>> site4 = egglib.site_from_list('TTTTAAAAAAATTTTT', egglib.alphabets.DNA)
>>> site5 = egglib.site_from_list('CCGGGGGGGGGGCCCG', egglib.alphabets.DNA)
>>> site6 = egglib.site_from_list('AATTAAAAAAAAAAAT', egglib.alphabets.DNA)
>>> cs = egglib.stats.ComputeStats()
>>> cs.add_stats('Rmin', 'Rintervals', 'ZnS', 'Ki')
>>> print(cs.process_sites([site1, site2, site3, site4, site5, site6]))
{'Rintervals': [(2, 3)], 'Rmin': 1, 'Ki': 8, 'ZnS': 0.17767944289156412}
Note that several options can be set using configure()
to control the computation of several statistics in this category (refer to
the documentation of this method for all details).
Note also that the presence of missing data can be a problem, since, on one hand, a sequence that contain any missing data cannot be easily used while identifying haplotypes and, on the other hand, the effect of missing data is magnified in the context of pairwise comparisons. To compute this family of statistics, it can be better to remove samples for which the amount of missing data is above a given threshold while keeping the argument max_missing to a (very) low value.
Pairwise linkage disequilibrium¶
There are two functions in the stats module to compute linkage
disequilibrium between sites: one to process a pair of sites (stats.pairwise_LD()
)
and one to process all sites for an alignment (stats.matrix_LD()
).
The stats.pairwise_LD()
function¶
This function takes two sites as arguments. It is up to the user to provide
sites with little enough missing data to make the computation relevant.
The function has options to control what happens if there are more than two
alleles at either site (this is not addressed here; see the function’s documentation
for more details).
The fragment of code below shows what the function does:
>>> print(egglib.stats.pairwise_LD(site1, site2))
{'D': 0.15625, 'Dp': 1.0, 'r': 0.674199862463242, 'rsq': 0.4545454545454545, 'n': 1}
>>> print(egglib.stats.pairwise_LD(site1, site4))
{'D': -0.03125, 'Dp': -0.14285714285714285, 'r': -0.1259881576697424, 'rsq': 0.01587301587301587, 'n': 1}
The values are:
n
, the number of pair of alleles considered (significant if there are more than two alleles at either site).
D
, linkage disequilibrium.
Dp
, Lewontin’s D’.
r
, Pearson’s correlation coefficient.
rsq
, squared Pearson’s correlation coefficient.
Note that if statistics cannot be computed (typically, because of the presence of
missing data), values in the returned dictionary are replaced by None
.
The stats.matrix_LD()
function¶
This function generates the linkage disequilibrium matrix between all pairs of sites (for which computation is possible) from a user-provided alignment. This function has a bunch of options:
Options to control what happens if there are more than two alleles at either site.
Options to apply filters on sites based on allelic frequencies (to exclude sites with too unbalanced frequencies, which are less informative) and the number of available data.
List of positions of sites provided in the alignment (by default, the index of sites is used).
The function requires two mandatory arguments: an Align
instance and
the list of statistics: d
, D
, Dp
, r
, and rsq
, as listed
above.
In this example, we consider a very simple alignment of 10 sequences and
10 sites (of which only 3 are variable), saved in a Fasta-formatted file
named align7.fas
:
>sample01
CACATGTGGA
>sample02
CACATGTGGA
>sample03
CAGATGTGGT
>sample04
CAGATGTGGT
>sample05
CACATGTAGT
>sample06
CACATGTAGT
>sample07
CAGATGTGGT
>sample08
CAGATGTGGT
>sample09
CACATGTGGT
>sample10
CACATGTGGA
The code below demonstrates the usage of stats.matrix_LD()
and will help
describe its return value:
>>> aln = egglib.io.from_fasta('align7.fas', egglib.alphabets.DNA)
>>> print(egglib.stats.matrix_LD(aln, ('d', 'rsq')))
([2.0, 7.0, 9.0], [[None], [[5.0, 0.16666666666666652], None],
[[7.0, 0.28571428571428575], [2.0, 0.10714285714285716], None]])
The usage is fairly straighforward. Just remember to specify the list
of statistics you want to be computed (in this case, we requested d
and rsq
) and consider if some of the optional arguments are needed.
The return value is a tuple
of two lists. The first list is
the position of sites that have been included in the analysis, in this
case the three polymorphic sites, which appear to be at positions 2, 7,
and 9 (there is currently an automatic conversion to floating-point
number but this may change in the future).
The second item of the return value is a nested list, containing the lower half-matrix with requested values. The additional example below shows how to collect results in such as way that we loop over all pairs of sites:
>>> pos, mat = egglib.stats.matrix_LD(aln, ('d', 'rsq'))
>>> n = len(pos)
>>> for i in range(n):
... for j in range(i):
... p1 = pos[i]
... p2 = pos[j]
... d = mat[i][j][0]
... r2 = mat[i][j][1]
... print('pos:', p1, p2, 'd:', d, 'r2:', r2)
...
pos: 7.0 2.0 d: 5.0 r2: 0.166666666667
pos: 9.0 2.0 d: 7.0 r2: 0.285714285714
pos: 9.0 7.0 d: 2.0 r2: 0.107142857143
Note that there is a “gotcha” with this function. If one requests a
class:!list of statistics as the stats argument, the values in
the half-matrix provided as the second return value are a list
of values, even if only one statistic is requested, as in:
>>> print(egglib.stats.matrix_LD(aln, ['rsq']))
([2.0, 7.0, 9.0], [[None], [[0.16666666666666652], None], [[0.28571428571428575], [0.10714285714285716], None]])
However, if one passes a statistic code (not an iterable) as value for the stat option, the items in the returned half-matrix are statistic values, not embedded in a list:
>>> print(egglib.stats.matrix_LD(aln, 'rsq'))
([2.0, 7.0, 9.0], [[None], [0.16666666666666652, None], [0.28571428571428575, 0.10714285714285716, None]])
The diagonal values are always None
.
EHH statistics¶
Extended haplotype homozygosity (EHH) statistics are one a set of
statistics specifically developed with the aim of exploiting
large-scale sequencing data. The documentation for the class
stats.EHH
provides all details and the list of literature
references. In this section we provide a quick overview of the usage of
this class.
EHH would be too complex to be included among statistics proposed by
stats.ComputeStats
. It is the reason why it has a class of its own.
To use it, one must therefore first create a class instance:
>>> ehh = egglib.stats.EHH()
Loading the core site¶
The first step consists in loading a core site, which will be used as reference
for all computations. In the original paper, it was a non-recombining region for
which haplotypes have been determined. It can as well be a two-allele SNP marker.
Assume that we have a file named sites1.txt
, of which the first line is:
0000000000000000010002000000010001001000000000000001000000002000000100000000101000000101000010010010
This is one way to represent haplotypic data for 100 individuals in a simple manner.
We will use simple code to import this string of haplotype identifiers as a
Site
:
>>> f = open('sites1.txt')
>>> core = list(map(int,f.readline().strip()))
>>> site = egglib.Site()
>>> site.from_list(core, egglib.alphabets.positive_infinite)
There is a method to set a Site
instance as the core site for
EHH
, but first we need to set its position because it will be
needed to compute EHH statistics. But, when created from a list,
Site
instances don’t have a position. Here we will record
distances relatively to the core site so we set its position to 0.
>>> site.position = 0
>>> ehh.set_core(site)
This method takes an array of options, one of which (min_freq) controlling the frequency threshold (to exclude low-frequency haplotypes from the analysis). But we don’t use it in this example.
After the core site has been loaded, basic statistics can be accessed:
>>> print(ehh.num_haplotypes)
3
>>> print(ehh.nsam)
100
>>> print([ehh.nsam_core(i) for i in range(ehh.num_haplotypes)])
[85, 13, 2]
These three lines show, respectively:
num_haplotypes
, the number of core haplotypes (if haplotypes have been excluded due to the min_freq option, this property gives the number of included haplotypes).nsam
, the total number of samples (only considering included haplotypes).nsam_core()
, the absolute frequency (at the core site) of a core haplotype.
Below we show the initial value of some of the EHH statistics (they actually have a value after loading the core site, but it is only after loading other sites that they can be interpreted):
>>> print(ehh.cur_haplotypes)
3
>>> print(ehh.get_EHH(0))
1.0
>>> print(ehh.get_EHH(1))
1.0
>>> print(ehh.get_EHH(2))
1.0
>>> print(ehh.get_rEHH(0))
1.32911392405
>>> print(ehh.get_iHH(0))
0.0
>>> print(ehh.get_iHS(0))
None
>>> print(ehh.get_EHHS())
1.0
>>> print(ehh.get_iES())
0.0
The table below lists the statistics shown (more are available):
Statistic |
Meaning |
Initial value |
---|---|---|
|
current number of haplotypes |
equal to |
|
EHH value for haplotype |
equal to 1 |
|
Ratio of EHH and EHHc for haplotype |
not defined |
|
Integrated value of EHH |
equal to 0 |
|
Ratio of iHH and its complement iHHc |
not computable because both terms are 0 |
|
Site-wise EHHS |
equal to 1 |
|
Integrated value of iES |
equal to 0 |
Loading the distant sites¶
After this, we can start to load other sites (referred to as distant sites).
Distant sites are classically SNP markers located at increasing distance from the
core site (the increasing distance is required). Here we continue reading the
file sites1.txt
which contains, after the core site, biallelic sites
formatted as shown below, taking the first 5 sites as example:
0000000000000000010000000000010001001000000000000001000000000000000100000000100000000101000010000000
0000100001000100000000000000000010000100001001001000100000100001100000100000000010000010100000000100
0000000000000000010000000000010001001000000000000001000000000000000100000000101000000101000010010010
0000100001000100010000000000010001001000000000000001000000000000000100100000101000000101000010010010
0000000000100000000000000000000000000000000000000000000000000000000000000100000000000000000100000000
The code below processes the next site in file (using a similar syntax than for the core site)
and pass it to the stats.EHH
instance. The positon of the new site must also be set
(when recycling the Site
instance, its distance is reset to None
so we need
to set it to a proper value, say 0.1.:
>>> distant = list(map(int,f.readline().strip()))
>>> site.from_list(distant, egglib.alphabets.positive_infinite)
>>> site.position = 0.1
>>> ehh.load_distant(site)
>>> print(ehh.cur_haplotypes)
4
>>> print(ehh.get_EHH(0))
1.0
>>> print(ehh.get_EHH(1))
0.6153846153846154
>>> print(ehh.get_EHH(2))
1.0
>>> print(ehh.get_rEHH(0))
2.14285714286
>>> print(ehh.get_iHH(0))
0.1
>>> print(ehh.get_iHS(0))
-0.495077266798
>>> print(ehh.get_EHHS())
0.991778569471
>>> print(ehh.get_iES())
0.0995889284736
We can observe that there is already an additional haplotype, meaning that the first distant site already modifies the sample partition. It concerns the second haplotype, whose EHH value drops to ~0.6. The rEHH value of the first haplotype has increased because the complement of the first haplotype is affected by the new haplotype. iHH is 0.1 (integration of EHH over the distance arbitrarily set to 0.1). The value of iHS is negative, which is possible because it is the logarithm of a ratio. The whole-site EHHS value has decreased because it accounts for data at the whole site.
The code below reads all the other sites of the file and displays final values, always using an increment of 0.1 for the position of each site:
>>> for i, line in enumerate(f):
... site.from_list(list(map(int,line.strip())), egglib.alphabets.positive_infinite)
... site.position = 0.2 + i / 10.0
... ehh.load_distant(site)
...
>>> print(ehh.cur_haplotypes)
42
>>> print(ehh.get_EHH(0))
0.0392156862745098
>>> print(ehh.get_EHH(1))
0.2058823529411765
>>> print(ehh.get_EHH(2))
1.566554621848739
>>> print(ehh.get_rEHH(0))
0.7030486479143488
>>> print(ehh.get_iHH(0))
0.04384762948753081
>>> print(ehh.get_iHS(0))
0.703048647914
>>> print(ehh.get_EHHS())
0.04384762948753081
>>> print(ehh.get_iES())
1.623691422307482
For more details about controlling at what point integration of EHH statistics
should stop, or management of missing data, see the class’s manual
.