Unphased sites statistics
The following statistics are computed over a set of sites but do not require the sites to be phased. Most of them apply to an alignment of DNA sequences or to a set of single-nucleotide polymorphism markers.
They are computed by process_align() and process_sites() of ComputeStats, as well as process_freq() and process_site() in the multiple site mode.
| Code | Definition | Equation | Requirement | Notes |
|---|---|---|---|---|
| | Average number of exploitable samples | | | 1 |
| | Number of analysed sites | | | 2 |
| | Maximal number of available samples per site | | | |
| | Number of segregating sites | | | |
| | Number of sites with one singleton allele | | | |
| | Minimal number of mutations | | | 3 |
| | Polymorphic sites | | | 5 |
| | Sites with one singleton allele | | | 5 |
| | Number of alleles per polymorphic site | | | |
| | Allelic frequencies per polymorphic site | | | |
| | Population allelic frequencies per polymorphic site | | | |
| | Watterson’s estimator of \(\theta\) | | | 4 |
| | Nucleotide diversity | | | 4 |
| | Number of analysed orientable sites | | Outgroup | |
| | Average number of exploitable samples at orientable sites | | Outgroup | |
| | Maximal number of available samples per orientable site | | Outgroup | 2 |
| | Orientable polymorphic sites | | Outgroup | 5 |
| | Orientable sites with one singleton allele | | Outgroup | 5 |
| | Number of segregating orientable sites | | Outgroup | |
| | Number of orientable sites with one singleton allele | | Outgroup | |
| | Number of sites with one derived singleton allele | | Outgroup | |
| | Minimal number of mutations at orientable sites | | Outgroup | 3 |
| | Tajima’s D | | | |
| | Tajima’s D using \(\eta\) instead of \(S\) | | | |
| | Fu and Li’s D | | Outgroup | |
| | Fu and Li’s F | | Outgroup | |
| | Fu and Li’s D* | | | |
| | Fu and Li’s F* | | | |
| | Number of sites used for the MFDM test | | Outgroup | |
| | P-value of MFDM test | | Outgroup | |
| | Pi using orientable sites | | Outgroup | 4 |
| | Fay and Wu’s estimator of \(\theta\) | | Outgroup | 4 |
| | Zeng et al.’s estimator of \(\theta\) | | Outgroup | 4 |
| | Fay and Wu’s H (unstandardized) | | Outgroup | |
| | Fay and Wu’s H (standardized) | | Outgroup | |
| | Zeng et al.’s E | | Outgroup | |
| | Pairwise distance | | Two populations | 6 |
| | Net pairwise distance | | Two populations | 6 |
Notes:
1. The number of exploitable samples may vary between sites due to missing data.
2. Number of sites considered for polymorphism detection, after discarding sites with too many missing data in the case of the method process_align() (controlled by the parameter max_missing). Sites with fewer than two non-missing samples are always discarded.
3. This value is properly computed even if sites with more than two alleles are excluded.
4. Provided per gene (must be divided by lseff or lseffo to be expressed per site).
5. Returned as a list containing the index of all concerned sites.
6. Only computed if there are exactly two populations.
Level of diversity
The so-called Watterson’s estimator of \(\theta\) (theta_W) is described in Watterson (Theor. Popul. Biol. 1975, 7:256-276).
where n is equal to nseff rounded to the closest integer.
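As an illustration, here is a minimal sketch of Watterson's estimator in its standard form, \(S / a_n\) with \(a_n = \sum_{i=1}^{n-1} 1/i\). The function name `watterson_theta` and its arguments are hypothetical and are not part of the library. The sketch assumes that \(S\) is the number of segregating sites and that the result is expressed per gene.

```python
def watterson_theta(S, nseff):
    """Standard Watterson estimator, per gene: S / a_n.

    S     -- number of segregating sites
    nseff -- average number of exploitable samples (may be fractional)

    Hypothetical helper written for illustration only.
    """
    # n is nseff rounded to the closest integer. Note that Python's
    # round() sends exact halves to the nearest even integer.
    n = round(nseff)
    if n < 2:
        raise ValueError("at least two samples are required")
    # a_n is the (n-1)th harmonic number: 1 + 1/2 + ... + 1/(n-1)
    a_n = sum(1.0 / i for i in range(1, n))
    return S / a_n
```

Dividing the returned value by lseff would express it per site, as indicated in note 4.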
Nucleotide diversity (Pi) is given by:
where \(p_{i,j}\) is the relative frequency of allele j at site i and \(n_i\) is the number of exploitable samples at site i.
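The sketch below shows one common way to compute nucleotide diversity from the quantities just defined. It assumes that Pi is the sum over sites of the unbiased expected heterozygosity, \(\frac{n_i}{n_i - 1}\left(1 - \sum_j p_{i,j}^2\right)\), which is the usual estimator but is an assumption here. The function name and the input layout (one list of allele counts per site) are hypothetical.

```python
def nucleotide_diversity(site_counts):
    """Sum over sites of n_i / (n_i - 1) * (1 - sum_j p_ij ** 2).

    site_counts -- one list of absolute allele counts per site,
                   e.g. [[3, 1], [2, 2], [4]]

    Hypothetical helper; the result is per gene (not divided by
    the number of sites).
    """
    pi = 0.0
    for counts in site_counts:
        n_i = sum(counts)  # exploitable samples at this site
        if n_i < 2:
            continue  # sites with fewer than two samples are skipped
        sum_sq = sum((c / n_i) ** 2 for c in counts)
        pi += n_i / (n_i - 1) * (1.0 - sum_sq)
    return pi
```

For a single site with allele counts 3 and 1, the function returns 0.5, which is the proportion of the six sample pairs that differ.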
Tajima’s D
Tajima’s D (Genetics 1989 123:585-595) is computed as follows:
where the variance is computed as follows:
A variant is available where \(\eta\) (the minimal number of mutations) is used instead of \(S\):
with:
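The following sketch implements Tajima's D with the constants from Tajima (1989). It is an illustration under stated assumptions, not the library's code: the function name is hypothetical, `pi` and `S` are taken per gene, and the statistic is reported as undefined (`None`) when there are fewer than four samples or no segregating site, because the variance is then zero.

```python
import math


def tajima_d(pi, S, nseff):
    """Tajima's D = (pi - S / a1) / sqrt(e1 * S + e2 * S * (S - 1)).

    pi    -- nucleotide diversity (per gene)
    S     -- number of segregating sites (pass eta for the variant)
    nseff -- average number of exploitable samples

    Hypothetical helper written for illustration only.
    """
    n = round(nseff)  # closest integer (halves go to the even integer)
    if n < 4 or S < 1:
        return None  # variance is zero: D is undefined
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * S + e2 * S * (S - 1)
    return (pi - S / a1) / math.sqrt(variance)
```

Passing \(\eta\) in place of `S` gives the variant based on the minimal number of mutations.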
Fu and Li’s tests with an outgroup
Fu and Li (Genetics 1993 133:693-709) proposed alternatives to Tajima’s D computed as follows:
and
where \(\eta\) is the minimal number of mutations at orientable sites, \(\eta_e\) is the total number of singletons at orientable sites, and \(H_{e_i}\) is the heterozygosity at site \(i\).
(If \(n\) is equal to 2, \(c_n\) is set to 1.)
The variance for F is computed as follows:
Variables are computed as for Tajima’s D but considering only orientable sites.
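A minimal sketch of the D statistic follows, using the formulation of Fu and Li (1993): \(D = (\eta - a_n \eta_e) / \sqrt{u_D \eta + v_D \eta^2}\), with \(v_D = 1 + \frac{a_n^2}{b_n + a_n^2}\left(c_n - \frac{n+1}{n-1}\right)\) and \(u_D = a_n - 1 - v_D\). The function name and the handling of undefined cases (returning `None`) are assumptions made for illustration.

```python
import math


def fu_li_d(eta, eta_e, nseffo):
    """Fu and Li's D with an outgroup (Fu and Li 1993 formulation).

    eta    -- minimal number of mutations at orientable sites
    eta_e  -- number of singletons at orientable sites
    nseffo -- average number of samples at orientable sites

    Hypothetical helper written for illustration only.
    """
    n = round(nseffo)  # closest integer (halves go to the even integer)
    if n < 2 or eta < 1:
        return None
    a_n = sum(1.0 / i for i in range(1, n))
    b_n = sum(1.0 / i ** 2 for i in range(1, n))
    if n == 2:
        c_n = 1.0  # special case stated in the text above
    else:
        c_n = 2.0 * (n * a_n - 2.0 * (n - 1)) / ((n - 1) * (n - 2))
    v_d = 1.0 + a_n ** 2 / (b_n + a_n ** 2) * (c_n - (n + 1) / (n - 1))
    u_d = a_n - 1.0 - v_d
    variance = u_d * eta + v_d * eta ** 2
    if variance <= 0:
        return None  # happens for n = 2, where u_D = v_D = 0
    return (eta - a_n * eta_e) / math.sqrt(variance)
```

An excess of singletons relative to the neutral expectation makes the numerator, and therefore D, negative.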
Fu and Li’s tests without outgroup
The following tests don’t require an outgroup:
\(n\) is nseff rounded to the closest integer, as for Tajima’s D, and \(\eta_e\) is the total number of singletons.
Expressions for \(v_F\) and \(u_F\) are given by Simonsen et al. (Genetics 1995 141:413-429).
MFDM test
The P-value of the MFDM (maximum frequency of derived mutation) test (Li Mol. Biol. Evol. 2011 28:365-375) is computed as follows:
where \(n_i\) is the number of available samples at site \(i\) and \(d_{i,j}\) is the absolute frequency of the derived allele \(j\), assuming that its frequency is more than half of the sample. If no site has a derived allele more frequent than half of the sample, the P-value is set to 1. If no site has a derived allele at least as frequent as half of the sample, the P-value is undefined.
Neutrality tests with an outgroup
These statistics were defined by Fay and Wu (Genetics 2000 155:1405-1413) and Zeng et al. (Genetics 2006 174:1431-1439).
Three additional \(\theta\) estimators are defined based on orientable sites:
where \(n_{max}\) is the maximal number of exploitable samples over orientable sites, and \(S_i\) is the number of derived alleles (aggregating all alleles from all considered sites) which have been found in i copies.
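The three estimators can be sketched from the unfolded site frequency spectrum using their standard definitions: \(\hat\theta_\pi = \sum_i \frac{2 i (n - i) S_i}{n (n - 1)}\), \(\hat\theta_H = \sum_i \frac{2 i^2 S_i}{n (n - 1)}\) and \(\hat\theta_L = \sum_i \frac{i S_i}{n - 1}\), with \(n = n_{max}\). The function name and the dictionary input are hypothetical, and the sketch assumes that every derived allele count lies between 1 and \(n_{max} - 1\).

```python
def theta_estimators(sfs, n_max):
    """Return (theta_pi, theta_H, theta_L) from an unfolded spectrum.

    sfs   -- dict mapping i (copies of the derived allele) to S_i,
             the number of derived alleles found in i copies
    n_max -- maximal number of exploitable samples over orientable sites

    Hypothetical helper written for illustration only; values are
    per gene.
    """
    if n_max < 2:
        raise ValueError("at least two samples are required")
    sum_pi = sum_h = sum_l = 0.0
    for i, s_i in sfs.items():
        if not 1 <= i < n_max:
            raise ValueError("derived allele count out of range: %r" % i)
        sum_pi += 2.0 * i * (n_max - i) * s_i
        sum_h += 2.0 * i * i * s_i
        sum_l += i * s_i
    pairs = n_max * (n_max - 1)
    return sum_pi / pairs, sum_h / pairs, sum_l / (n_max - 1)
```

With these definitions, \(\hat\theta_L\) is always the mean of \(\hat\theta_\pi\) and \(\hat\theta_H\), which gives a quick consistency check on the three values.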
The following test statistics are defined. First, the non-standardized H statistic of Fay and Wu:
Second, the standardized version of the above:
with the numerator variance estimated as follows:
where \(\eta_o\) is equal to etao, the total number of mutations at orientable sites, \(n_o\) is equal to nseffo, the average number of samples at orientable sites, and \(n_o'\) is nseffo rounded to the closest integer.
Third, the E statistic:
with:
Pairwise population distance
The pairwise distance is computed as follows (Nei 1987 Molecular Evolutionary Genetics):
The net pairwise distance is computed as follows:
where \(L\) is the number of sites, \(k_i\) is the number of alleles at site \(i\), \(p_{ijk}\) is the relative frequency of allele \(j\) of site \(i\) in population \(k\), and \(\pi_k\) is \(\pi\) for population \(k\). These statistics are only computed for a pair of populations.
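The sketch below illustrates one reading of these definitions, following Nei (1987): the pairwise distance is the probability that two alleles drawn from different populations differ, averaged over sites, and the net distance subtracts the mean within-population diversity. The function name and the input layout are hypothetical. Averaging over the \(L\) sites and passing \(\pi_k\) on the same per-site scale are assumptions, so the scaling may differ from the values returned by the library.

```python
def pairwise_distances(freqs_1, freqs_2, pi_1, pi_2):
    """Return (Dxy, Da) for two populations.

    freqs_1, freqs_2 -- freqs_k[i][j] is the relative frequency of
                        allele j at site i in population k; both use
                        the same site and allele order
    pi_1, pi_2       -- within-population diversity, assumed to be on
                        the same (per-site) scale as Dxy

    Hypothetical helper written for illustration only.
    """
    if len(freqs_1) != len(freqs_2) or not freqs_1:
        raise ValueError("both populations need the same non-empty sites")
    L = len(freqs_1)
    total = 0.0
    for site_1, site_2 in zip(freqs_1, freqs_2):
        # probability that one allele from each population is identical
        identity = sum(p * q for p, q in zip(site_1, site_2))
        total += 1.0 - identity
    dxy = total / L
    da = dxy - (pi_1 + pi_2) / 2.0
    return dxy, da
```

For two sites, one fixed for different alleles in each population and one with both alleles at frequency 0.5 in both populations, Dxy is (1 + 0.5) / 2 = 0.75.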