Unphased sites statistics
The following statistics are computed over a set of sites but do not require the sites to be phased. Most of them are applicable to an alignment of DNA sequences (or to a set of single-nucleotide markers).
They are computed by process_align() and process_sites() of stats.ComputeStats, as well as by process_freq() and process_site() in the multiple site mode.
| Code | Definition | Equation | Requirement | Notes |
|---|---|---|---|---|
| nseff | Average number of exploitable samples | | | 1 |
| lseff | Number of analysed sites | | | 2 |
| | Maximal number of available samples per site | | | |
| | Number of segregating sites | | | |
| | Number of sites with one singleton allele | | | |
| | Minimal number of mutations | | | 3 |
| | Polymorphic sites | | | 5 |
| | Sites with one singleton allele | | | 5 |
| theta_W | Watterson’s estimator of \(\theta\) | | | 4 |
| Pi | Nucleotide diversity | | | 4 |
| lseffo | Number of analysed orientable sites | | Outgroup | |
| nseffo | Average number of exploitable samples at orientable sites | | Outgroup | |
| | Maximal number of available samples per orientable site | | Outgroup | 2 |
| | Orientable polymorphic sites | | Outgroup | 5 |
| | Orientable sites with one singleton allele | | Outgroup | 5 |
| | Number of segregating orientable sites | | Outgroup | |
| | Number of orientable sites with one singleton allele | | Outgroup | |
| | Number of sites with one derived singleton allele | | Outgroup | |
| etao | Minimal number of mutations at orientable sites | | Outgroup | 3 |
| | Tajima’s D | | | |
| | Tajima’s D using \(\eta\) | | | |
| | Fu and Li’s D | | Outgroup | |
| | Fu and Li’s F | | Outgroup | |
| | Fu and Li’s D* | | | |
| | Fu and Li’s F* | | | |
| | Number of sites used for the MFDM test | | Outgroup | |
| | P-value of the MFDM test | | Outgroup | |
| | Pi using orientable sites | | Outgroup | 4 |
| | Fay and Wu’s estimator of \(\theta\) | | Outgroup | 4 |
| | Zeng et al.’s estimator of \(\theta\) | | Outgroup | 4 |
| | Fay and Wu’s H (unstandardized) | | Outgroup | |
| | Fay and Wu’s H (standardized) | | Outgroup | |
| | Zeng et al.’s E | | Outgroup | |
| | Pairwise distance | | Two populations | 6 |
| | Net pairwise distance | | Two populations | 6 |
Notes:
1. The number of exploitable samples may vary between sites due to missing data.
2. Number of sites considered for polymorphism detection, after discarding sites with too much missing data when using process_align() (controlled by the parameter max_missing). Sites with fewer than two non-missing samples are always discarded.
3. This value is properly computed even if sites with more than two alleles are excluded.
4. Provided per gene (must be divided by lseff or lseffo to be expressed per site).
5. Returned as a list containing the index of all concerned sites.
6. Only computed if there are exactly two populations.
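Level of diversity
The so-called Watterson’s estimator of \(\theta\) (theta_W) is described in Watterson (Theor. Popul. Biol. 1975, 7:256-276):
\[\hat{\theta}_W = \frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}\]
where \(S\) is the number of segregating sites and \(n\) is equal to nseff rounded to the closest integer.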
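As an illustration, Watterson's estimator in its standard form (the number of segregating sites divided by the harmonic sum \(\sum_{i=1}^{n-1} 1/i\)) can be sketched in a few lines of Python. This is a minimal sketch and not the library's implementation: the function names `harmonic_sum` and `watterson_theta` are hypothetical, and the handling of exact .5 ties when rounding nseff is an assumption.

```python
def harmonic_sum(n):
    """Return a_n, the sum of 1/i for i from 1 to n - 1."""
    return sum(1.0 / i for i in range(1, n))


def watterson_theta(num_segregating, nseff):
    """Sketch of Watterson's estimator, expressed per gene.

    num_segregating -- number of segregating sites (S)
    nseff -- average number of exploitable samples (may be fractional)
    """
    # Nearest integer; note that Python's round() sends exact .5 ties
    # to the even neighbour, which may differ from other conventions.
    n = round(nseff)
    if n < 2:
        return None  # undefined with fewer than two samples
    return num_segregating / harmonic_sum(n)


# Hypothetical values: 16 segregating sites, 9.7 samples on average.
theta = watterson_theta(16, 9.7)  # n is rounded to 10
print(theta)
```

Dividing the result by the number of analysed sites gives the per-site value.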
Nucleotide diversity (Pi) is given by:
\[\pi = \sum_{i} \frac{n_i}{n_i - 1} \left( 1 - \sum_{j} p_{i,j}^2 \right)\]
where \(p_{i,j}\) is the relative frequency of allele j at site i and \(n_i\) is the number of exploitable samples at site i.
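The per-site contribution to nucleotide diversity is the unbiased heterozygosity, \(\frac{n_i}{n_i-1}(1 - \sum_j p_{i,j}^2)\), summed over sites. The sketch below assumes each site is supplied as a dictionary of absolute allele counts with missing data already removed; the function name and the input layout are hypothetical and chosen only for illustration.

```python
def nucleotide_diversity(sites):
    """Sum of unbiased heterozygosity over sites (per-gene value).

    sites -- list of dicts mapping each allele to its absolute count
             at that site (missing data excluded beforehand)
    """
    total = 0.0
    for counts in sites:
        n_i = sum(counts.values())  # exploitable samples at this site
        if n_i < 2:
            continue  # a site with fewer than two samples is skipped
        sum_sq = sum((c / n_i) ** 2 for c in counts.values())
        total += n_i / (n_i - 1) * (1.0 - sum_sq)
    return total


# Hypothetical data: three sites, four samples each.
example_sites = [{"A": 3, "G": 1}, {"C": 4}, {"T": 2, "C": 2}]
print(nucleotide_diversity(example_sites))
```

Because \(n_i\) is taken per site, sites with different amounts of missing data are each weighted by their own sample size.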
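Tajima’s D
Tajima’s D (Genetics 1989 123:585-595) is computed as follows:
\[D = \frac{\pi - S / a_1}{\sqrt{V}}\]
where the variance is computed as follows:
\[V = e_1 S + e_2 S (S - 1)\]
\[a_1 = \sum_{i=1}^{n-1} \frac{1}{i}, \quad a_2 = \sum_{i=1}^{n-1} \frac{1}{i^2}\]
\[b_1 = \frac{n+1}{3(n-1)}, \quad b_2 = \frac{2(n^2+n+3)}{9n(n-1)}\]
\[c_1 = b_1 - \frac{1}{a_1}, \quad c_2 = b_2 - \frac{n+2}{a_1 n} + \frac{a_2}{a_1^2}\]
\[e_1 = \frac{c_1}{a_1}, \quad e_2 = \frac{c_2}{a_1^2 + a_2}\]
A variant is available where \(\eta\) (the minimal number of mutations) is used instead of \(S\):
\[D_\eta = \frac{\pi - \eta / a_1}{\sqrt{V_\eta}}\]
with:
\[V_\eta = e_1 \eta + e_2 \eta (\eta - 1)\]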
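As an illustration, Tajima's D can be sketched in Python from \(\pi\), \(S\) and \(n\), using the constants published by Tajima (1989). This is a sketch under stated assumptions rather than the library's code: the function name `tajima_d` is hypothetical, and returning `None` when the statistic is undefined is a choice made here. Passing \(\eta\) in place of \(S\) gives the variant described above.

```python
import math


def tajima_d(pi, num_mutations, n):
    """Sketch of Tajima's D.

    pi -- nucleotide diversity, per gene
    num_mutations -- S (or eta for the variant)
    n -- number of samples, already rounded to an integer
    """
    # With fewer than four samples the variance term is zero,
    # so the statistic is treated as undefined in this sketch.
    if n < 4 or num_mutations == 0:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * num_mutations + e2 * num_mutations * (num_mutations - 1)
    if variance <= 0:
        return None
    return (pi - num_mutations / a1) / math.sqrt(variance)


# Hypothetical values: 10 samples, 16 segregating sites, pi = 35/9.
print(tajima_d(35.0 / 9.0, 16, 10))
```

A negative value indicates that \(\pi\) is smaller than the estimate based on the number of segregating sites, as expected with an excess of rare variants.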
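Fu and Li’s tests with an outgroup
Fu and Li (Genetics 1993 133:693-709) proposed alternatives to Tajima’s D computed as follows:
\[D = \frac{\eta - a_n \eta_e}{\sqrt{u_D \eta + v_D \eta^2}}\]
and
\[F = \frac{\sum_i H_{e_i} - \eta_e}{\sqrt{u_F \eta + v_F \eta^2}}\]
where \(\eta\) is the minimal number of mutations at orientable sites, \(\eta_e\) is the total number of singletons at orientable sites, and \(H_{e_i}\) is the heterozygosity at site \(i\).
The variance terms for D are computed as follows:
\[a_n = \sum_{k=1}^{n-1} \frac{1}{k}, \quad b_n = \sum_{k=1}^{n-1} \frac{1}{k^2}, \quad c_n = \frac{2 \left[ n a_n - 2 (n-1) \right]}{(n-1)(n-2)}\]
\[v_D = 1 + \frac{a_n^2}{b_n + a_n^2} \left( c_n - \frac{n+1}{n-1} \right), \quad u_D = a_n - 1 - v_D\]
(If \(n\) is equal to 2, \(c_n\) is set to 1.)
The variance for F is computed as follows:
\[v_F = \frac{c_n + \frac{2(n^2+n+3)}{9n(n-1)} - \frac{2}{n-1}}{a_n^2 + b_n}\]
\[u_F = \frac{1 + \frac{n+1}{3(n-1)} - \frac{4(n+1)}{(n-1)^2} \left( a_{n+1} - \frac{2n}{n+1} \right)}{a_n} - v_F\]
with \(a_{n+1} = a_n + 1/n\).
Variables are computed as for Tajima’s D but considering only orientable sites.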
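To make the role of the constants concrete, here is a minimal Python sketch of Fu and Li's D with an outgroup, based on the expressions published by Fu and Li (1993): \(D = (\eta - a_n \eta_e) / \sqrt{u_D \eta + v_D \eta^2}\). The function name `fu_li_d` is hypothetical, the inputs are assumed to be restricted to orientable sites already, and returning `None` for undefined cases is a choice made for this sketch.

```python
import math


def fu_li_d(eta, eta_e, n):
    """Sketch of Fu and Li's D (outgroup version).

    eta -- minimal number of mutations at orientable sites
    eta_e -- number of singletons at orientable sites
    n -- number of samples, already rounded to an integer
    """
    if n < 2 or eta == 0:
        return None
    a_n = sum(1.0 / k for k in range(1, n))
    b_n = sum(1.0 / k ** 2 for k in range(1, n))
    if n == 2:
        c_n = 1.0  # special case stated in the text
    else:
        c_n = 2.0 * (n * a_n - 2.0 * (n - 1)) / ((n - 1) * (n - 2))
    v_d = 1.0 + a_n ** 2 / (b_n + a_n ** 2) * (c_n - (n + 1) / (n - 1.0))
    u_d = a_n - 1.0 - v_d
    variance = u_d * eta + v_d * eta ** 2
    if variance <= 0:
        return None  # e.g. n = 2, where every mutation is a singleton
    return (eta - a_n * eta_e) / math.sqrt(variance)


# Hypothetical values: 10 samples, 16 mutations, 10 of them singletons.
print(fu_li_d(16, 10, 10))
```

With more singletons than expected under neutrality the numerator is negative, which is the pattern the test is designed to detect.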
Fu and Li’s tests without outgroup
The following tests don’t require an outgroup:
\(n\) is nseff rounded to the closest integer as for Tajima’s D, and \(\eta_e\) is the total number of singletons.
Expressions for \(v_F\) and \(u_F\) are given by Simonsen et al. (Genetics 1995 141:413-429).
MFDM test
The P-value of the MFDM (maximum frequency of derived mutation) test (Li Mol. Biol. Evol. 2011 28:365-375) is computed as follows:
where \(n_i\) is the number of available samples at site i and \(d_{i,j}\) is the absolute frequency of the derived allele j, assuming that its frequency is more than half of the sample. If no site has a derived allele more frequent than half of the sample, the P-value is set to 1. If no site has a derived allele at least as frequent as half of the sample, the P-value is undefined.
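Neutrality tests with an outgroup
The following statistics are defined by Fay and Wu (Genetics 2000 155:1405-1413) and Zeng et al. (Genetics 2006 174:1431-1439).
Three additional \(\theta\) estimators are defined based on orientable sites, namely nucleotide diversity (\(\pi_o\)), Fay and Wu’s estimator (\(\theta_H\)) and Zeng et al.’s estimator (\(\theta_L\)):
\[\pi_o = \sum_{i=1}^{n_{max}-1} \frac{2 i (n_{max} - i)}{n_{max} (n_{max} - 1)} S_i\]
\[\theta_H = \sum_{i=1}^{n_{max}-1} \frac{2 i^2}{n_{max} (n_{max} - 1)} S_i\]
\[\theta_L = \sum_{i=1}^{n_{max}-1} \frac{i}{n_{max} - 1} S_i\]
where \(n_{max}\) is the maximal number of exploitable samples over orientable sites, and \(S_i\) is the number of derived alleles (aggregating all alleles from all considered sites) which have been found in i copies.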
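The three estimators are weighted sums over the derived-allele frequency spectrum. In their standard forms the weights for \(S_i\) are \(2i(n_{max}-i)/(n_{max}(n_{max}-1))\) for nucleotide diversity, \(2i^2/(n_{max}(n_{max}-1))\) for Fay and Wu's estimator, and \(i/(n_{max}-1)\) for Zeng et al.'s estimator. The sketch below assumes the spectrum is given as a dictionary mapping i to \(S_i\); the function name and input layout are hypothetical.

```python
def orientable_theta_estimators(derived_spectrum, n_max):
    """Return (pi, theta_H, theta_L) from a derived-allele spectrum.

    derived_spectrum -- dict mapping i (copies of the derived allele,
                        1 <= i <= n_max - 1) to S_i
    n_max -- maximal number of exploitable samples over orientable sites
    """
    if n_max < 2:
        return None
    pairs = n_max * (n_max - 1)
    pi = sum(2.0 * i * (n_max - i) * s
             for i, s in derived_spectrum.items()) / pairs
    theta_h = sum(2.0 * i * i * s
                  for i, s in derived_spectrum.items()) / pairs
    theta_l = sum(float(i) * s
                  for i, s in derived_spectrum.items()) / (n_max - 1)
    return pi, theta_h, theta_l


# Hypothetical spectrum: five derived singletons and one derived
# allele found in three copies, with four samples.
pi, theta_h, theta_l = orientable_theta_estimators({1: 5, 3: 1}, 4)
print(pi, theta_h, theta_l)
print(pi - theta_h)  # unstandardized H of Fay and Wu
```

With these weights, \(\pi - \theta_H\) is always twice \(\pi - \theta_L\), which is why the standardized H of Zeng et al. can be written with either difference.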
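The following test statistics are defined. First, the non-standardized H statistic of Fay and Wu, which is the difference between the nucleotide diversity \(\pi_o\) and Fay and Wu’s estimator \(\theta_H\), both computed on orientable sites:
\[H = \pi_o - \theta_H\]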
Second, the standardized version of the above:
with the numerator variance estimated as follows:
where \(\eta_o\) is equal to etao, the total number of mutations at orientable sites, \(n_o\) is equal to nseffo, the average number of samples at orientable sites, and \(n_o'\) is nseffo rounded to the closest integer.
Third, the E statistic:
with:
Pairwise population distance
The pairwise distance is computed following Nei (1987, Molecular Evolutionary Genetics):
The net pairwise distance is computed as follows:
where \(L\) is the number of sites, \(k_i\) is the number of alleles at site \(i\), \(p_{ijk}\) is the relative frequency of allele \(j\) of site \(i\) in population \(k\), and \(\pi_k\) is \(\pi\) for population \(k\). These statistics are only computed for a pair of populations.
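For illustration, the sketch below follows Nei's general definitions: the pairwise distance is taken as the per-site average probability that two alleles, one drawn from each population, differ, and the net distance subtracts the mean of the two within-population diversities. It is a sketch under assumptions, not the library's exact formula: the function names and the input layout (one dictionary of allele counts per site and per population) are hypothetical, and \(\pi_1\) and \(\pi_2\) are assumed to be expressed per site.

```python
def pairwise_distance(pop1_sites, pop2_sites):
    """Per-site average between-population difference (sketch).

    pop1_sites, pop2_sites -- lists of dicts mapping each allele to its
    absolute count at that site, one list per population, same length;
    every site is assumed to have at least one sample per population
    """
    if len(pop1_sites) != len(pop2_sites):
        raise ValueError("both populations must cover the same sites")
    num_sites = len(pop1_sites)
    if num_sites == 0:
        return None
    total = 0.0
    for counts1, counts2 in zip(pop1_sites, pop2_sites):
        n1 = sum(counts1.values())
        n2 = sum(counts2.values())
        # Probability that the two sampled alleles are identical.
        same = sum((c / n1) * (counts2.get(allele, 0) / n2)
                   for allele, c in counts1.items())
        total += 1.0 - same
    return total / num_sites


def net_pairwise_distance(dxy, pi1, pi2):
    """Net distance: between-population distance minus the average
    within-population diversity (pi1 and pi2 assumed per site)."""
    return dxy - (pi1 + pi2) / 2.0


# Hypothetical data: two sites, four samples per population.
pop1 = [{"A": 4}, {"A": 2, "G": 2}]
pop2 = [{"G": 4}, {"A": 4}]
dxy = pairwise_distance(pop1, pop2)
print(dxy, net_pairwise_distance(dxy, 0.1, 0.3))
```

The first site is a fixed difference and contributes 1, while the second contributes 0.5 because half of the alleles of the first population match the allele fixed in the second.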