Phased sites statistics¶
The following statistics are designed to be computed over a set of phased sites. Alleles within individuals must be phased as well.
They are computed by process_align() and process_sites() of stats.ComputeStats, as well as process_site() in the multiple site mode, but not process_freq().
| Code | Definition | Equation | Requirement | Notes |
|---|---|---|---|---|
| | Ramos-Onsins and Rozas’s \(R_2\) | | | |
| | Ramos-Onsins and Rozas’s \(R_3\) | | | |
| | Ramos-Onsins and Rozas’s \(R_4\) | | | |
| | Ramos-Onsins and Rozas’s \(Ch\) | | | |
| | Ramos-Onsins and Rozas’s \(R_{2E}\) | | Outgroup | 1 |
| | Ramos-Onsins and Rozas’s \(R_{3E}\) | | Outgroup | 1 |
| | Ramos-Onsins and Rozas’s \(R_{4E}\) | | Outgroup | 1 |
| | Ramos-Onsins and Rozas’s \(Ch_E\) | | Outgroup | 1 |
| | Wall’s \(B\) statistic | | | |
| | Wall’s \(Q\) statistic | | | |
| | Number of haplotypes (only ingroup) | | | |
| | Total number of haplotypes (including outgroup) | | | |
| | Hudson et al.’s \(F_{ST}\) | | Populations | |
| | Hudson et al.’s \(K_{ST}\) | | Populations | |
| | Hudson’s nearest neighbour statistic | | Populations | |
| | \(\bar{r}_d\) statistic | | | 2 |
| Rmin | Minimal number of recombination events | | | 3 |
| RminL | Number of sites used to compute Rmin | | | 3 |
| Rintervals | List of start/end positions of recombination intervals | | | 3 |
| | Number of allele pairs used for \(Z_{nS}\) and related statistics | | | |
| | Allele pairs at adjacent sites (used for \(ZZ\) and \(Z_A\)) | | | |
| | Kelly’s \(Z_{nS}\) | | | |
| | Kelly’s \(Z^*_{nS}\) | | | |
| | Kelly’s \(Z^*_{nS}{}^*\) | | | |
| | Rozas et al.’s \(Z_A\) | | | |
| | Rozas et al.’s \(ZZ\) | | | |
| | Fu’s \(F_S\) | | | |
Notes:
1. Based on mutations on external branches (that is, derived singletons) instead of all singletons.
2. Does not require that alleles within individuals are phased.
3. The minimal number of recombination events (Rmin) is computed after Hudson and Kaplan (Genetics 1985 111:147-164). Briefly, this number is equal to the minimal number of non-overlapping segments defined by incompatible sites (i.e. breaking the three-allele rule). Sites with missing data or with more than two alleles are skipped. The number of sites used for this analysis and the positions of those intervals are provided as RminL and Rintervals, respectively.
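As an illustration of note 3, here is a minimal Python sketch of the Hudson and Kaplan procedure. It is not the library's implementation. The function name `hudson_kaplan_rmin` and the input layout (a list of sites, each a list of alleles with `None` for missing data) are assumptions made for this example. Incompatibility is tested with the usual four-gamete criterion, and monomorphic sites are assumed to be left out of the site count.

```python
from itertools import combinations


def hudson_kaplan_rmin(sites):
    """Return (Rmin, number of sites used, list of (start, end) intervals).

    `sites` is a list of sites; each site is a list with one allele per
    sample, and None marks missing data (hypothetical layout).
    """
    # Skip sites with missing data or with a number of alleles other than two.
    usable = [idx for idx, site in enumerate(sites)
              if None not in site and len(set(site)) == 2]

    # A pair of sites is incompatible when all four gametes are observed.
    incompatible = []
    for a, b in combinations(usable, 2):
        if len(set(zip(sites[a], sites[b]))) == 4:
            incompatible.append((a, b))

    # Greedy selection of non-overlapping intervals, scanning by end position.
    incompatible.sort(key=lambda interval: (interval[1], -interval[0]))
    chosen = []
    last_end = None
    for start, end in incompatible:
        if last_end is None or start >= last_end:
            chosen.append((start, end))
            last_end = end
    return len(chosen), len(usable), chosen


example = [
    [0, 0, 1, 1],
    [0, None, 1, 1],   # skipped: missing data
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
rmin, n_used, intervals = hudson_kaplan_rmin(example)
print(rmin, n_used, intervals)
```

The greedy scan by end position yields the largest possible set of non-overlapping incompatible intervals, which is the quantity described in the note.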
Ramos-Onsins and Rozas’s test statistics¶
Ramos-Onsins and Rozas (Mol. Biol. Evol. 2002 19:2092-2100) developed several tests of neutrality based on singletons. \(R_2\), \(R_3\), and \(R_4\) are computed as:
with:
\(n\) the number of samples, \(S\) the number of segregating sites, \(k_i\) the number of alleles at site \(i\), \(S_i\) the number of singletons borne by the \(i\)th sample, and \(p_{ij}\) the relative frequency of allele \(j\) at site \(i\).
and \(Ch\) is computed as:
where \(U\) is the total number of singletons.
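For orientation, the sketch below computes \(R_2\) in the form given in the cited paper, \(R_2 = \frac{1}{S}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(S_i - \frac{k}{2}\right)^2}\), where \(k\) is the average number of pairwise differences. This is an assumption about the formula, and the library's exact expression may differ (for example in how \(k\) is derived from allele frequencies). The function name `ramos_onsins_rozas_r2` and the input (a list of equal-length haplotype strings without missing data) are hypothetical.

```python
from itertools import combinations
from math import sqrt


def ramos_onsins_rozas_r2(haplotypes):
    """R2 for a list of equal-length haplotype strings (no missing data)."""
    n = len(haplotypes)
    # Transpose to sites and keep only segregating ones.
    segregating = [site for site in zip(*haplotypes) if len(set(site)) > 1]
    n_seg = len(segregating)
    if n_seg == 0:
        return None  # undefined without segregating sites

    # Number of singletons carried by each sample.
    singletons = [0] * n
    for site in segregating:
        for sample, allele in enumerate(site):
            if site.count(allele) == 1:
                singletons[sample] += 1

    # Average number of pairwise differences.
    total_diff = sum(sum(a != b for a, b in zip(h1, h2))
                     for h1, h2 in combinations(haplotypes, 2))
    k = total_diff / (n * (n - 1) / 2)

    return sqrt(sum((u - k / 2) ** 2 for u in singletons) / n) / n_seg


print(ramos_onsins_rozas_r2(["AAA", "AAT", "ATA", "TAA"]))
```

Replacing the exponent 2 and the square root by 3 or 4 and the matching root gives the analogous sketches for \(R_3\) and \(R_4\).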
Wall’s statistics¶
Wall (Genet. Res. 1999 74:65-79) defined tests based on partitions of the sample defined by polymorphic sites:
\[B = \frac{B'}{S-1}\]
\[Q = \frac{B' + n_P}{S}\]
where \(B'\) is the number of pairs of adjacent polymorphic sites (considering only sites with no missing data and two alleles) that are congruent (that is, pairs for which only two haplotypes exist when both sites are considered together), \(n_P\) is the number of distinct partitions of the sample set defined by sites, and \(S\) is the number of sites considered in the analysis.
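A minimal sketch of these two statistics follows, assuming \(B = B'/(S-1)\) and \(Q = (B' + n_P)/S\), with \(n_P\) counted as in Wall's paper (distinct partitions induced by congruent adjacent pairs). The function name `wall_b_q` is hypothetical, and the input is assumed to be already restricted to polymorphic sites with exactly two alleles and no missing data.

```python
def wall_b_q(sites):
    """Return (B, Q) for a list of biallelic sites without missing data.

    Each site is a sequence with one allele per sample (hypothetical layout).
    """
    n_sites = len(sites)
    if n_sites < 2:
        return None  # B is undefined with fewer than two sites

    def partition(site):
        # Canonical bipartition of the samples: True where the allele matches
        # the first sample's allele, so both labelings map to the same tuple.
        return tuple(allele == site[0] for allele in site)

    partitions = [partition(site) for site in sites]

    b_prime = 0
    distinct = set()
    for left, right in zip(partitions, partitions[1:]):
        if left == right:  # congruent pair: only two haplotypes over both sites
            b_prime += 1
            distinct.add(left)

    return b_prime / (n_sites - 1), (b_prime + len(distinct)) / n_sites


b, q = wall_b_q([[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1], [0, 1, 0, 1]])
print(b, q)
```

For two-allele sites, a pair is congruent exactly when both sites split the samples into the same two groups, which is why comparing canonical partitions is sufficient here.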
Hudson’s differentiation statistics¶
Hudson et al. (Mol. Biol. Evol. 1992 9:138-151) proposed haplotype statistics based on Wright’s fixation index.
with:
where \(r\) is the number of populations, \(n\) is the total number of samples, \(n_i\) is the number of samples in population \(i\), \(K_i\) is the sum of the number of pairwise differences between all pairs of samples of population \(i\), \(K_{d_{ij}}\) is the sum of pairwise differences between all pairs of samples comprising one sample from population \(i\) and the other from population \(j\), \(n_W\) is the number of populations, and \(n_B\) is the number of pairs of populations (populations with fewer than two samples are excluded).
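As a rough illustration of how these quantities combine, the sketch below computes \(F_{ST}\) in the form \(1 - H_w/H_b\) used by Hudson et al., where \(H_w\) is the mean over populations of \(K_i\) divided by the number of within-population pairs, and \(H_b\) is the mean over population pairs of \(K_{d_{ij}}\) divided by \(n_i n_j\). This weighting is an assumption and may not match the library exactly. The names `hudson_fst` and `pairwise_differences` are hypothetical.

```python
from itertools import combinations, product


def pairwise_differences(seq_a, seq_b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b))


def hudson_fst(populations):
    """Fst = 1 - Hw/Hb for a list of populations (lists of haplotype strings)."""
    # Populations with fewer than two samples are excluded.
    pops = [pop for pop in populations if len(pop) >= 2]
    if len(pops) < 2:
        return None

    # Mean within-population diversity (average over n_W populations).
    within = []
    for pop in pops:
        k_i = sum(pairwise_differences(a, b) for a, b in combinations(pop, 2))
        within.append(k_i / (len(pop) * (len(pop) - 1) / 2))
    h_w = sum(within) / len(within)

    # Mean between-population diversity (average over n_B population pairs).
    between = []
    for pop_i, pop_j in combinations(pops, 2):
        k_d = sum(pairwise_differences(a, b) for a, b in product(pop_i, pop_j))
        between.append(k_d / (len(pop_i) * len(pop_j)))
    h_b = sum(between) / len(between)

    if h_b == 0:
        return None  # undefined when there is no between-population difference
    return 1 - h_w / h_b


print(hudson_fst([["AA", "AA"], ["TT", "TT"]]))
```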
Hudson (Genetics 2000 155:2011-2014) introduced the nearest neighbour statistic. The nearest neighbour of a given sequence \(i\) is the sequence with the fewest pairwise differences from sequence \(i\) (excluding itself). Several sequences can be tied as nearest neighbours. Then, \(X_i\) is the proportion of those nearest neighbours which come from the same population as sequence \(i\), and \(S_{nn}\) is the average of \(X_i\):
\[S_{nn} = \frac{1}{n} \sum_{i=1}^{n} X_i\]
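The definition of the nearest neighbour statistic translates directly into code. The sketch below is a plain Python illustration, not the library's implementation. The function name `hudson_snn` and the two parallel input lists (haplotype strings and population labels) are assumptions.

```python
def hudson_snn(haplotypes, populations):
    """Average proportion of nearest neighbours from the same population.

    `haplotypes` is a list of equal-length strings and `populations` a list
    of labels of the same length (hypothetical layout, no missing data).
    """
    n = len(haplotypes)
    total = 0.0
    for i in range(n):
        # Distance from sequence i to every other sequence.
        distances = [
            (sum(a != b for a, b in zip(haplotypes[i], haplotypes[j])), j)
            for j in range(n) if j != i
        ]
        smallest = min(dist for dist, _ in distances)
        # All sequences tied at the smallest distance are nearest neighbours.
        nearest = [j for dist, j in distances if dist == smallest]
        same = sum(populations[j] == populations[i] for j in nearest)
        total += same / len(nearest)  # X_i
    return total / n


print(hudson_snn(["AAAA", "AAAT", "TTTT", "TTTA"], [0, 0, 1, 1]))
```

Values close to 1 indicate that sequences tend to sit next to sequences from their own population, that is, strong differentiation.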
Standardized association index¶
The \(\bar{r}_d\) statistic was introduced by Agapow and Burt (Mol. Ecol. Notes 2001 1:101-102).
with:
and:
where the site variance is given, for site \(s\), by:
where \(L\) is the total number of sites considered, \(k_{ij}\) is the number of sites with available data for samples \(i\) and \(j\), \(n_P\) is the number of pairs of samples with \(k_{ij}\) greater than 0, \(n_s\) is the number of samples available at site \(s\), and \(d_{sij}\) is the number of alleles of the genotype of individual \(i\) that are not present in the genotype of individual \(j\) at site \(s\).
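To make the structure of \(\bar{r}_d\) concrete, here is a simplified sketch restricted to haploid data without missing values, so that \(d_{sij}\) is 0 or 1 and every pair of samples is used at every site. It follows the form given by Agapow and Burt, \(\bar{r}_d = (V_O - V_E) / (2\sum_{s<t}\sqrt{\mathrm{var}_s \mathrm{var}_t})\), and does not reproduce the missing-data handling described above. The function name `standardized_association_index` is hypothetical.

```python
from itertools import combinations
from math import sqrt


def standardized_association_index(haplotypes):
    """Simplified r_d for equal-length haploid sequences, no missing data."""
    pairs = list(combinations(range(len(haplotypes)), 2))
    n_sites = len(haplotypes[0])

    # d[s][p] is 1 if the two samples of pair p differ at site s, else 0.
    d = [[int(haplotypes[i][s] != haplotypes[j][s]) for i, j in pairs]
         for s in range(n_sites)]

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    site_var = [variance(row) for row in d]
    # Total distance of each pair, summed over sites.
    totals = [sum(column) for column in zip(*d)]

    v_observed = variance(totals)
    v_expected = sum(site_var)
    denominator = 2 * sum(sqrt(a * b) for a, b in combinations(site_var, 2))
    if denominator == 0:
        return None  # undefined without at least two variable sites
    return (v_observed - v_expected) / denominator


print(standardized_association_index(["AA", "AA", "TT", "TT"]))
```

With two perfectly associated sites, as in the example call, the index reaches its maximum of 1 (up to floating-point rounding).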
Linkage disequilibrium summary statistics¶
Kelly (Genetics 1997 146:1197-1206) introduced a neutrality statistic based on pairwise linkage disequilibrium values:
Two variants are available:
Rozas et al. (Genetics 2001 158:1147-1155) introduced the additional statistic \(ZZ\):
\[ZZ = Z_A - Z_{nS}\]
where \(Z_A\) is computed as \(Z_{nS}\) but considering only adjacent polymorphic sites (that is, pairs of polymorphic sites that don’t have a polymorphic site in between).
\(n\) is the number of allele pairs considered for each statistic.
The sums of \(r^2\) and of \({D'}^2\) are computed over all pairs of sites. For sites with more than two alleles, the behaviour is controlled by the option LD_multiallelic:
ignore: skip all sites with more than two alleles.
use_main: use the most frequent allele.
use_all: use all possible pairs of alleles.
Linkage disequilibrium statistics are defined here
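The sketch below shows how \(Z_{nS}\), \(Z_A\), and \(ZZ\) relate to each other for the simplest case: polymorphic sites with exactly two alleles and no missing data. \(Z_{nS}\) is taken as the mean \(r^2\) over all pairs of sites, \(Z_A\) as the mean over adjacent pairs, and \(ZZ\) as their difference. The function names `r_squared` and `kelly_rozas_statistics` are hypothetical, and the \(Z^*\) variants and multiallelic options are not covered.

```python
from itertools import combinations


def r_squared(site_a, site_b):
    """Squared allelic correlation between two biallelic sites."""
    n = len(site_a)
    ref_a, ref_b = site_a[0], site_b[0]  # arbitrary reference alleles
    p_a = sum(x == ref_a for x in site_a) / n
    p_b = sum(y == ref_b for y in site_b) / n
    p_ab = sum(x == ref_a and y == ref_b for x, y in zip(site_a, site_b)) / n
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def kelly_rozas_statistics(sites):
    """Return (ZnS, ZA, ZZ) for at least two polymorphic biallelic sites."""
    n_sites = len(sites)
    all_pairs = [r_squared(sites[i], sites[j])
                 for i, j in combinations(range(n_sites), 2)]
    adjacent = [r_squared(sites[i], sites[i + 1]) for i in range(n_sites - 1)]
    z_ns = sum(all_pairs) / len(all_pairs)
    z_a = sum(adjacent) / len(adjacent)
    return z_ns, z_a, z_a - z_ns


print(kelly_rozas_statistics([[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]))
```

Because \(r^2\) is unchanged when the reference allele is swapped, the choice of the first sample's allele as reference does not affect the result.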
Fu’s statistic¶
Fu’s \(F_S\) (Genetics 1997 147:915-925) is computed as:
with:
where \(K\) is the number of haplotypes, \(n\) is the number of samples used, and \(S_n^k\) is the Stirling number of the first kind, computed as:
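As a worked illustration, the sketch below follows the form given in Fu's paper: \(F_S = \ln\left(\frac{S'}{1-S'}\right)\), where \(S'\) is the probability of observing at least \(K\) haplotypes in a sample of size \(n\), namely \(S' = \sum_{k=K}^{n} |S_n^k| \theta^k / \prod_{i=0}^{n-1}(\theta + i)\). Unsigned Stirling numbers of the first kind are built with the recurrence \(c(m+1, k) = m\,c(m, k) + c(m, k-1)\). The function names and the explicit `theta` argument (in practice an estimate of diversity) are assumptions, and the library's exact expression may differ.

```python
from math import log


def unsigned_stirling_first_kind(n):
    """Return the list [c(n, 0), ..., c(n, n)] of unsigned Stirling numbers."""
    row = [1]  # c(0, 0) = 1
    for m in range(n):
        new_row = [0] * (len(row) + 1)
        for k, value in enumerate(row):
            new_row[k + 1] += value   # c(m, k) feeds c(m + 1, k + 1)
            new_row[k] += m * value   # m * c(m, k) feeds c(m + 1, k)
        row = new_row
    return row


def fu_fs(n, n_haplotypes, theta):
    """Fu's Fs for n samples, an observed haplotype count, and a theta value."""
    coefficients = unsigned_stirling_first_kind(n)
    denominator = 1.0
    for i in range(n):
        denominator *= theta + i
    # Probability of observing at least n_haplotypes haplotypes.
    s_prime = sum(coefficients[k] * theta ** k
                  for k in range(n_haplotypes, n + 1)) / denominator
    if s_prime >= 1.0:
        return None  # logit undefined (happens when n_haplotypes is 1)
    return log(s_prime / (1 - s_prime))


print(unsigned_stirling_first_kind(4))
print(fu_fs(3, 2, 1.0))
```

For large samples the Stirling numbers grow very quickly. Python integers handle this exactly, but the floating-point ratio may need to be evaluated in log space.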