Single site statistics
Statistics that can be computed for a single site are mainly aimed at genetic markers exhibiting many alleles (such as microsatellites). Some of them can be relevant to DNA polymorphism, but in most cases they should then be averaged over many sites.
All those statistics are computed by the class ComputeStats. The methods process_freq() and process_site() return the values for a single site, while process_sites() and process_align() compute an average over all provided sites.
| Code | Definition | Formula | Requirement | Notes |
|---|---|---|---|---|
| ns_site | Number of analyzed samples | | | 1 |
| | Number of analyzed outgroup samples | | Outgroup | 1 |
| | Number of alleles in ingroup | | | |
| | Number of alleles in outgroup | | Outgroup | |
| | Number of alleles in whole dataset | | | |
| | Number of singleton alleles | | | 2 |
| | Number of singleton alleles (derived) | | Outgroup | 2 |
| | Allelic richness | | | |
| | Expected heterozygosity | | | |
| thetaIAM | \(\theta\) estimator under the IAM model | | | |
| thetaSMM | \(\theta\) estimator under the SMM model | | | |
| | Observed heterozygosity | | Individuals | 3 |
| | Inbreeding coefficient | | Individuals | |
| | Minority allele relative frequency | | | |
| | Minority allele per population | | Populations | 4, 5 |
| | Hudson’s Hst | | Populations | |
| | Nei’s Gst | | Populations | |
| | Hedrick’s Gst’ | | Populations | |
| | Jost’s D | | Populations | |
| | Weir and Cockerham estimator (haploid data) | | Populations | 6 |
| FistWC | Weir and Cockerham estimators (diploid data) | | Populations, individuals | 6, 7 |
| FisctWC | Weir and Cockerham estimators (hierarchical) | | Populations, individuals, clusters | 6, 7 |
| | Number of population-specific alleles | | Populations | 8 |
| | Number of population-specific derived alleles | | Populations, outgroup | 8 |
| | Number of shared alleles | | Populations | 8 |
| | Number of shared segregating alleles | | Populations | 8 |
| | Number of fixed alleles | | Populations | 8 |
| | Number of fixed differences | | Populations | 8 |
| | Number of sites with at least one population-specific allele | | Populations | 8, 9 |
| | Number of sites with at least one population-specific derived allele | | Populations, outgroup | 8, 9 |
| | Number of sites with at least one shared allele | | Populations | 8, 9 |
| | Number of sites with at least one shared segregating allele | | Populations | 8, 9 |
| | Number of sites with at least one fixed allele | | Populations | 8, 9 |
| | Number of sites with at least one fixed difference | | Populations | 8, 9 |
| | Number of sites falling into fixation pattern categories | | Three populations | 10 |
Notes:
1. Total number of samples excluding all samples with missing data. A sample is defined as a sampled allele (a diploid individual corresponds to two samples).
2. A singleton allele is an allele present in one copy in the whole sample (excluding outgroup).
3. Computed as the proportion of heterozygote individuals.
4. Relative frequency in each population of the allele which is minority in the whole sample, even if it is absent or not minority in some populations.
5. Returned as a list, even if there is only one population.
6. Multisite average is computed as the ratio of the sum of numerator terms to the sum of denominator terms for all exploitable sites.
7. Returned as a list with the different estimators (see formulas).
8. A population-specific allele is an allele which is at non-null frequency in one population only. A fixed allele is an allele which is at frequency 0 in at least one population and at (relative) frequency 1 in at least one population. A shared allele is an allele which is at non-null frequencies in at least two populations. A shared polymorphism is a pair of populations which have at least two common segregating (0 < relative frequency < 1) alleles. A fixed difference is a pair of populations which have two different alleles at relative frequency 1.
9. Only computed if several sites are analyzed.
10. Only biallelic sites meeting the missing data criterion are considered. The criterion is given by the configuration option triconfig_min (minimum number of samples per population, default 2), and max_missing, if relevant, is ignored. The result is given as a 13-item list, filled with zeros by default, giving the counts for the patterns in the following order (where A and B stand for two arbitrary alleles fixed in a population, and P a polymorphism of the two alleles in the population): ABB, ABA, AAB, PAA, PAB, APA, APB, AAP, ABP, PPA, PAP, APP, PPP.
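The fixation pattern categories described in the last note can be illustrated with a small standalone sketch. It does not use the library: the function names and the input layout (three lists of alleles per site, missing data already removed) are hypothetical, and the labelling rule (the first fixed allele met is called A, the other one B) is an assumption that happens to reproduce the 13 patterns listed above.

```python
# Order of the 13 patterns, as listed in the note above.
PATTERNS = ["ABB", "ABA", "AAB", "PAA", "PAB", "APA", "APB",
            "AAP", "ABP", "PPA", "PAP", "APP", "PPP"]


def fixation_pattern(pops, min_samples=2):
    """Return the pattern of one site, or None if the site is skipped.

    pops: three lists of alleles, one per population (hypothetical layout).
    min_samples: plays the role of the triconfig_min option.
    """
    if len(pops) != 3 or any(len(p) < max(min_samples, 1) for p in pops):
        return None
    if len(set().union(*pops)) != 2:
        return None  # only biallelic sites are considered
    labels = {}  # maps each fixed allele to "A" or "B" by order of appearance
    pattern = ""
    for pop in pops:
        alleles = set(pop)
        if len(alleles) == 2:
            pattern += "P"  # both alleles segregate in this population
        else:
            allele = next(iter(alleles))
            if allele not in labels:
                labels[allele] = "AB"[len(labels)]
            pattern += labels[allele]
    return pattern


def count_patterns(sites, min_samples=2):
    """Return the 13-item list of counts, filled with zeros by default."""
    counts = [0] * len(PATTERNS)
    for pops in sites:
        pattern = fixation_pattern(pops, min_samples)
        if pattern is not None:
            counts[PATTERNS.index(pattern)] += 1
    return counts
```

A site where the first population is fixed for one allele and the two others for the other allele is counted in the first slot (ABB), while a site monomorphic in the whole sample is not biallelic and is therefore skipped.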
Basic statistics
with:
\(n\), the number of samples (given by ns_site)
\(k\), the number of alleles
\(p_i\), the relative frequency of allele \(i\)
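As an illustration of how \(n\), \(k\) and \(p_i\) combine, here is a minimal standalone sketch. The formulas themselves are not shown in this section, so the expressions used below are assumptions: allelic richness is taken as \((k-1)/(n-1)\) and expected heterozygosity as the usual unbiased estimator \(\frac{n}{n-1}\left(1-\sum_i p_i^2\right)\). The function name, the dictionary keys and the missing-data symbol are hypothetical.

```python
from collections import Counter


def basic_stats(alleles, missing="?"):
    """Compute basic single-site statistics from a flat list of alleles.

    Samples equal to `missing` are excluded, as in note 1 of the table.
    """
    samples = [a for a in alleles if a != missing]
    n = len(samples)                      # number of analyzed samples
    counts = Counter(samples)
    k = len(counts)                       # number of alleles
    singletons = sum(1 for c in counts.values() if c == 1)
    if n < 2:
        # Both estimators below divide by n - 1.
        return {"n": n, "k": k, "singletons": singletons, "R": None, "He": None}
    freqs = [c / n for c in counts.values()]   # relative frequencies p_i
    # Assumed forms (see the lead-in); they may differ from the library's.
    richness = (k - 1) / (n - 1)
    he = n / (n - 1) * (1 - sum(p * p for p in freqs))
    return {"n": n, "k": k, "singletons": singletons, "R": richness, "He": he}
```

For the sample A, A, A, B plus one missing value, this gives four analyzed samples, two alleles, one singleton and an expected heterozygosity of 0.5 under the assumed estimator.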
Theta estimators
thetaIAM
thetaSMM
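The formulas for these two estimators are not reproduced here. A minimal sketch, assuming they are the classical moment estimators obtained by inverting the equilibrium expectation of heterozygosity under each model (\(H = \theta/(1+\theta)\) for the infinite allele model, \(H = 1 - 1/\sqrt{1+2\theta}\) for the stepwise mutation model), is given below. The function names are hypothetical and the library may use a different form.

```python
def theta_iam(he):
    """Theta from expected heterozygosity, infinite allele model (assumed form)."""
    if he >= 1:
        return None  # undefined when heterozygosity reaches 1
    return he / (1 - he)


def theta_smm(he):
    """Theta from expected heterozygosity, stepwise mutation model (assumed form)."""
    if he >= 1:
        return None
    return 0.5 * (1 / (1 - he) ** 2 - 1)
```

With an expected heterozygosity of 0.5, these expressions give 1.0 under the IAM and 1.5 under the SMM.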
Fixation index (departure from Hardy-Weinberg equilibrium)
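The formula of the inbreeding coefficient is not shown in this section. The sketch below assumes the usual definition \(F_{IS} = 1 - H_o/H_e\), where \(H_o\) is the proportion of heterozygote individuals (note 3 of the table) and \(H_e\) is the unbiased expected heterozygosity computed over sampled alleles. Function names and the genotype layout (a list of two-allele tuples) are hypothetical.

```python
from collections import Counter


def observed_heterozygosity(genotypes):
    """Proportion of heterozygote individuals among (allele, allele) tuples."""
    return sum(1 for a, b in genotypes if a != b) / len(genotypes)


def inbreeding_coefficient(genotypes):
    """1 - Ho/He, assuming the unbiased estimator of He (see the lead-in)."""
    ho = observed_heterozygosity(genotypes)
    alleles = [allele for genotype in genotypes for allele in genotype]
    n = len(alleles)  # a diploid individual corresponds to two samples
    counts = Counter(alleles)
    he = n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts.values()))
    if he == 0:
        return None  # monomorphic site: the ratio is undefined
    return 1 - ho / he
```

Positive values indicate a deficit of heterozygotes relative to Hardy-Weinberg expectations, negative values an excess.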
Population differentiation
In this section we define:
\(r\), the number of populations
\(n_i\), the sample size of population \(i\)
\(n_t\), the total sample size
\(k\), the number of alleles
\(p_i\), the relative frequency of allele \(i\) in the whole sample
\(p_{ij}\), the relative frequency of allele \(i\) in population \(j\)
and we exclude any population with fewer than two samples.
\(H_{ST}\) (Hudson et al. Mol. Biol. Evol. 1992 9:138-151) is defined as follows:
with
and
with:
Nei’s \(G_{ST}\) (Hudson et al. Mol. Biol. Evol. 1992 9:138-151) is defined as follows:
with
and
with
\(G_{ST}'\) (Hedrick Evolution 2005 59:1633-1638) is defined as:
with
and
Jost’s \(D\) (Mol. Ecol. 2008 17:4015-4026) is computed as:
with:
and
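To show how \(G_{ST}\), \(G_{ST}'\) and \(D\) relate to one another, the following standalone sketch uses the uncorrected textbook definitions: \(H_S\) is the unweighted mean within-population gene diversity, \(H_T\) the gene diversity of the mean allele frequencies, \(G_{ST} = 1 - H_S/H_T\), \(G_{ST}'\) is \(G_{ST}\) divided by its maximum given \(H_S\), and \(D = \frac{H_T - H_S}{1 - H_S} \cdot \frac{r}{r-1}\). These are assumptions made for illustration: the estimators documented in this section may include sample-size corrections and a different weighting of populations. The function name and input layout are hypothetical.

```python
from collections import Counter


def differentiation(pops):
    """Textbook Gst, Gst' and D for one site (no bias correction).

    pops: list of populations, each a flat list of alleles.
    """
    # Populations with fewer than two samples are excluded.
    pops = [p for p in pops if len(p) >= 2]
    r = len(pops)
    if r < 2:
        return None
    # Relative allele frequencies p_ij within each population.
    freqs = [{a: c / len(p) for a, c in Counter(p).items()} for p in pops]
    alleles = set().union(*freqs)
    hs = sum(1 - sum(f * f for f in fr.values()) for fr in freqs) / r
    ht = 1 - sum((sum(fr.get(a, 0.0) for fr in freqs) / r) ** 2 for a in alleles)
    if ht == 0 or hs == 1:
        return None  # ratios are undefined
    gst = 1 - hs / ht
    gst_prime = gst * (r - 1 + hs) / ((r - 1) * (1 - hs))
    d = (ht - hs) / (1 - hs) * r / (r - 1)
    return {"Gst": gst, "Gst_prime": gst_prime, "D": d}
```

Under these definitions, two populations fixed for different alleles give a value of 1 for all three statistics, and two populations with identical frequencies give 0.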
F-statistics estimators
Estimators of F-statistics are based on Weir and Cockerham (Evolution 1984 38:1358-1370) and Weir and Hill (Annu. Rev. Genet. 2002 36:721-750).
Different estimators are available depending on which levels of structure are provided through a Structure instance.
Population structure only
If only the population structure is available, only the equivalent of \(F_{ST}\) (\(\hat{\theta}\) in Weir and Cockerham’s notation) is available.
where \(n_p\) is the number of samples of population \(p\), \(n_t\) is the total number of samples, and \(k\) is the number of considered populations. Only populations with at least two samples are considered.
For a given allele \(i\), we compute:
where \(\bar{p}_i\) is the overall relative frequency of allele \(i\) in the whole sample and \(p_{ip}\) is the relative frequency of allele \(i\) in population \(p\).
The equivalent of \(F_{ST}\) is then computed as:
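The formulas of this subsection are not reproduced here. The sketch below assumes the usual haploid form of Weir and Cockerham's estimator: for each allele, a between-population mean square \(\sum_p n_p (p_{ip} - \bar{p}_i)^2 / (k-1)\) and a within-population mean square \(\sum_p n_p p_{ip}(1-p_{ip}) / (n_t - k)\), combined with \(n_c = \left(n_t - \sum_p n_p^2 / n_t\right)/(k-1)\). It also shows the multisite average as a ratio of sums over sites. Function names and the input layout are hypothetical, and the library's exact expressions may differ.

```python
def wc_haploid_terms(pops):
    """Return (numerator, denominator) of the estimator for one site.

    pops: list of populations, each a flat list of alleles.
    """
    # Only populations with at least two samples are considered.
    pops = [p for p in pops if len(p) >= 2]
    k = len(pops)
    if k < 2:
        return None
    n_t = sum(len(p) for p in pops)
    n_c = (n_t - sum(len(p) ** 2 for p in pops) / n_t) / (k - 1)
    num = den = 0.0
    for allele in set().union(*pops):
        freqs = [p.count(allele) / len(p) for p in pops]      # p_ip
        p_bar = sum(p.count(allele) for p in pops) / n_t       # overall frequency
        between = sum(len(p) * (f - p_bar) ** 2 for p, f in zip(pops, freqs)) / (k - 1)
        within = sum(len(p) * f * (1 - f) for p, f in zip(pops, freqs)) / (n_t - k)
        num += between - within
        den += between + (n_c - 1) * within
    return num, den


def wc_haploid(sites):
    """Multisite estimate: sum of numerators over sum of denominators."""
    terms = [t for t in map(wc_haploid_terms, sites) if t is not None]
    num = sum(t[0] for t in terms)
    den = sum(t[1] for t in terms)
    return num / den if den > 0 else None
```

Passing a single site in a one-item list gives the per-site value. Summing numerators and denominators separately, rather than averaging per-site ratios, keeps weakly informative sites from dominating the average.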
Population and individual structure
If both population and individual structures are available, the decomposition of inbreeding in three terms, \(F\) (equivalent to \(F_{IT}\)), \(\theta\) (equivalent to \(F_{ST}\)), and \(f\) (equivalent to \(F_{IS}\)), is possible. The estimators of these fixation indexes are defined below, following Weir and Cockerham (Evolution 1984 38:1358-1370).
The estimators are based on three components of variance, noted \(a\) (between populations), \(b\) (between individuals within populations), and \(c\) (within individuals):
with:
\(A\), the number of alleles
\(k\), the number of populations with at least one individual
\(\bar{n}\), the average number of individuals per population
\(\bar{p}_i\), the relative frequency of allele \(i\) in the whole sample
\(\bar{h}_i\), the proportion of individuals carrying allele \(i\) in the heterozygous state, calculated in the whole sample
\(s^2_i\), as defined below:
\(n_c\), as defined below:
\(n_p\), the number of individuals in population \(p\)
\(p_{ap}\), the relative frequency of allele \(a\) in population \(p\)
The return value for FistWC is a tuple with the three F-statistics estimators: \(\left(\hat{f}, \hat{\theta}, \hat{F}\right)\), which are equivalent to \(\left(F_{IS}, F_{ST}, F_{IT}\right)\) and are defined as follows:
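Since the formulas are not reproduced in this section, here is a standalone sketch of the three components of variance as given in Weir and Cockerham (1984), summed over alleles, and of the three ratios built from them. It is meant as an illustration only: the function names and the genotype layout (one list of two-allele tuples per population) are hypothetical, and the library's implementation may differ in details such as the handling of missing data.

```python
def wc_components(pops):
    """Return (a, b, c) summed over alleles for one site.

    pops: list of populations, each a list of (allele, allele) genotypes.
    """
    pops = [p for p in pops if len(p) >= 1]   # populations with an individual
    k = len(pops)
    sizes = [len(p) for p in pops]            # n_p, in individuals
    n_tot = sum(sizes)
    if k < 2 or n_tot <= k:
        return None                           # n_bar must exceed 1
    n_bar = n_tot / k
    n_c = (n_tot - sum(n * n for n in sizes) / n_tot) / (k - 1)
    a = b = c = 0.0
    for allele in {x for p in pops for g in p for x in g}:
        p_pop = [sum(g.count(allele) for g in p) / (2 * len(p)) for p in pops]
        h_pop = [sum(1 for g in p if g.count(allele) == 1) / len(p) for p in pops]
        p_bar = sum(n * f for n, f in zip(sizes, p_pop)) / n_tot
        h_bar = sum(n * h for n, h in zip(sizes, h_pop)) / n_tot
        s2 = sum(n * (f - p_bar) ** 2 for n, f in zip(sizes, p_pop)) / ((k - 1) * n_bar)
        core = p_bar * (1 - p_bar) - (k - 1) / k * s2
        a += n_bar / n_c * (s2 - (core - h_bar / 4) / (n_bar - 1))
        b += n_bar / (n_bar - 1) * (core - (2 * n_bar - 1) / (4 * n_bar) * h_bar)
        c += h_bar / 2
    return a, b, c


def f_statistics(pops):
    """Return (f, theta, F); an item is None when its denominator is zero."""
    components = wc_components(pops)
    if components is None:
        return None
    a, b, c = components
    f = 1 - c / (b + c) if b + c != 0 else None
    theta = a / (a + b + c) if a + b + c != 0 else None
    big_f = 1 - c / (a + b + c) if a + b + c != 0 else None
    return f, theta, big_f
```

When all three values are defined, they satisfy \((1 - \hat{F}) = (1 - \hat{f})(1 - \hat{\theta})\), which is a convenient sanity check on any implementation.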
Clusters, population and individual structure
If, in addition, populations are grouped in clusters, it is possible to compute an additional fixation index: the between-population fixation index \(\theta\) (or \(F_{ST}\)) is subdivided in a between-population, within-cluster component \(\theta_1\) (or \(F_{SC}\)) and a between-cluster component \(\theta_2\) (or \(F_{CT}\)). The estimators are based on four components of variance, noted \(a\) (between clusters), \(b_2\) (between populations within clusters), \(b_1\) (between individuals within populations), and \(c\) (within individuals). They are computed as described in Weir and Cockerham (Evolution 1984 38:1358-1370).
\(\alpha\) (MSG in Weir and Cockerham’s article) is computed as:
\(\beta\) (MSI in Weir and Cockerham’s article) is computed as:
\(\delta\) (MSD in Weir and Cockerham’s article) is computed as:
\(\epsilon\) (MSP in Weir and Cockerham’s article) is computed as:
with:
\(k\) number of populations with at least one individual
\(r\) number of clusters with at least one population
\(n\) total number of individuals (in considered populations)
\(n_p\) number of individuals in population \(p\)
\(n_c\) number of individuals in cluster \(c\)
\(p_i\) relative frequency of allele \(i\) in the whole sample
\(p_{ip}\) relative frequency of allele \(i\) in population \(p\)
\(p_{ic_p}\) relative frequency of allele \(i\) in the cluster containing population \(p\)
\(p_{ic}\) relative frequency of allele \(i\) in the cluster \(c\)
\(h_{ip}\) number of heterozygote individuals carrying allele \(i\) in population \(p\)
The return value for FisctWC is a tuple with the four F-statistics estimators: \(\left(\hat{f}, \hat{\theta}_1, \hat{\theta}_2, \hat{F}\right)\), which are equivalent to \(\left(F_{IS}, F_{SC}, F_{CT}, F_{IT}\right)\) and are defined as follows: