Single site statistics
Statistics that can be computed for a single site are mainly aimed at genetic markers exhibiting many alleles (such as microsatellites). Some of them are also relevant to DNA polymorphism, but in most cases they should then be averaged over many sites.
All these statistics are computed by the class ComputeStats. The methods process_freq() and process_site() return the values for a single site, while process_sites() and process_align() compute an average over all provided sites.
| Code | Definition | Formula | Requirement | Notes |
|---|---|---|---|---|
| ns_site | Number of analyzed samples | | | 1 |
| | Number of analyzed outgroup samples | | Outgroup | 1 |
| | Number of alleles in ingroup | | | |
| | Number of alleles in outgroup | | Outgroup | |
| | Number of alleles in whole dataset | | | |
| | Number of singleton alleles | | | 2 |
| | Number of singleton alleles (derived) | | Outgroup | 2 |
| | Allelic richness | | | |
| | Expected heterozygosity | | | |
| thetaIAM | \(\theta\) estimator under the IAM model | | | |
| thetaSMM | \(\theta\) estimator under the SMM model | | | |
| | Observed heterozygosity | | Individuals | 3 |
| | Inbreeding coefficient | | Individuals | |
| | Minority allele relative frequency | | | |
| | Minority allele per population | | Populations | 4, 5 |
| | Hudson’s Hst | | Populations | |
| | Nei’s Gst | | Populations | |
| | Hedrick’s Gst’ | | Populations | |
| | Jost’s D | | Populations | |
| | Weir and Cockerham estimator (haploid data) | | Populations | 6 |
| FistWC | Weir and Cockerham estimators (diploid data) | | Populations, individuals | 6, 7 |
| FisctWC | Weir and Cockerham estimators (hierarchical) | | Populations, individuals, clusters | 6, 7 |
| | Number of population-specific alleles | | Populations | 8 |
| | Number of population-specific derived alleles | | Populations, outgroup | 8 |
| | Number of shared alleles | | Populations | 8 |
| | Number of shared segregating alleles | | Populations | 8 |
| | Number of fixed alleles | | Populations | 8 |
| | Number of fixed differences | | Populations | 8 |
| | Number of sites with at least one population-specific allele | | Populations | 8, 9 |
| | Number of sites with at least one population-specific derived allele | | Populations, outgroup | 8, 9 |
| | Number of sites with at least one shared allele | | Populations | 8, 9 |
| | Number of sites with at least one shared segregating allele | | Populations | 8, 9 |
| | Number of sites with at least one fixed allele | | Populations | 8, 9 |
| | Number of sites with at least one fixed difference | | Populations | 8, 9 |
| | Number of sites falling into fixation pattern categories | | Three populations | 10 |
Notes:
1. Total number of samples, excluding all samples with missing data. A sample is defined as a sampled allele (a diploid individual corresponds to two samples).
2. A singleton allele is an allele present in a single copy in the whole sample (excluding the outgroup).
3. Computed as the proportion of heterozygote individuals.
4. Relative frequency, in each population, of the allele which is in the minority in the whole sample, even if it is absent or not in the minority in some populations.
5. Returned as a list, even if there is only one population.
6. The multi-site average is computed as the ratio of the sum of numerator terms to the sum of denominator terms over all exploitable sites.
7. Returned as a list with the different estimators (see formulas).
8. A population-specific allele is an allele which is at non-null frequency in one population only. A fixed allele is an allele which is at frequency 0 in at least one population and at (relative) frequency 1 in at least one population. A shared allele is an allele which is at non-null frequencies in at least two populations. A shared polymorphism is a pair of populations which have at least two common segregating (0 < relative frequency < 1) alleles. A fixed difference is a pair of populations which have two different alleles at relative frequency 1.
9. Only computed if several sites are analyzed.
10. Only biallelic sites meeting the missing data criterion are considered. The criterion is given by the configuration option triconfig_min (minimum number of samples per population, default 2); max_missing, if relevant, is ignored. The result is given as a 13-item list, filled with zeros by default, giving the counts for the patterns in the following order (where A and B stand for two arbitrary alleles fixed in a population, and P a polymorphism of the two alleles in the population): ABB, ABA, AAB, PAA, PAB, APA, APB, AAP, ABP, PPA, PAP, APP, PPP.
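The classification of a site into one of the 13 fixation patterns can be sketched in plain Python as follows. This is only an illustration and not the library's implementation: the names fixation_pattern, count_patterns and min_samples (standing in for triconfig_min) are hypothetical, missing data are assumed to have been removed from the input lists beforehand, and the sketch assumes that A denotes the first fixed allele met when reading the populations from left to right.

```python
PATTERNS = ["ABB", "ABA", "AAB", "PAA", "PAB", "APA", "APB",
            "AAP", "ABP", "PPA", "PAP", "APP", "PPP"]


def fixation_pattern(pops, min_samples=2):
    """Return the pattern string of one site, or None if the site is skipped.

    pops: three lists of sampled alleles (one list per population).
    """
    if len(pops) != 3 or any(len(p) < min_samples for p in pops):
        return None                       # missing data criterion not met
    if len({a for p in pops for a in p}) != 2:
        return None                       # only biallelic sites are considered
    labels = {}                           # fixed allele -> "A" or "B"
    pattern = ""
    for p in pops:
        present = set(p)
        if len(present) == 2:
            pattern += "P"                # both alleles segregate here
        else:
            allele = next(iter(present))
            if allele not in labels:      # first fixed allele is A, second is B
                labels[allele] = "AB"[len(labels)]
            pattern += labels[allele]
    return pattern


def count_patterns(sites):
    """Return the 13-item list of pattern counts over several sites."""
    counts = [0] * len(PATTERNS)
    for site in sites:
        pat = fixation_pattern(site)
        if pat is not None:
            counts[PATTERNS.index(pat)] += 1
    return counts


sites = [
    [list("CCCC"), list("TTTT"), list("TTTT")],   # fixed, fixed, fixed
    [list("CCTT"), list("TT"), list("CC")],       # polymorphic, fixed, fixed
]
print(count_patterns(sites))
```

Sites that are not biallelic, or where a population has fewer than min_samples samples, leave the counts unchanged.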
Basic statistics
In this section we define:
\(n\), the number of samples (given by ns_site)
\(k\), the number of alleles
\(p_i\), the relative frequency of allele \(i\)
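To make these quantities concrete, the sketch below derives \(n\), \(k\), the \(p_i\) and the number of singleton alleles from a flat list of sampled alleles. It is a minimal illustration rather than the library's code: the function name basic_stats is hypothetical, missing data are assumed to be encoded as None, and expected heterozygosity is assumed to be the usual unbiased estimator \(\frac{n}{n-1}\left(1 - \sum_i p_i^2\right)\).

```python
from collections import Counter


def basic_stats(alleles):
    """Summarise one site from a flat list of sampled alleles (None = missing)."""
    counts = Counter(a for a in alleles if a is not None)
    n = sum(counts.values())    # n: samples without missing data
    k = len(counts)             # k: number of distinct alleles
    if n < 2:
        return {"n": n, "k": k, "singletons": None, "He": None}
    freqs = [c / n for c in counts.values()]                  # p_i
    singletons = sum(1 for c in counts.values() if c == 1)    # one copy only
    # Assumed estimator: He = n / (n - 1) * (1 - sum_i p_i^2)
    he = n / (n - 1) * (1.0 - sum(p * p for p in freqs))
    return {"n": n, "k": k, "singletons": singletons, "He": he}


stats = basic_stats(["A", "A", "A", "B", None])
print(stats)
```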
Theta estimators
thetaIAM
thetaSMM
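As an illustration, both estimators can be obtained by inverting the classical equilibrium expectations of heterozygosity. The sketch below assumes, without confirmation from the text above, that thetaIAM inverts \(H_e = \theta / (1 + \theta)\) (infinite allele model) and that thetaSMM inverts \(H_e = 1 - 1/\sqrt{1 + 2\theta}\) (stepwise mutation model). The function names are hypothetical.

```python
def theta_iam(he):
    """Invert He = theta / (1 + theta) (infinite allele model)."""
    if he is None or he >= 1.0:
        return None                      # estimator undefined
    return he / (1.0 - he)


def theta_smm(he):
    """Invert He = 1 - 1 / sqrt(1 + 2 * theta) (stepwise mutation model)."""
    if he is None or he >= 1.0:
        return None                      # estimator undefined
    return 0.5 * (1.0 / (1.0 - he) ** 2 - 1.0)


print(theta_iam(0.5), theta_smm(0.5))
```

For the same heterozygosity, the stepwise model yields a larger \(\theta\) because recurrent mutation to existing allelic states hides part of the diversity.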
Fixation index (departure from Hardy Weinberg equilibrium)
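A minimal sketch of the computation, assuming the common definition \(F_{IS} = 1 - H_o / H_e\), where \(H_o\) is the proportion of heterozygote individuals and \(H_e\) is the unbiased expected heterozygosity \(\frac{n}{n-1}\left(1 - \sum_i p_i^2\right)\). The function name and the input format (one pair of alleles per individual, None for missing data) are hypothetical.

```python
from collections import Counter


def inbreeding_coefficient(genotypes):
    """Return (Ho, He, Fis) for one site from diploid genotypes given as pairs."""
    # Individuals with any missing allele (None) are skipped entirely
    complete = [g for g in genotypes if None not in g]
    if not complete:
        return None, None, None
    ho = sum(1 for a, b in complete if a != b) / len(complete)
    counts = Counter(allele for g in complete for allele in g)
    n = sum(counts.values())             # two samples per individual
    he = n / (n - 1) * (1.0 - sum((c / n) ** 2 for c in counts.values()))
    fis = 1.0 - ho / he if he > 0 else None   # undefined at monomorphic sites
    return ho, he, fis


print(inbreeding_coefficient([("A", "A"), ("A", "B"), ("B", "B"), ("A", None)]))
```

A positive value indicates a deficit of heterozygotes relative to Hardy Weinberg proportions, a negative value an excess.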
Population differentiation
In this section we define:
\(r\), the number of populations
\(n_i\), the sample size of population \(i\)
\(n_t\), the total sample size
\(k\), the number of alleles
\(p_i\), the relative frequency of allele \(i\) in the whole sample
\(p_{ij}\), the relative frequency of allele \(i\) in population \(j\)
Populations with fewer than two samples are excluded.
\(H_{ST}\) (Hudson et al. Mol. Biol. Evol. 1992 9:138-151) is defined as follows:
with
and
with:
Nei’s \(G_{ST}\) (Hudson et al. Mol. Biol. Evol. 1992 9:138-151) is defined as follows:
with
and
with
Hedrick’s \(G_{ST}'\) (Hedrick Evolution 2005 59:1633-1638) is defined as:
with
and
Jost’s \(D\) (Jost Mol. Ecol. 2008 17:4015-4026) is computed as:
with:
and
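The relationship between \(G_{ST}\), \(G_{ST}'\) and \(D\) can be illustrated with a short sketch based on the textbook definitions. It is deliberately simplified and should not be read as the estimators used by the library: populations are weighted equally, sample sizes are ignored (no small-sample correction), and the names differentiation, hs and ht are hypothetical. It assumes \(G_{ST} = (H_T - H_S)/H_T\), \(G_{ST}' = G_{ST}\,(r - 1 + H_S) / \left((r - 1)(1 - H_S)\right)\) and \(D = \frac{H_T - H_S}{1 - H_S} \cdot \frac{r}{r - 1}\).

```python
def differentiation(pop_freqs):
    """Return (Gst, Gst', D) from one allele frequency dict per population."""
    r = len(pop_freqs)
    if r < 2:
        return None, None, None
    alleles = set().union(*pop_freqs)    # all alleles seen in any population
    # Hs: mean within-population gene diversity
    hs = sum(1.0 - sum(p * p for p in f.values()) for f in pop_freqs) / r
    # Ht: gene diversity computed from the mean allele frequencies
    mean_p = [sum(f.get(a, 0.0) for f in pop_freqs) / r for a in alleles]
    ht = 1.0 - sum(p * p for p in mean_p)
    if ht == 0.0 or hs == 1.0:
        return None, None, None          # statistics undefined
    gst = (ht - hs) / ht
    gst_prime = gst * (r - 1 + hs) / ((r - 1) * (1.0 - hs))
    d = (ht - hs) / (1.0 - hs) * r / (r - 1)
    return gst, gst_prime, d


print(differentiation([{"A": 0.8, "B": 0.2}, {"A": 0.3, "B": 0.7}]))
```

With two populations fixed for different alleles, all three statistics equal 1; with identical frequencies in all populations, they equal 0.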
F-statistics estimators
Estimators of F-statistics are based on Weir and Cockerham (Evolution 1984 38:1358-1370) and Weir and Hill (Annu. Rev. Genet. 2002 36:721-750).
Different estimators are available depending on which levels of structure are provided through a Structure instance.
Population structure only
If only the population structure is provided, only the equivalent of \(F_{ST}\) (\(\hat{\theta}\) in Weir and Cockerham’s notation) can be computed.
where \(n_p\) is the number of samples of population \(p\), \(n_t\) is the total number of samples, and \(k\) is the number of considered populations. Only populations with at least two samples are considered.
For a given allele \(i\), we compute:
where \(\bar{p}_i\) is the overall relative frequency of allele \(i\) in the whole sample and \(p_{ip}\) is the relative frequency of allele \(i\) in population \(p\).
The equivalent of \(F_{ST}\) is then computed as:
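The computation can be sketched as follows for haploid samples. This is a minimal illustration under stated assumptions, not the library's code: it uses the usual mean-square formulation for haploid data, with a between-population term (called msp here) and a within-population term (called msg), and \(n_c = \left(n_t - \sum_p n_p^2 / n_t\right) / (k - 1)\). The function and variable names are hypothetical.

```python
from collections import Counter


def theta_haploid(pops):
    """Estimate theta from haploid samples (one list of alleles per population)."""
    pops = [p for p in pops if len(p) >= 2]   # at least two samples required
    k = len(pops)
    if k < 2:
        return None
    sizes = [len(p) for p in pops]            # n_p
    n_t = sum(sizes)
    n_c = (n_t - sum(n * n for n in sizes) / n_t) / (k - 1)
    pooled = Counter(a for p in pops for a in p)
    num = den = 0.0
    for allele, count in pooled.items():
        p_bar = count / n_t                                # overall frequency
        p_ip = [p.count(allele) / len(p) for p in pops]    # per-population
        msp = sum(n * (x - p_bar) ** 2 for n, x in zip(sizes, p_ip)) / (k - 1)
        msg = sum(n * x * (1 - x) for n, x in zip(sizes, p_ip)) / (n_t - k)
        num += msp - msg                      # numerator, summed over alleles
        den += msp + (n_c - 1) * msg          # denominator, summed over alleles
    return num / den if den else None


print(theta_haploid([list("AAAB"), list("ABBB"), list("AABB")]))
```

Keeping the numerator and denominator as separate sums is also what allows a multi-site estimate to be formed as a ratio of sums rather than as a mean of per-site ratios.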
Population and individual structure
If both population and individual structures are available, inbreeding can be decomposed into three terms: \(F\) (equivalent to \(F_{IT}\)), \(\theta\) (equivalent to \(F_{ST}\)), and \(f\) (equivalent to \(F_{IS}\)). The estimators of these fixation indices are defined below, following Weir and Cockerham (Evolution 1984 38:1358-1370).
The estimators are based on three components of variance, denoted \(a\) (between populations), \(b\) (between individuals within populations), and \(c\) (within individuals):
with:
\(A\), the number of alleles
\(k\), the number of populations with at least one individual
\(\bar{n}\), the average number of individuals per population
\(\bar{p}_i\), the relative frequency of allele \(i\) in the whole sample
\(\bar{h}_i\), the proportion of individuals carrying allele \(i\) in the heterozygous state, calculated in the whole sample
\(s^2_i\), as defined below:
\(n_c\), as defined below:
\(n_p\), the number of individuals in population \(p\)
\(p_{ip}\), the relative frequency of allele \(i\) in population \(p\)
The return value for FistWC is a tuple with the three F-statistics estimators: \(\left(\hat{f}, \hat{\theta}, \hat{F}\right)\), which are equivalent to \(\left(F_{IS}, F_{ST}, F_{IT}\right)\) and are defined as follows:
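The sketch below illustrates how the three estimators can be obtained from the variance components, following the expressions published by Weir and Cockerham (1984). It is written for illustration only and makes several assumptions: the function name wc84 and the input format (one list of allele pairs per population, without missing data) are hypothetical, and the components are summed over alleles before taking ratios, with \(\hat{f} = 1 - c/(b + c)\), \(\hat{\theta} = a/(a + b + c)\) and \(\hat{F} = 1 - c/(a + b + c)\).

```python
def wc84(pops):
    """Return (f, theta, F) from one list of diploid genotypes per population."""
    pops = [p for p in pops if p]             # at least one individual
    k = len(pops)
    sizes = [len(p) for p in pops]            # n_p
    n_tot = sum(sizes)
    if k < 2 or n_tot <= k:
        return None                           # components undefined
    n_bar = n_tot / k
    n_c = (n_tot - sum(n * n for n in sizes) / n_tot) / (k - 1)
    alleles = {al for p in pops for g in p for al in g}
    a = b = c = 0.0
    for al in alleles:
        # allele frequency and proportion of heterozygote carriers, per population
        p_ip = [sum(g.count(al) for g in p) / (2 * len(p)) for p in pops]
        h_ip = [sum(1 for g in p if g.count(al) == 1) / len(p) for p in pops]
        p_bar = sum(n * x for n, x in zip(sizes, p_ip)) / n_tot
        h_bar = sum(n * h for n, h in zip(sizes, h_ip)) / n_tot
        s2 = sum(n * (x - p_bar) ** 2 for n, x in zip(sizes, p_ip)) / ((k - 1) * n_bar)
        core = p_bar * (1 - p_bar) - (k - 1) / k * s2
        a += n_bar / n_c * (s2 - (core - h_bar / 4) / (n_bar - 1))
        b += n_bar / (n_bar - 1) * (core - (2 * n_bar - 1) / (4 * n_bar) * h_bar)
        c += h_bar / 2
    total = a + b + c
    f = 1 - c / (b + c) if b + c else None
    theta = a / total if total else None
    big_f = 1 - c / total if total else None
    return f, theta, big_f


pops = [[("A", "A"), ("A", "B"), ("A", "A")], [("B", "B"), ("A", "B"), ("B", "B")]]
print(wc84(pops))
```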
Clusters, population and individual structure
If, in addition, populations are grouped in clusters, it is possible to compute an additional fixation index: the between-population fixation index \(\theta\) (or \(F_{ST}\)) is subdivided into a between-population, within-cluster component \(\theta_1\) (or \(F_{SC}\)) and a between-cluster component \(\theta_2\) (or \(F_{CT}\)). The estimators are based on four components of variance, denoted \(a\) (between clusters), \(b_2\) (between populations within clusters), \(b_1\) (between individuals within populations), and \(c\) (within individuals). They are computed as described in Weir and Cockerham (Evolution 1984 38:1358-1370).
\(\alpha\) (MSG in Weir and Cockerham’s article) is computed as:
\(\beta\) (MSI in Weir and Cockerham’s article) is computed as:
\(\delta\) (MSD in Weir and Cockerham’s article) is computed as:
\(\epsilon\) (MSP in Weir and Cockerham’s article) is computed as:
with:
\(k\) number of populations with at least one individual
\(r\) number of clusters with at least one population
\(n\) total number of individuals (in considered populations)
\(n_p\) number of individuals in population \(p\)
\(n_c\) number of individuals in cluster \(c\)
\(p_i\) relative frequency of allele \(i\) in the whole sample
\(p_{ip}\) relative frequency of allele \(i\) in population \(p\)
\(p_{ic_p}\) relative frequency of allele \(i\) in the cluster containing population \(p\)
\(p_{ic}\) relative frequency of allele \(i\) in the cluster \(c\)
\(h_{ip}\) number of heterozygote individuals carrying allele \(i\) in population \(p\)
The return value for FisctWC is a tuple with the four F-statistics estimators: \(\left(\hat{f}, \hat{\theta}_1, \hat{\theta}_2, \hat{F}\right)\), which are equivalent to \(\left(F_{IS}, F_{SC}, F_{CT}, F_{IT}\right)\) and are defined as follows: