Single site statistics

Statistics that can be computed for a single site are mainly aimed at genetic markers exhibiting many alleles (such as microsatellites). Some of them can be relevant to DNA polymorphism, but in most cases they should be averaged over many sites.

All these statistics are computed by the class ComputeStats. The methods process_freq() and process_site() return the values for a single site, while process_sites() and process_align() compute an average over all provided sites.

In addition, the method stats.SFS() computes the site frequency spectrum from a set of sites, based on either minor allele (folded) or derived allele (unfolded) frequencies for diallelic sites.
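To make the folded/unfolded distinction concrete, here is a minimal pure-Python sketch. It does not call EggLib: the function name `site_frequency_spectrum` and its arguments are hypothetical, and it assumes that all sites have the same number of samples and no missing data.

```python
from collections import Counter

def site_frequency_spectrum(sites, ancestral=None):
    """Illustrative helper (not the EggLib API).

    sites: list of sequences of alleles, one sequence per site.
    ancestral: optional list with the ancestral allele of each site;
               if omitted, the folded spectrum is returned.
    """
    n = len(sites[0])
    size = n // 2 + 1 if ancestral is None else n + 1
    spectrum = [0] * size
    for index, site in enumerate(sites):
        counts = Counter(site)
        if len(counts) != 2:
            continue  # only diallelic sites are considered
        if ancestral is None:
            copies = min(counts.values())  # minor allele count (folded)
        else:
            if ancestral[index] not in counts:
                continue  # ancestral allele absent: site cannot be oriented
            copies = n - counts[ancestral[index]]  # derived allele count
        spectrum[copies] += 1
    return spectrum

sites = ["AAAC", "GGTT", "ACCC", "AAAA"]
folded = site_frequency_spectrum(sites)
unfolded = site_frequency_spectrum(sites, ancestral=["A", "G", "A", "A"])
```

Item `i` of the returned list is the number of sites at which the minor (or derived) allele is present in `i` copies.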

Code	Definition	Formula	Requirement	Notes
`+site`	All statistics from table
`ns_site`	Number of analyzed samples			1
`ns_site_o`	Number of analyzed outgroup samples		Outgroup	1
`Aing`	Number of alleles in ingroup
`Aotg`	Number of alleles in outgroup		Outgroup
`Atot`	Number of alleles in whole dataset
`As`	Number of singleton alleles			2
`Asd`	Number of singleton alleles (derived)		Outgroup	2
`R`	Allelic richness	(1)
`He`	Expected heterozygosity	(2)
`thetaIAM`	\(\theta\) estimator under the IAM model	(3)
`thetaSMM`	\(\theta\) estimator under the SMM model	(4)
`Ho`	Observed heterozygosity		Individuals	3
`Fis`	Inbreeding coefficient	(5)	Individuals
`maf`	Minority allele relative frequency
`maf_pop`	Minority allele per population		Populations	4,5
`Hst`	Hudson’s Hst	(6)	Populations
`Gst`	Nei’s Gst	(7)	Populations
`Gste`	Hedrick’s Gst’	(8)	Populations
`Dj`	Jost’s D	(9)	Populations
`FstWC`	Weir and Cockerham estimator (haploid data)	(10)	Populations	6
`FistWC`	Weir and Cockerham estimators (diploid data)	(11) (12) (13)	Populations, individuals	6,7
`FisctWC`	Weir and Cockerham estimators (hierarchical)	(14) (15) (16) (17)	Populations, individuals, clusters	6,7
`f2`	Patterson et al’s \(f_2\)	(18)	Two populations
`f3`	Patterson et al’s \(f_3\)	(19)	Three populations, one focal
`f4`	Patterson et al’s \(f_4\)	(20)	Two clusters of two populations each
`Dp`	Patterson et al’s D	(21)	Two clusters of two populations each	6
`numSp`	Number of population-specific alleles		Populations	8
`numSpd`	Number of population-specific derived alleles		Populations, outgroup	8
`numShA`	Number of shared alleles		Populations	8
`numShP`	Number of shared segregating alleles		Populations	8
`numFxA`	Number of fixed alleles		Populations	8
`numFxD`	Number of fixed differences		Populations	8
`numSp*`	Number of sites with at least one population-specific allele		Populations	8, 9
`numSpd*`	Number of sites with at least one population-specific derived allele		Populations, outgroup	8, 9
`numShA*`	Number of sites with at least one shared allele		Populations	8, 9
`numShP*`	Number of sites with at least one shared segregating allele		Populations	8, 9
`numFxA*`	Number of sites with at least one fixed allele		Populations	8, 9
`numFxD*`	Number of sites with at least one fixed difference		Populations	8, 9
`triconfig`	Number of sites falling into fixation pattern categories		Three populations	10

Notes:

1. Total number of samples excluding all samples with missing data. A sample is defined as a sampled allele (a diploid individual corresponds to two samples).
2. A singleton allele is an allele present in one copy in the whole sample (excluding outgroup).
3. Computed as the proportion of heterozygote individuals.
4. Relative frequency in each population of the allele which is minority in the whole sample, even if it is absent or not minority in some populations.
5. Returned as a list, even if there is only one population.
6. Multi-site average is computed as the ratio of the sum of numerator terms to the sum of denominator terms for all exploitable sites.
7. Returned as a list with the different estimators (see formulas).
8. A population-specific allele is an allele which is at non-null frequency in one population only. A fixed allele is an allele which is at frequency 0 in at least one population and at (relative) frequency 1 in at least one population. A shared allele is an allele which is at non-null frequencies in at least two populations. A shared polymorphism is a pair of populations which have at least two common segregating (0 < relative frequency < 1) alleles. A fixed difference is a pair of populations which have two different alleles at relative frequency 1.
9. Only computed if several sites are analyzed.
10. Only biallelic sites meeting the missing data criterion are considered. The criterion is given by the configuration option triconfig_min (minimum number of samples per population, default 2); max_missing, if set, is ignored. The result is given as a 13-item list, filled with zeros by default, giving the counts for the patterns in the following order (where A and B stand for two arbitrary alleles fixed in a population, and P a polymorphism of the two alleles in the population): ABB, ABA, AAB, PAA, PAB, APA, APB, AAP, ABP, PPA, PAP, APP, PPP.
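The fixation patterns counted by triconfig can be illustrated with a short pure-Python sketch. The helper names (`triconfig_pattern`, `triconfig_counts`) and the `min_samples` argument are hypothetical and only mimic the description above. The labelling rule (the first fixed population met is called A) is an assumption that happens to reproduce the 13 listed patterns.

```python
PATTERNS = ["ABB", "ABA", "AAB", "PAA", "PAB", "APA", "APB",
            "AAP", "ABP", "PPA", "PAP", "APP", "PPP"]

def triconfig_pattern(pops):
    """Label one biallelic site; pops holds three sequences of alleles."""
    first_fixed = None
    label = ""
    for pop in pops:
        present = set(pop)
        if len(present) > 1:
            label += "P"  # both alleles segregate in this population
        else:
            (allele,) = present
            if first_fixed is None:
                first_fixed = allele  # first fixed allele is named A
            label += "A" if allele == first_fixed else "B"
    return label

def triconfig_counts(sites, min_samples=2):
    """Count patterns over sites, skipping sites that do not qualify."""
    counts = [0] * len(PATTERNS)
    for pops in sites:
        if len(pops) != 3 or any(len(pop) < min_samples for pop in pops):
            continue  # missing data criterion not met
        if len(set().union(*pops)) != 2:
            continue  # only biallelic sites are considered
        counts[PATTERNS.index(triconfig_pattern(pops))] += 1
    return counts

sites = [("AA", "CC", "CC"),   # ABB
         ("AC", "AA", "CC"),   # PAB
         ("AC", "AC", "AC"),   # PPP
         ("AA", "AA", "AA"),   # monomorphic: skipped
         ("AC", "A", "CC")]    # second population too small: skipped
result = triconfig_counts(sites)
```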

Basic statistics

\[R = \frac{k-1}{n-1} \tag{1}\]

\[H_e = \left( 1 - \sum_i^k {p_i}^2 \right) \frac{n}{n-1} \tag{2}\]

with:

\(n\), the number of samples (given by ns_site)
\(k\), the number of alleles
\(p_i\), the relative frequency of allele \(i\)
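As a worked illustration of formulas (1) and (2), the following sketch computes both statistics from a list of allele counts. It is plain Python rather than EggLib code, and the function names are hypothetical.

```python
def allelic_richness(counts):
    """Formula (1): (k - 1) / (n - 1), from the copy number of each allele."""
    n = sum(counts)
    k = sum(1 for c in counts if c > 0)
    return (k - 1) / (n - 1)

def expected_heterozygosity(counts):
    """Formula (2): unbiased expected heterozygosity."""
    n = sum(counts)
    homozygosity = sum((c / n) ** 2 for c in counts)
    return (1 - homozygosity) * n / (n - 1)

# Three alleles observed in 6, 3 and 1 copies among n = 10 samples.
counts = [6, 3, 1]
R = allelic_richness(counts)          # 2/9
He = expected_heterozygosity(counts)  # (1 - 0.46) * 10/9 = 0.6
```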

Theta estimators

thetaIAM

\[\hat{\theta}_{IAM} = \frac{H_e}{1 - H_e} \tag{3}\]

thetaSMM

\[\hat{\theta}_{SMM} = \frac{1}{2} \left[ \frac{1}{(1 - H_e)^2} - 1 \right] \tag{4}\]
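Both estimators are simple functions of \(H_e\). The sketch below (plain Python, hypothetical function names) applies formulas (3) and (4) to a given expected heterozygosity.

```python
def theta_iam(he):
    """Formula (3): theta under the infinite allele model."""
    return he / (1 - he)

def theta_smm(he):
    """Formula (4): theta under the stepwise mutation model."""
    return 0.5 * (1 / (1 - he) ** 2 - 1)

he = 0.5
iam = theta_iam(he)  # 0.5 / 0.5
smm = theta_smm(he)  # 0.5 * (4 - 1)
```

Both functions are undefined for \(H_e = 1\), so a caller would need to handle that case separately.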

Fixation index (departure from Hardy Weinberg equilibrium)

\[F_{IS} = 1 - \frac{H_o}{H_e} \tag{5}\]
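A minimal sketch of formula (5), assuming diploid genotypes given as pairs of alleles without missing data. The function names are hypothetical and the code does not rely on EggLib. \(H_o\) is the proportion of heterozygote individuals and \(H_e\) follows formula (2), with \(n\) equal to twice the number of individuals.

```python
from collections import Counter

def observed_heterozygosity(genotypes):
    """Proportion of individuals carrying two different alleles."""
    return sum(1 for a, b in genotypes if a != b) / len(genotypes)

def expected_heterozygosity(genotypes):
    """Formula (2), counting each individual as two samples."""
    alleles = [allele for genotype in genotypes for allele in genotype]
    n = len(alleles)
    homozygosity = sum((c / n) ** 2 for c in Counter(alleles).values())
    return (1 - homozygosity) * n / (n - 1)

def inbreeding_coefficient(genotypes):
    """Formula (5): 1 - Ho / He."""
    return 1 - observed_heterozygosity(genotypes) / expected_heterozygosity(genotypes)

genotypes = [("A", "A"), ("A", "B"), ("B", "B"), ("A", "B")]
Ho = observed_heterozygosity(genotypes)   # 2 heterozygotes out of 4
Fis = inbreeding_coefficient(genotypes)   # 1 - 0.5 / (4/7) = 0.125
```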

Population differentiation

In this section we define:

\(r\)	number of populations
\(n_i\)	sample size of population \(i\)
\(n_t\)	total sample size
\(k\)	number of alleles
\(p_i\)	relative frequency of allele \(i\) in the whole sample
\(p_{ij}\)	relative frequency of allele \(i\) in population \(j\)

and we exclude any populations with fewer than two samples.

\(H_{ST}\) (Hudson et al. Mol. Biol. Evol. 1992 9:138-151) is defined as follows:

\[H_{ST} = 1 - \frac{H_{S_1}}{H_{T_1}} \tag{6}\]

with

\[H_{S_1} = \frac{1}{\sum_i^r (n_i - 2)} \sum_i^r (n_i-2) H_i\]

and

\[H_{T_1} = \frac{n_t}{n_t - 1} \left[ 1-\sum_i^k \left( \frac{1}{n_t}\sum_j^r p_{ij} n_j \right)^2 \right]\]

with:

\[H_i = \frac{n_i}{n_i-1} \left[ 1 - \sum_j^k {p_{ji}}^2 \right]\]
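The computation of \(H_{ST}\) can be sketched as follows in plain Python (hypothetical function names, not EggLib code). Each population is given as a list of allele counts, with alleles in the same order in all populations. The sketch reads the normalising factor of \(H_{S_1}\) as the sum of the weights \(n_i - 2\), which is an interpretation of the formula above, and it is undefined when every population has exactly two samples.

```python
def corrected_diversity(counts):
    """H_i: n / (n - 1) * (1 - sum of squared allele frequencies)."""
    n = sum(counts)
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts))

def hudson_hst(pops):
    """Formula (6) from per-population allele counts."""
    pops = [p for p in pops if sum(p) >= 2]  # drop populations with < 2 samples
    sizes = [sum(p) for p in pops]
    weight_total = sum(n - 2 for n in sizes)
    h_s = sum((n - 2) * corrected_diversity(p)
              for n, p in zip(sizes, pops)) / weight_total
    pooled = [sum(column) for column in zip(*pops)]  # counts in the whole sample
    h_t = corrected_diversity(pooled)
    return 1 - h_s / h_t

fixed = hudson_hst([[4, 0], [0, 4]])       # fixed difference
identical = hudson_hst([[2, 2], [2, 2]])   # same frequencies in both
```

With a fixed difference the statistic equals 1, and with identical frequencies it is slightly negative (here \(-1/6\)) because of the sample size corrections.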

Nei’s \(G_{ST}\) (Hudson et al. Mol. Biol. Evol. 1992 9:138-151) is defined as follows:

\[G_{ST} = 1 - \frac{H_{S_2}}{\tilde{H}_T} \tag{7}\]

with

\[H_{S_2} = \frac{1}{n_t} \sum_i^r n_i H_i\]

and

\[\tilde{H}_T = 1 - \sum_i^k \left( \frac{1}{n_t} \sum_j^r p_{ij} n_j \right) ^2 + \frac{1}{r \cdot \tilde{n}} H_{S_2}\]

with

\[\tilde{n} = \frac{r} {\sum_i^r \frac{1}{n_i}}\]
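As a sketch under the same conventions (populations given as lists of allele counts, hypothetical function names, no EggLib call), \(G_{ST}\) can be computed as follows. It assumes that \(H_i\) is the corrected per-population diversity defined for \(H_{ST}\).

```python
def corrected_diversity(counts):
    """H_i: n / (n - 1) * (1 - sum of squared allele frequencies)."""
    n = sum(counts)
    return n / (n - 1) * (1 - sum((c / n) ** 2 for c in counts))

def nei_gst(pops):
    """Formula (7) from per-population allele counts."""
    pops = [p for p in pops if sum(p) >= 2]  # drop populations with < 2 samples
    r = len(pops)
    sizes = [sum(p) for p in pops]
    n_t = sum(sizes)
    h_s = sum(n * corrected_diversity(p) for n, p in zip(sizes, pops)) / n_t
    n_tilde = r / sum(1 / n for n in sizes)  # harmonic mean of sample sizes
    pooled = [sum(column) / n_t for column in zip(*pops)]
    h_t = 1 - sum(p ** 2 for p in pooled) + h_s / (r * n_tilde)
    return 1 - h_s / h_t

fixed = nei_gst([[4, 0], [0, 4]])
identical = nei_gst([[2, 2], [2, 2]])  # 1 - (2/3) / (7/12) = -1/7
```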

\(G_{ST}'\) (Hedrick Evolution 2005 59:1633-1638) is defined as:

\[G'_{ST} = \frac{1 + H_{S_3}}{1 - H_{S_3}} \left( 1 - \frac{H_{S_3}}{H_{T_2}} \right) \tag{8}\]

with

\[H_{S_3} = \frac{1}{r} \sum_i^r \left( 1 - \sum_j^k {p_{ji}}^2 \right)\]

and

\[H_{T_2} = 1 - \sum_i^k \left( \frac{1}{r} \sum_j^r p_{ij} \right) ^2\]

Jost’s \(D\) (Mol. Ecol. 2008 17:4015-4026) is computed as:

\[D = \frac{r}{r-1} \frac{H_{T_3} - H_{S_4}} {1 - H_{S_4}} \tag{9}\]

with:

\[H_{S_4} = \frac{\tilde{n}}{\tilde{n}-1} H_{S_3}\]

and

\[H_{T_3} = H_{T_2} + \frac{1}{r \cdot \tilde{n}} H_{S_4}\]
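Formulas (8) and (9) share \(H_{S_3}\) and \(H_{T_2}\), so they can be sketched together. The function below is plain Python with a hypothetical name. It takes populations as lists of allele counts and returns both statistics.

```python
def hedrick_gst_and_jost_d(pops):
    """Return (G'st, D) following formulas (8) and (9)."""
    pops = [p for p in pops if sum(p) >= 2]  # drop populations with < 2 samples
    r = len(pops)
    sizes = [sum(p) for p in pops]
    freqs = [[c / n for c in p] for p, n in zip(pops, sizes)]
    # Unweighted averages over populations
    h_s3 = sum(1 - sum(f ** 2 for f in pop) for pop in freqs) / r
    h_t2 = 1 - sum((sum(column) / r) ** 2 for column in zip(*freqs))
    gst_prime = (1 + h_s3) / (1 - h_s3) * (1 - h_s3 / h_t2)
    # Sample size corrections used by Jost's D
    n_tilde = r / sum(1 / n for n in sizes)
    h_s4 = n_tilde / (n_tilde - 1) * h_s3
    h_t3 = h_t2 + h_s4 / (r * n_tilde)
    jost_d = r / (r - 1) * (h_t3 - h_s4) / (1 - h_s4)
    return gst_prime, jost_d

# Two populations of four samples with allele frequencies (0.75, 0.25)
# and (0.25, 0.75).
gst_prime, jost_d = hedrick_gst_and_jost_d([[3, 1], [1, 3]])
```

In this example \(H_{S_3} = 0.375\) and \(H_{T_2} = 0.5\), which gives \(G'_{ST} = 0.55\) and \(D = 0.25\).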

F-statistics estimators

Estimators of F-statistics are based on Weir and Cockerham (Evolution 1984 38:1358-1370) and Weir and Hill (Annu Rev. Genet. 36:721-750).

Different estimators are available depending on which levels of structure are provided through a Structure instance.

Population structure only

If only the population structure is available, only the equivalent of \(F_{ST}\) (\(\hat{\theta}\) in Weir and Cockerham’s notation) is available.

\[n_c = \frac{1}{k - 1} \left( n_t - \frac{1}{n_t} \sum_p^k {n_p}^2 \right)\]

where \(n_p\) is the number of samples of population \(p\), \(n_t\) is the total number of samples, and \(k\) is the number of considered populations. Only populations with at least two samples are considered.

For a given allele \(i\), we compute:

\[\alpha_i = \frac{1}{k-1} \sum_p^k n_p (p_{ip} - \bar{p}_i) ^2\]

\[\delta_i = \frac{1}{n_t-k} \sum_p^k n_p \cdot p_{ip} (1-p_{ip})\]

where \(\bar{p}_i\) is the overall relative frequency of allele \(i\) in the whole sample and \(p_{ip}\) is the relative frequency of allele \(i\) in population \(p\).

The equivalent of \(F_{ST}\) is then computed as:

\[\hat{\theta} = \frac{\sum_i^A (\alpha_i - \delta_i)}{\sum_i^A \left[ \alpha_i + (n_c - 1) \delta_i \right]} \tag{10}\]
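The steps leading to formula (10) can be sketched in plain Python as follows. The function name is hypothetical and the code does not call EggLib. Populations are given as lists of allele counts, and the sums in the numerator and denominator are taken over all alleles.

```python
def weir_cockerham_theta(pops):
    """Haploid estimator of Fst (formula 10) from allele counts."""
    pops = [p for p in pops if sum(p) >= 2]  # only populations with >= 2 samples
    k = len(pops)
    sizes = [sum(p) for p in pops]
    n_t = sum(sizes)
    n_c = (n_t - sum(n ** 2 for n in sizes) / n_t) / (k - 1)
    numerator = 0.0
    denominator = 0.0
    for allele_counts in zip(*pops):  # one iteration per allele
        p_bar = sum(allele_counts) / n_t
        freqs = [c / n for c, n in zip(allele_counts, sizes)]
        alpha = sum(n * (p - p_bar) ** 2 for n, p in zip(sizes, freqs)) / (k - 1)
        delta = sum(n * p * (1 - p) for n, p in zip(sizes, freqs)) / (n_t - k)
        numerator += alpha - delta
        denominator += alpha + (n_c - 1) * delta
    return numerator / denominator

fixed = weir_cockerham_theta([[4, 0], [0, 4]])
partial = weir_cockerham_theta([[3, 1], [1, 3]])
```

For the second call, each allele has \(\alpha_i = 0.5\) and \(\delta_i = 0.25\) with \(n_c = 4\), so the estimate is \(0.5 / 2.5 = 0.2\).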

Population and individual structure

If both population and individual structures are available, inbreeding can be decomposed into three terms: \(F\) (equivalent to \(F_{IT}\)), \(\theta\) (equivalent to \(F_{ST}\)), and \(f\) (equivalent to \(F_{IS}\)). The estimators of these fixation indices are defined below, following Weir and Cockerham (Evolution 1984 38:1358-1370).

The estimators are based on three components of variance, noted \(a\) (between populations), \(b\) (between individuals within populations), and \(c\) (within individuals):

\[a = \sum_i^A \frac{\bar{n}}{n_c} \left\{ s^2_i - \frac{1}{\bar{n}-1} \left[ \bar{p}_i(1-\bar{p}_i) - s^2_i\frac{k-1}{k} - \frac{\bar{h}_i}{4} \right] \right\}\]

\[b = \sum_i^A \frac{\bar{n}}{\bar{n}-1} \left[ \bar{p}_i(1-\bar{p}_i) - s^2_i \frac{k-1}{k} - \bar{h}_i\frac{2\bar{n}-1}{4\bar{n}} \right]\]

\[c = \sum_i^A \frac{1}{2} \bar{h}_i\]

with:

\(A\), the number of alleles
\(k\), the number of populations with at least one individual
\(\bar{n}\), the average number of individuals per population
\(\bar{p}_i\), the relative frequency of allele \(i\) in the whole sample
\(\bar{h}_i\), the proportion of individuals carrying allele \(i\) in the heterozygous state, calculated in the whole sample
\(s^2_i\), as defined below:

\[s^2_i = \frac{1}{(k-1) \bar{n}} \sum_p^k n_p (p_{ip} - \bar{p}_i)^2\]

\(n_c\), as defined below:

\[n_c = \frac{1}{k-1} \left( k \cdot \bar{n} - \frac{1}{k \cdot \bar{n}} \sum_p^k {n_p}^2 \right)\]

\(n_p\), the number of individuals in population \(p\)
\(p_{ip}\), the relative frequency of allele \(i\) in population \(p\)

The return value for FistWC is a tuple with the three F-statistics estimators: \(\left(\hat{f}, \hat{\theta}, \hat{F}\right)\), which are equivalent to \(\left(F_{IS}, F_{ST}, F_{IT}\right)\) and are defined as follows:

\[1 - \hat{f} = \frac{c}{b+c} \tag{11}\]

\[\hat{\theta} = \frac{a}{a+b+c} \tag{12}\]

\[1 - \hat{F} = \frac{c}{a+b+c} \tag{13}\]
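Once the three components of variance are available, formulas (11) to (13) reduce to a few ratios. The sketch below uses a hypothetical function name and arbitrary illustrative values for \(a\), \(b\) and \(c\) (not real data). It returns the estimators in the same order as FistWC.

```python
def f_statistics(a, b, c):
    """Return (f, theta, F) from the components of variance a, b and c."""
    total = a + b + c
    f = 1 - c / (b + c)   # formula (11), equivalent to Fis
    theta = a / total     # formula (12), equivalent to Fst
    F = 1 - c / total     # formula (13), equivalent to Fit
    return f, theta, F

# Arbitrary components chosen for illustration only.
f, theta, F = f_statistics(a=0.10, b=0.05, c=0.35)
```

These definitions satisfy the usual relation \((1 - \hat{F}) = (1 - \hat{f})(1 - \hat{\theta})\), which is a convenient sanity check on any implementation.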

Clusters, population and individual structure

If, in addition, populations are grouped into clusters, it is possible to compute an additional fixation index: the between-population fixation index \(\theta\) (or \(F_{ST}\)) is subdivided into a between-population, within-cluster component \(\theta_1\) (or \(F_{SC}\)) and a between-cluster component \(\theta_2\) (or \(F_{CT}\)). The estimators are based on four components of variance, noted \(a\) (between clusters), \(b_2\) (between populations within clusters), \(b_1\) (between individuals within populations), and \(c\) (within individuals). They are computed as described in Weir and Cockerham (Evolution 1984 38:1358-1370).

\[a = \sum_i^A \frac{n_3 \epsilon_i - n_1 \delta_i - (n_3-n_1) \beta_i} {2 \cdot n_2 \cdot n_3}\]

\[b_2 = \sum_i^A \frac{\delta_i - \beta_i} {2 \cdot n_3}\]

\[b_1 = \sum_i^A \frac{1}{2} (\beta_i - \alpha_i)\]

\[c = \sum_i^A \alpha_i\]

\(\alpha\) (MSG in Weir and Cockerham’s article) is computed as:

\[\alpha_i = \frac{1}{2 n} \sum_p^k h_{ip}\]

\(\beta\) (MSI in Weir and Cockerham’s article) is computed as:

\[\beta_i = \frac{2 \sum_p^k n_p p_{ip} (1-p_{ip}) - \frac{1}{2} \sum_p^k h_{ip}} {n - k}\]

\(\delta\) (MSD in Weir and Cockerham’s article) is computed as:

\[\delta_i = \frac{2}{k - r} \sum_p^k n_p (p_{ip} - p_{ic_p}) ^2\]

\(\epsilon\) (MSP in Weir and Cockerham’s article) is computed as:

\[\epsilon_i = \frac{2}{r-1}\sum_c^r n_c (p_{ic} - p_i) ^2\]

with:

\(k\) number of populations with at least one individual
\(r\) number of clusters with at least one population
\(n\) total number of individuals (in considered populations)
\(n_p\) number of individuals in population \(p\)
\(n_c\) number of individuals in cluster \(c\)
\(p_i\) relative frequency of allele \(i\) in the whole sample
\(p_{ip}\) relative frequency of allele \(i\) in population \(p\)
\(p_{ic_p}\) relative frequency of allele \(i\) in the cluster containing population \(p\)
\(p_{ic}\) relative frequency of allele \(i\) in the cluster \(c\)
\(h_{ip}\) number of heterozygote individuals carrying allele \(i\) in population \(p\)

The return value for FisctWC is a tuple with the four F-statistics estimators: \(\left(\hat{f}, \hat{\theta}_1, \hat{\theta}_2, \hat{F}\right)\), which are equivalent to \(\left(F_{IS}, F_{SC}, F_{CT}, F_{IT}\right)\) and are defined as follows:

\[1 - \hat{f} = \frac{c}{b_1+c} \tag{14}\]

\[\hat{\theta}_1 = \frac{a+b_2}{a+b_2+b_1+c} \tag{15}\]

\[\hat{\theta}_2 = \frac{a}{a+b_2+b_1+c} \tag{16}\]

\[1 - \hat{F} = \frac{c}{a+b_2+b_1+c} \tag{17}\]

Patterson’s f statistics

We implement statistics of Patterson et al. (Genetics 2012 192:1065-1093) as follows.

f2 is only computed if there are two populations, each containing at least two non-missing samples, and at most two alleles. One of the two alleles is chosen arbitrarily. The same requirements apply to f3 with three populations, one of them being designated as focal, and to f4 and Dp with four populations organized in two clusters.

In the equations below, \(n_i\) is the sample size and \(p_i\) is the frequency of an arbitrarily chosen allele, both for population \(i\).

\[f_2 = (p_1 - p_2) ^ 2 - \frac{p_1(1-p_1)}{n_1-1} - \frac{p_2(1-p_2)}{n_2-1} \tag{18}\]

Here is f3 assuming that population \(1\) is focal:

\[f_3 = (p_1 - p_2)(p_1 - p_3) - \frac{p_1(1-p_1)}{n_1-1} \tag{19}\]

For f4 and Dp, populations \(1\) and \(2\) are assumed to belong to one cluster and populations \(3\) and \(4\) to the other one:

\[f_4 = (p_1 - p_2) (p_3 - p_4) \tag{20}\]

\[D_P = \frac{f_4} {(p_1 + p_2 - 2 p_1 p_2)(p_3 + p_4 - 2 p_3 p_4)} \tag{21}\]
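Formulas (18) to (21) translate directly into code. The sketch below is plain Python with hypothetical function names. It takes the frequency of the chosen allele (and, where needed, the sample size) of each population as arguments, and it does not perform the checks on sample sizes and number of alleles described above.

```python
def f2(p1, n1, p2, n2):
    """Formula (18)."""
    return (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)

def f3(p1, n1, p2, p3):
    """Formula (19), population 1 being focal."""
    return (p1 - p2) * (p1 - p3) - p1 * (1 - p1) / (n1 - 1)

def f4(p1, p2, p3, p4):
    """Formula (20), populations (1, 2) and (3, 4) forming the two clusters."""
    return (p1 - p2) * (p3 - p4)

def patterson_d(p1, p2, p3, p4):
    """Formula (21); undefined if either factor of the denominator is zero."""
    denominator = (p1 + p2 - 2 * p1 * p2) * (p3 + p4 - 2 * p3 * p4)
    return f4(p1, p2, p3, p4) / denominator

value_f2 = f2(0.75, 4, 0.25, 4)             # 0.25 - 0.0625 - 0.0625
value_f3 = f3(0.75, 4, 0.25, 0.5)           # 0.125 - 0.0625
value_f4 = f4(0.75, 0.25, 0.5, 0.0)         # 0.5 * 0.5
value_d = patterson_d(0.75, 0.25, 0.5, 0.0)  # 0.25 / (0.625 * 0.5)
```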