Diversity statistics¶

This module contains tools to analyse diversity based on population genetics statistics.

The statistics are defined in a dedicated section.

`egglib.stats.ComputeStats`(...)	Customizable and efficient analysis of diversity data.
`egglib.stats.SFS`(sites[, struct, ...])	Compute the site frequency spectrum.
`egglib.stats.EHH`()	Extended Haplotype Homozygosity statistics.
`egglib.stats.pairwise_LD`(locus1, locus2[, ...])	Linkage disequilibium between a pair of loci.
`egglib.stats.matrix_LD`(align, stats[, ...])	Linkage disequilibrium statistics between all pairs of sites.
`egglib.stats.ProbaMisoriented`([align, struct])	Error rate of polymorphism orientation using an outgroup.
`egglib.stats.haplotypes_from_align`(aln[, ...])	Identification of haplotypes from an alignment.
`egglib.stats.haplotypes_from_sites`(sites[, ...])	Identification of haplotypes from a list of sites.
`egglib.stats.CodingDiversity`(args, *kwargs)	Detection of synonymous and non-synonymous sites.
`egglib.stats.paralog_pi`(align, struct_p, ...)	Compute diversity statistics for a gene family.
`egglib.stats.ParalogPi`()	Compute diversity statistics for a gene family.

class egglib.stats.ComputeStats(...)[source]¶

Customizable and efficient analysis of diversity data. This class is designed to minimize redundancy of underlying analyses so it is best to compute as many statistics as possible with a single instance. It also takes advantage of the object reuse policy, improving the efficiency of analyses when several (and especially many) datasets are examined in a row by the same instance of ComputeStats.

The constructor takes arguments that are automatically passed to the configure() method.

Statistics to compute are set using the method add_stats() which allows specifying several statistics at once and which can also be called several times to add more statistics to compute. It is possible to set at once all statistics of four pre-defined groups by passing the following codes (including the leading +: +site (Single site statistics) and/or +unphased (Unphased sites statistics) and/or +phased (Phased sites statistics), and/or +allelesize (Allele size statistics).

Added in version 3.3: three-population site configuration statistic (triconfig).

Added in version 3.4: admixture statistics (f2, f3, f4 and Dp).

Added in version 3.5: groups of statistics.

Methods

`add_stats`(stat, ...)	Add one or more statistics to compute.
`all_stats`()	Add all possible statistics.
`clear_stats`()	Clear the list of statistics to compute.
`configure`([struct, multi, multi_hits, maf, ...])	Configure the instance.
`list_stats`()	List of available statistics.
`process_align`(align[, positions, max_missing])	Analyze an alignment.
`process_freq`(frq[, position])	Analyze already computed frequencies.
`process_site`(site[, position])	Analyze a site.
`process_sites`(sites[, positions])	Analyze a list of sites.
`reset`()	Reset all currently computed statistics.
`results`()	Get computed statistics.
`set_structure`(struct)	Set the structure object.
`stats_group`(grp)	Return a copy of the list of names of statistics belonging to a given group.

add_stats(stat, ...)[source]¶: Add one or more statistics to compute. Every statistic identifier must be among the list of available statistics or one of the four statistics group identifiers (which start with a + character), regardless of what data is to be analyzed. If statistics cannot be computed, they will be returned as None.

Changed in version 3.5: Support stats groups.

all_stats()[source]¶: Add all possible statistics. Those who cannot be computed will be reported as None. Also reset all currently computed statistics (if any). Note that allele size statistics require integer alleles, so use of this method would result in an error if DNA-alphabet data are loaded. It is preferable to use add_stats('+site', '+unphased', '+phased') for adding all statistics compatible with all types of data.

clear_stats()[source]¶: Clear the list of statistics to compute.

configure(struct=None, multi=False, multi_hits=False, maf=0.0, LD_min_n=2, LD_max_maj=1.0, LD_multiallelic=0, LD_min_freq=0, Rmin_oriented=False, triconfig_min=2, f3_focus=None)[source]¶

Configure the instance. The values provided for parameters will affect all subsequent analyses. Calling this method resets all statistics.

Parameters:

struct – a Structure instance describing the structure of objects that will be passed.
multi – process several alignments, set of sites, or sites in a row and only yield statistics when results() is called (each of the method considered will return None).
multi_hits – allow multiple mutations at the same site (some statistics, like Rmin or Wall’s B and Q, never consider multiple mutations regardless of this option).
maf – minimum minor allele frequency (sites for which the relative minority allele frequency is lower than this will be excluded).
LD_min_n – Minimal number of non-missing samples. Allows to specify a more stringent filter than max_missing. Only considered for calculating Rozas et al.’s and Kelly’s statistics (the most stringent of the two criteria applies).
LD_max_maj – Maximal relative frequency of the main allele. Only considered for calculating Rozas et al.’s and Kelly’s statistics.
LD_multiallelic – One of 0 (ignore them), 1 (use main allele only), and 2 (use all possible pairs of alleles). Defines what is done for pairs of sites of which one or both have more than two alleles (while computing linkage disequilibrium). In case of option 2, a filter can be applied with option LD_min_freq. Only considered for calculating Rozas et al.’s and Kelly’s statistics.
LD_min_freq – Only considered if option 2 is used for LD_multiallelic. Only consider alleles that are in absolute frequency equal to or larger than the given value. Only considered for calculating Rozas et al.’s and Kelly’s statistics
Rmin_oriented – Only for computing Rmin: use only orientable sites.
triconfig_min – Only for site configurations with three populations: minimal number of available sample per population.
f3_focus – Specify the label of the focal population (required for computing f3).

Added in version 3.3: triconfig_min argument.

Added in version 3.3: f3_focus argument.

list_stats()[source]¶: List of available statistics. Returns a list of tuples giving, for each available statistic, its code and a short description.

process_align(align, positions=None, max_missing=0.0)[source]¶

Analyze an alignment.

Parameters:

align – an Align instance.
positions – a list, or other sequence of positions for all sites, or None. If None, use the index of each site. Otherwise, must be a sequences of integer values (length matching the number of sites).
max_missing –
Maximum relative proportion of missing data (excluding outgroup). The default is to exclude all sites with missing data. Missing data include all ambiguity characters and alignment gaps. Sites not passing this threshold are ignored for all analyses.

Note

Here, max_missing is a relative proportion to allow using the same value for different alignments that might not have the same number of samples (to avoid reevaluating max_missing if the user wants the same maximum rate of missing data). In other functions, max_missing is the maximum number of missing and is an integer.

Returns:

A dictionary of statistics, unless multi is set.

process_freq(frq, position=None)[source]¶

Analyze already computed frequencies.

Parameters:

frq – a Freq instance.
position – position of the site. Must be an integer value. By default, use the site loading index.

Returns:

A dictionary of statistics, unless multi is set.

process_site(site, position=None)[source]¶

Analyze a site.

Parameters:

site – a Site instance.
position – position of the site. Must be an integer value. By default, use the site index.

Returns:

A dictionary of statistics, unless multi is set.

process_sites(sites, positions=None)[source]¶

Analyze a list of sites.

Parameters:

sites – a list, or other sequence of Site instance. (VcfWindow instances with a number of sites greater than 0 are also supported.)
positions – a list, or other sequence of positions for all sites, or None. If None, use the index of each site. Otherwise, must be a sequences of integer values (length matching the length of sites).

Returns:

A dictionary of statistics, unless multi is set.

reset()[source]¶: Reset all currently computed statistics. Keep the list of statistics to compute.

results()[source]¶: Get computed statistics. Return the value of statistics for all sites since the last call to this method, to reset(), or any addition of statistics, or the object creation, whichever is most recent. For statistics that can not be computed, None is returned.

set_structure(struct)[source]¶: Set the structure object. Reset statistics but don’t change the list of statistics to compute and don’t change over parameter values.

stats_group(grp)[source]¶: Return a copy of the list of names of statistics belonging to a given group. If the group name is invalid, raise a KeyError.

Added in version 3.5.

egglib.stats.SFS(sites, struct=None, max_missing=0.0, mode='auto', nbins=None, skip_fixed=False)[source]¶

Compute the site frequency spectrum. This function processes an list of sites (or any iterator yielding Site instances) and computes the distribution of frequencies of either the minority or the derived allele of diallelic sites.

Parameters:

sites – an iterator over Site instances. Unless nbins is specified, all sites are required to have the same number of samples.
struct – a Structure instance defining the samples to process and/or the outgroup samples (necessary for the unfolded spectrum).
max_missing – maximum relative proportion of missing data in the ingroup to process a site. Warning: when using this option without binarization (absolute frequencies), the results might be unreliable as sites with missing data have less samples.
mode – indicate whether the SFS should be folded or unfolded (possible only when outgroup is present in struct). auto: as implied by the outgroup; folded: folded (even if there is an outgroup); unfolded: unfolded (causing an error if no outgroup is present).

Params nbins:

number of bins on which to calculate the SFS. If None, the full SFS is returned as a list.

Params skip_fixed:

boolean indicating whether fixed sites should be counted of not. If set to True with a full, folded SFS (nbins=None with no outgroup used), the first and last count values are set to None but kept in the returned list.

Returns:

If bins are specified: a list of (bound, count) tuples where bound is the top limit of each bin, which is inclusive, and count is the number of sites falling in this category. Otherwise (nbins=None): a list containing the absolute SFS, giving the number of sites corresponding to all possible frequencies ranging from 0 to n/2+1 (if folded) or n-1 (unfolded), inclusive.

Added in version 3.4.

class egglib.stats.EHH[source]¶

Extended Haplotype Homozygosity statistics. Statistics can be computed for unphased genotypic data.

The usage of this class is to: to first set the core haplotypes by passing a Site to set_core() and then to load repetitively distant sites (always with increasing distance from the core), through load_distant(), until the list of sites to process is exhausted or one of the thresholds has been reached.

In order to process distant sites using the same core region to the opposite direction, or to use a different core region, it is always required to call set_core() again (this is the only way to reset the instance).

After at least one distant site site is loaded, EHH statistics can be accessed using there accessors. EHH statistics fell into four categories:

Raw EHH statistics, provided for each loaded distant sites and separately for each core haplotype. See the methods: get_EHH(), get_EHHc(), and get_rEHH().

Reference

Sabeti P.C., D.E. Reich, J.M. Higgins, H.Z.P. Levine, D.J. Richter, S.F. Schaffner, S.B. Gabriel, J.V. Platko, N.J. Patterson, G.J. McDonald, H.C. Ackerman, S.J. Campbell, D. Altshuler, R. Cooper, D. Kwiatkowski, R. Ward & E.S. Lander. 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832-837.
Integrated EHH statistics, computed separately for each core haplotype and incremented at each provided sites until the EHH value reaches the thresholds provided as the EHH_thr and EHHc_thr arguments to set_core(). See the methods get_iHH(), get_iHHc(), and get_iHS(). The methods done_EHH() and done_EHHc() allow to check whether the threshold has been reached for all genotypes.

Reference

Voight B.F., S. Kudaravalli, X. Wen & J.K. Pritchard. 2006. A map of recent positive selection in the human genome. PLoS Biol 4: e772.
Whole-site EHHS statistic and its integrated version iES, which is incremented as long that the EHHS value is larger than or equal to the threshold provided as the EHHS_thr argument to set_core(). See the methods get_EHHS() and get_iES(). If data are unphased, a specific EHHS estimate based on homozygosity is provided (EHHG, with its iEG integrated counter-part).

Reference

Tang K., K.R. Thornton & M. Stoneking. 2007. A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol. 5: e171.
EHH, EHHS, and EHHG decay statistics, that give the minimal distance at which, respectively, EHH starts to be smaller than the threshold provided as the EHH_thr argument to set_core(), EHHS starts to be smaller than the threshold provided as the EHHS_thr argument to set_core(), and EHHG starts to be smaller than the threshold provided as the EHHG_thr argument to set_core(). EHH decay is computed separately for each core haplotype. See the methods get_dEHH(), get_dEHHS(), and get_dEHHG(). The values are not available until the respective threshold is reached (None is returned). The method done_EHH() allows to check whether the threshold for the EHH decay has been reached for all core haplotypes. The maximum and average value of the EHH decay across all core haplotypes can be accessed with get_dEHH_max() and get_dEHH_mean(), respectively.

Reference

Ramu00EDrez-Soriano A., S.E. Ramos-Onsins, J. Rozas, F. Calafell & A. Navarro. 2008. Statistical power analysis of neutrality tests under demographic expansions, contractions and bottlenecks with recombination. Genetics 179 : 555-567.

In all cases, None is returned when the value is not available (no data loaded, division by zero in the case of ratios, or threshold not reached in the case of decay statistics).

The thresholds must all be within the range between 0 and 1.

Attributes

`cur_haplotypes`	Current number of haplotypes.
`nsam`	Current number of non-missing samples.
`num_haplotypes`	Number of core haplotypes.

Methods

`done_EHH`()	Completed EHH calculation.
`done_EHHc`()	Completed EHHc calculation.
`get_EHH`(i)	Get an EHH value.
`get_EHHG`()	Get EHHG value.
`get_EHHS`()	Get EHHS value.
`get_EHHc`(i)	Get an EHHc value.
`get_dEHH`(i)	EHH decay.
`get_dEHHG`()	EHHG decay.
`get_dEHHS`()	EHHS decay.
`get_dEHH_max`()	Maximum EHH decay.
`get_dEHH_mean`()	Average EHH decay.
`get_dEHHc`(i)	EHHc decay.
`get_iEG`()	Get iEG value.
`get_iES`()	Get iES value.
`get_iHH`(i)	Get an iHH value.
`get_iHHc`(i)	Get an iHHc value.
`get_iHS`(i)	Get an iHS value.
`get_rEHH`(i)	Get a rEHH value.
`load_distant`(site)	Process a distant site.
`nsam_core`(hap)	Number of samples for a core haplotype.
`nsam_hap`(hap)	Number of samples for a haplotype.
`set_core`(site[, struct, min_freq, min_sam, ...])	Specify the core site.

property cur_haplotypes¶: Current number of haplotypes. None if core has not been set, equal to num_haplotypes if no distant site has been loaded.

done_EHH()[source]¶: Completed EHH calculation. Return True if the values of iHH for all core haplotypes have finished integrating (and all dEHH values have been evaluated).

done_EHHc()[source]¶: Completed EHHc calculation. Return True if the values of iHHc for all core haplotypes have finished integrating (and all dEHH values have been evaluated).

get_EHH(i)[source]¶: Get an EHH value. Get the EHH value for the last processed distant site for core haplotype i. Return None if the value cannot be computed (no available samples).

get_EHHG()[source]¶: Get EHHG value. Get the EHHS (computed with genotypes) value for the last processed distant site. Return None if the value cannot be computed (no available samples, no indiv_struct option used).

get_EHHS()[source]¶: Get EHHS value. Get the EHHS value for the last processed distant site. Return None if the value cannot be computed (no available samples).

get_EHHc(i)[source]¶: Get an EHHc value. Get the EHHc value for the last processed distant site for core haplotype i. Return None if the value cannot be computed (no available samples).

get_dEHH(i)[source]¶: EHH decay. Get the EHH decay distance for core haplotype i. Return None if the EHH threshold has not been reached.

get_dEHHG()[source]¶: EHHG decay. Get the EHHS (computed with genotypes) decay distance. Return None if the EHHG threshold has not been reached.

get_dEHHS()[source]¶: EHHS decay. Get the EHHS decay distance. Return None if the EHHS threshold has not been reached.

get_dEHH_max()[source]¶: Maximum EHH decay. Get the maximum EHH decay distance across core haplotypes. Return None if the EHH threshold has not been reached for at least one of the core haplotypes.

get_dEHH_mean()[source]¶: Average EHH decay. Get the average EHH decay distance across core haplotypes. Return None if the EHH threshold has not been reached for at least one of the core haplotypes.

get_dEHHc(i)[source]¶: EHHc decay. Get the EHHc decay distance for core haplotype i. Return None if the EHHc threshold has not been reached.

get_iEG()[source]¶: Get iEG value. Get the iES (computed with genotypes) value for the last processed distant site. Return None if the value cannot be computed (no available samples or indiv_struct option was not used).

get_iES()[source]¶: Get iES value. Get the iES value for the last processed distant site. Return None if the value cannot be computed (no available samples).

get_iHH(i)[source]¶: Get an iHH value. Get the iHH value for the last processed distant site for core haplotype i . Return None if the value cannot be computed (no available samples).

get_iHHc(i)[source]¶: Get an iHHc value. Get the iHHc value for the last processed distant site for core haplotype i. Return None if the value cannot be computed (no available samples).

get_iHS(i)[source]¶: Get an iHS value. Get the iHS value for the last processed distant site for core haplotype i. Return None if the ratio cannot be computed (no available sample or division by zero).

get_rEHH(i)[source]¶: Get a rEHH value. Get the rEHH value for the last processed distant site for core haplotype i. Return None if the ratio cannot be computed (no available sample or division by zero).

load_distant(site)[source]¶

Process a distant site. The core site must have been specified. The site must have a specified position.

Parameters:: site – a Site instance containing data for the distant site. It must be consistent with the core site (same list of samples, and in the same order). Missing data are supported.

property nsam¶: Current number of non-missing samples.

nsam_core(hap)[source]¶: Number of samples for a core haplotype. Number of non-missing samples for one of the core haplotypes.

nsam_hap(hap)[source]¶: Number of samples for a haplotype. Number of non-missing samples for one of the current haplotypes.

property num_haplotypes¶: Number of core haplotypes. None if core has not been set.

set_core(site, struct=None, min_freq=None, min_sam=2, EHH_thr=None, EHHc_thr=None, EHHS_thr=None, EHHG_thr=None, crop_EHHS=False)[source]¶

Specify the core site. If the instance already contains data, it will be reset.

Parameters:

site – a Site instance. The site must have a specified position. It is assumed that data are phased (samples should be phased; but it is possible to load genotypes whose phase is unknown, as long as the genotypes are phased at the individual level: see indiv_struct).
struct – This argument must be set to a non-None value if genotypic data are provided and if unphased versions of statistics should be computed. If not None, this argument should either be a Structure object mapping samples to individuals (other structure levels are not considered), or an integer specifying the ploidy (in the latter case, samples of any individuals are supposed to be consecutive in the site. It is still required that individuals are phased (that is, that they are entered in the same order in all distant sites). The integer 0 is accepted as a synonym of None (data will be treated as haploid).
min_freq – minimal absolute frequency for haplotypes (haplotypes with lower frequencies are ignored). By default, all haplotypes are considered.
min_sam – minimal absolute number of non-missing samples to compute statistics or stop incrementing them (for statistics expressed per core haplotype, number of samples for this haplotype).
EHH_thr – threshold determining when iHH should stop integrating and dEHH must be evaluated. By default (None), iHH is permanently incremented and dEHH is not evaluated at all.
EHHc_thr – threshold determining when iHH should stop integrating and dEHH must be evaluated. By default (if None), use the same value as for EHH_thr.
EHHS_thr – threshold determining when iES should stop integrating and dEHHS must be evaluated. By default (None), iES is permanently incremented and dEHHS is not evaluated at all.
EHHG_thr – threshold determining when iEG (iES computed for unphased data) should stop integrating and dEHHG (equivalent to dEHHS) must be evaluated. Must be None if genotypes is True. By default (None), iEG is permanently incremented and dEHHG is not evaluated at all. By default: equal to EHHS_thr.
crop_EHHS – if True, set values of EHHS that are below the threshold to 0 to emulate the behaviour of the R package rehh (also affects iES).

egglib.stats.pairwise_LD(locus1, locus2, struct=None, multiple_policy='main', min_freq=0)[source]¶

Linkage disequilibium between a pair of loci.

Site instances must be used to describe loci. Only the order of samples is considered and both loci must have the same number of samples.

Parameters:

locus1 – A Site instance.
locus2 – A Site instance.
struct – A Structure object used to describe the individual level. If specified, genotypes are automatically described and genotypic linkage is processed. By default, consider all samples as originating from haploid individuals.
multiple_policy – Determine what is done if either input locus has more than two alleles. Possible values are, "forbid" (raise an exception if this occurs), "main" (take the most frequent allele of each locus) and "average" ( compute the unweighted average over all possible pair of alleles). More options might be added in future versions. This option is ignored if both loci have less than three alleles. If "main" and there are several equally most frequent alleles, the first-occurring one is used (arbitrarily).
min_freq – Only used if at least one site has more than two alleles and multiple_policy is set to average. Set the minimum absolute frequency to consider an allele.

Returns:

A dict of linkage disequilibrium statistics. In case statistics cannot be computed (either site fixed, or less than two samples with non-missing data at both sites), computed values are replaced by None. n gives the number of pairs of alleles considered (0 if statistics are not computed at all).

egglib.stats.matrix_LD(align, stats, struct=None, multiple_policy='main', min_freq=0, min_n=2, max_maj=1.0, positions=None)[source]¶

Linkage disequilibrium statistics between all pairs of sites. The computed statistics are selected by an argument of this function. Return a matrix (as a nested list) of the requested statistics. In all cases, all pairs of sites are present in the returned matrices. If statistics cannot be computed, they are replaced by None.

The available statistics are:

d – Distance between sites of the pairs.
D – Standard linkage disequilibrium.
Dp – Lewontin’s D’.
r – Correlation coefficient.
rsq – Equivalent to r².

Parameters:

align – An Align instance.
stats – Requested statistic or statistics (see list of available statistics above, as a single string or as a list of one or more of these statistics (in any order).
struct – A Structure object used to describe the individual level. If specified, genotypes are automatically described and genotypic linkage is processed. By default, consider all samples as originating from haploid individuals.
multiple_policy – Specify what is done for pairs of sites for which at least one locus has only one allele. See pairwise_LD() for further description.
min_freq – Only used if at least one site has more than two alleles and depending on the value of multiple_policy. See pairwise_LD() for further description.
min_n – Minimum number of samples used (this value must always be larger than 1). Sites not fulfilling this criterion will be dropped.
max_maj – Maximum relative frequency of the majority allele. Sites not fulfilling this criterion will be dropped.
positions – A sequence of positions, whose length must match the number of sites of the provided alignment. Used in the return value to describe the used sites, and, if requested, to compute the distance between sites. By default, the position of sites in the original alignment is used.

Returns:

A tuple with two items: first is the list of positions of sites used in the matrix (a subset of the sites of the provided alignment), with positions provided by the corresponding argument (by default, the index of sites); second is the matrix, as the nested lower half matrix. The matrix contains items for all i and j indexes with 0 <= j <= i < n where n is the number of retained sites. The content of the matrix is represented by a single value (if a single statistic has been requested) or as a list of 1 or more values (if a list of 1 or more, accordingly, statistics have been requested), or None for the diagonal or if the pairwise comparison was dropped for any reason.

class egglib.stats.ProbaMisoriented(align=None, struct=None)[source]¶

Error rate of polymorphism orientation using an outgroup.

Parameters:

align – an Align containing the sites to analyse.
struct – a Structure instance.

Only sites that are either variable within the ingroup or have a fixed difference with respect to the outgroup are considered. Sites with more than two different alleles in the ingroup, or more than one allele in the outgroup, are ignored.

This function is an implementation of the method mentioned in Baudry and Depaulis (2003), allowing to estimate the probability that a site oriented using the provided outgroup have be misoriented due to a homoplasic mutation in the branch leading to the outgroup. Note that this estimation neglects the probability of shared polymorphism.

If the instance is created with an alignment as constructor argument, then the statistics are computed. The method load_align() does the same from an existing ProbaMisoriented instance. Otherwise, individual sites can be loaded with load_site(), and then the statistics can be computed using compute() (the latter is preferable if generate Freq instances for an other use).

Reference

Baudry, E. & F. Depaulis. 2003. Effect of misoriented sites on neutrality tests with outgroup. Genetics 165: 1619-1622.

Added in version 3.0.0.

Attributes

`D`	Number of loaded sites with a fixed difference to the outgroup.
`S`	Number of loaded polymorphic sites.
`TiTv`	Transition and transversion rate ratio.
`pM`	Probability of misorientation.

Methods

`compute`()	Compute `pM` and `TiTv` statistics.
`load_align`(align, struct)	Load all sites of align that meet criteria.
`load_site`(freq)	Load a single site.
`reset`()	Clear all loaded or computed data.

property D¶: Number of loaded sites with a fixed difference to the outgroup.

property S¶: Number of loaded polymorphic sites. Only the ingroup is considered to classify a site as polymorphic.

property TiTv¶: Transition and transversion rate ratio. None if the value cannot be computed (no loaded data or null transversion rate). Requires that compute() has been called.

compute()[source]¶: Compute pM and TiTv statistics. Requires that sites have been loaded using load_site(). This method does not reset the instance.

load_align(align, struct)[source]¶

Load all sites of align that meet criteria. If there are previously loaded data, they are discarded. This method computes statistics. Data are required to be DNA sequences.

Parameters:

align – an Align instance.
struct – a Structure instance.

load_site(freq)[source]¶

Load a single site. If there are previously loaded data, they are retained. To actually compute the misorientation probability, the user must call compute().

Parameters:: freq – a Freq instance.

Warning

The origin of data must be DNA sequences. This is currently not checked here.

property pM¶: Probability of misorientation. None if the value cannot be computed (no loaded data, no valid polymorphism, null transversion rate). Requires that compute() has been called.

reset()[source]¶: Clear all loaded or computed data.

egglib.stats.haplotypes_from_align(aln, max_missing=0.0, impute_threshold=0, struct=None, dest=None, multiple=False)[source]¶

Identification of haplotypes from an alignment. Identify haplotypes an Align instance and return data as a single Site instance containing one sample for each sample of the original data. Alleles in the returned site are representing all identified haplotypes (or missing data when the haplotypes could not be derived).

Note

There must be at least one site with at least two alleles (overall, including the outgroup), otherwise the produced site only contains missing data.

Parameters:

aln – an Align instance.
impute_threshold – by default, all samples with a least one occurrence of missing data will be treated as missing data. If this argument is more than 0, the provided value will be used as maximum number of missing data. All samples with as many or less missing data will be processed to determine which extant of haplotype they might belong (to which they are identical save for missing data). If there is only one such haplotype, the corresponding samples will be treated as a repetition of this haplotype. This option will never allow detecting new haplotypes. Only small values of this option make sense.
struct – a Structure instance defining the samples to process. The population and cluster structures are not used. If the ploidy is larger than 1, the individuals are used, and sites are assumed to be phased.
max_missing –
maximum relative proportion of missing data in the ingroup to process a site.

Note

Here, max_missing is a relative proportion to allow using the same value for different alignments that might not have the same number of samples (to avoid reevaluating max_missing if the user wants the same maximum rate of missing data). In other functions, max_missing is the maximum number of missing and is an integer.
dest – a Site instance that will be reset and in which data will be placed. If specified, this function returns nothing. By default, the function returns a new Site instance.
multiple – allow sites with more than two alleles in the ingroup.

Returns:

A Site instance, (if dest is None) or None (otherwise).

Note

If an empty list of sites is provided and no structure is provided, a ValueError is raised. To support smoothly possibly empty lists, the used is prompted to provide a structure.

egglib.stats.haplotypes_from_sites(sites, impute_threshold=0, struct=None, dest=None, multiple=False)[source]¶: Identification of haplotypes from a list of sites. Similar to haplotypes_from_align() but takes a list of Site instances. No option max_missing is available; all sites are always considered.

class egglib.stats.CodingDiversity(*args, **kwargs)[source]¶

Detection of synonymous and non-synonymous sites. This class processes alignments with a reading frame specification in order to detect synonymous and non-synonymous variable positions. It provides basic statistics, but it can also filter data to let the user compute all other statistics on synonymous-only, or non-synonymous-only variation (e.g. \(\pi\) or D).

The constructor takes optional arguments. By default, build an empty instance. If arguments are passed, they must match the signature of process() that will be called.

Note that codons with missing data are never considered, even if the resolution of all possibilities consistent translate to the same nucleotide. This applies to stop and start codons. Alternative start codons are not considered either.

Attributes

`num_codons_eff`	Number of analysed codon sites.
`num_codons_stop`	Number of codon sites with at least one codon stop.
`num_codons_tot`	Total number of considered codon sites.
`num_multiple_alleles`	Number of polymorphic coding sites with more than two alleles.
`num_multiple_hits`	Number of polymorphic codons for which more than one position is changed.
`num_pol_NS`	Number of polymorphic codon sites with only non-synonymous variation.
`num_pol_S`	Number of polymorphic codon sites with only synonymous variation.
`num_pol_single`	Number of polymorphic coding sites with only one mutation.
`num_sites_NS`	Estimated number of non-synonymous sites.
`num_sites_S`	Estimated number of synonymous sites.
`positions_NS`	List of positions of sites with only non-synonymous variation
`positions_S`	List of positions of sites with only synonymous variation
`sites_NS`	List of sites with only non-synonymous variation
`sites_S`	List of sites with only synonymous variation

Methods

process(align[, code, struct, max_missing, ...])

Process an alignment.

property num_codons_eff¶: Number of analysed codon sites. Like num_codons_tot but excluding sites rejected because of missing data.

property num_codons_stop¶: Number of codon sites with at least one codon stop.

property num_codons_tot¶: Total number of considered codon sites. Only complete codons have been considered, but this value includes codons that have been rejected because of missing data.

property num_multiple_alleles¶: Number of polymorphic coding sites with more than two alleles. These sites are included only if multiple_alleles is True except those who mix synonymous and non-synonymous changes (they can be rejected if there are more than two alleles in total as well).

property num_multiple_hits¶: Number of polymorphic codons for which more than one position is changed. These sites are included only if multiple_hits is True and depdenting on the total number of alleles.

property num_pol_NS¶: Number of polymorphic codon sites with only non-synonymous variation.

property num_pol_S¶: Number of polymorphic codon sites with only synonymous variation.

property num_pol_single¶: Number of polymorphic coding sites with only one mutation. All these sites are always included.

property num_sites_NS¶

Estimated number of non-synonymous sites. Note that the total number of sites per codon is always 3.

The numbers of non-synonymous and synonymous sites are estimated using the method of Nei & Gojobori (Mol. Biol. Evol. 1986 3:418-426).

property num_sites_S¶: Estimated number of synonymous sites. Note that the total number of sites per codon is always 3.

property positions_NS¶: List of positions of sites with only non-synonymous variation

property positions_S¶: List of positions of sites with only synonymous variation

process(align, code=1, struct=None, max_missing=0.0, skipstop=True, raise_stop=False, multiple_alleles=False, multiple_hits=False)[source]¶

Process an alignment. It this instance already had data in memory, they will all be erased.

Parameters:

align – an Align instance containing the coding sequence to process. The alphabet must be codons.
code – genetic code identifier (see here). Required to be an integer among the valid values. The default value is the standard genetic code.
struct – a class:Structure object specifying samples to include and outgroup samples to skip for analysis (outgroup samples are ignored to determine if site is polymorphic, coding, or non-coding, but included in generated sites). By default, use all samples.
max_missing –
maximum relative proportion of missing data (per codon site) to allow (including stop codons if skipstop if true). By default, all codon sites with any missing data are excluded. Outgroup is not considered.

Note

Here, max_missing is a relative proportion to allow using the same value for different alignments that might not have the same number of samples (to avoid reevaluating max_missing if the user wants the same maximum rate of missing data). In other functions, max_missing is the maximum number of missing and is an integer.
skipstop – if True, stop codons are treated as missing data and skipped. If so, potential mutations to stop codons are not taken into account when estimating the number of non-synonymous sites. Warning (this may be counter-intuitive): it actually assumes that stop codons are not biologically plausible and considers them as missing data. On the other hand, if skipstop is False, it considers stop codons as if they were valid amino acids. This option has no effect if raise_stop is True.
raise_stop – raise a ValueError if a stop codon is met. If True, skipstop has no effect. Outgroup is not considered.
multiple_alleles – include coding sites with more than two alleles (regardless of whether mutations hit the same position within the triplet). If there are more than two different codons, they must either encode for the same amino acid or all encode for different amino acids (otherwise the site is rejected). If more than one of the three codon position are mutated, the option multiple_hits is considered.
multiple_hits – include coding sites for which more than one of the three positions has changed (regardless of the number of alleles). If there are more than two alleles, the option multiple_alleles is also considered.

property sites_NS¶: List of sites with only non-synonymous variation

property sites_S¶: List of sites with only synonymous variation

egglib.stats.paralog_pi(align, struct_p, struct_i, max_missing=0.0)[source]¶

Compute diversity statistics for a gene family. Based on Innan (Genetics 2003 163:803-810). An estimate of genetic diversity is provided for every paralog and for every pair of paralogs, provided that enough non-missing data is available (at least 2 samples are required). Note that sites with more than two alleles are always considered.

Parameters:

align – an Align containing the sequence of all available paralog for all samples. The outgroup is ignored.
struct_p – a Structure providing the organisation of sequences in paralogs. This structure must have a ploidy of 1 (no individual structure). Clusters, if defined, are ignored. The sequences of all individuals for a given paralog should be grouped together in that structure. There might be a different number of sequences per paralog due to missing data.
struct_i – a Structure providing the organisation of sequences in individuals. This structure must have a ploidy of 1 (no individual structure). Clusters, if defined, are ignored. The sequences of all paralogs for a given individual should be grouped together in that structure. There might be a different number of sequences per individual due to missing data.
max_missing –
maximum relative proportion of missing data (if there are more missing data at a site, the site is ignored altogether).

Note

Here, max_missing is a relative proportion to allow using the same value for different alignments that might not have the same number of samples (to avoid reevaluating max_missing if the user wants the same maximum rate of missing data). In other functions, max_missing is the maximum number of missing and is an integer.

Returns:

A new ParalogPi instance which provides methods to access the number of used sites and the diversity for each paralog/paralog pair.

class egglib.stats.ParalogPi[source]¶

Compute diversity statistics for a gene family. See paralog_pi() for more details. This class can be used directly (1) to analyse data with more efficiency (by reusing the same instance) or (2) to combine data from different alignments, or (3) for pass individual sites. Do first call setup().

Methods

`Pib`(i, j)	Between-paralog diversity for paralogs i and j.
`Piw`(i)	Within-paralog diversity for paralog i.
`num_sites`([i[, j]])	Number of sites with data.
`process_align`(aln)	Process an alignment.
`process_site`(site)	Process a site.
`setup`(struct_p, struct_i[, max_missing])	Specify the structure in paralog and individuals.

Pib(i, j)[source]¶: Between-paralog diversity for paralogs i and j.

Piw(i)[source]¶: Within-paralog diversity for paralog i.

num_sites([i[, j]])[source]¶: Number of sites with data. Number of sites: with any data (without arguments), with data for paralog i (if only i specified), with data for the pair of paralogs i and j (if both specified). This counts the number of sites which have not been excluded based on the num_missing argument, and which have at least one pair of samples for the considered sample.

process_align(aln)[source]¶

Process an alignment. The alignment must match the structure passed to setup(). Diversity estimates are incremented (no reset).

Parameters:: aln – an Align instance.

process_site(site)[source]¶

Process a site. The site must match the structure passed to setup(). Diversity estimates are incremented (no reset).

Parameters:: site – a Site instance.

setup(struct_p, struct_i, max_missing=0.0)[source]¶: Specify the structure in paralog and individuals. The arguments are as as described for paralog_pi(). Only this method resets the instance.