# Diversity statistics¶

This module contains tools to analyse diversity based on population genetics statistics.

The statistics are defined in a dedicated section.

 Customizable and efficient analysis of diversity data. Extended Haplotype Homozygosity statistics. egglib.stats.pairwise_LD(locus1, locus2[, ...]) Linkage disequilibium between a pair of loci. egglib.stats.matrix_LD(align, stats[, ...]) Linkage disequilibrium statistics between all pairs of sites. egglib.stats.ProbaMisoriented([align, struct]) Error rate of polymorphism orientation using an outgroup. egglib.stats.haplotypes_from_align(aln[, ...]) Identification of haplotypes from an alignment. egglib.stats.haplotypes_from_sites(sites[, ...]) Identification of haplotypes from a list of sites. egglib.stats.CodingDiversity(*args, **kwargs) Detection of synonymous and non-synonymous sites. egglib.stats.paralog_pi(align, struct_p, ...) Compute diversity statistics for a gene family. Compute diversity statistics for a gene family.
class egglib.stats.ComputeStats(...)[source]

Customizable and efficient analysis of diversity data. This class is designed to minimize redundancy of underlying analyses so it is best to compute as many statistics as possible with a single instance. It also takes advantage of the object reuse policy, improving the efficiency of analyses when several (and especially many) datasets are examined in a row by the same instance of ComputeStats.

The constructor takes arguments that are automatically passed to the configure() method.

Statistics to compute are set using the method add_stats() which allows specifying several statistics at once and which can also be called several times to add more statistics to compute.

Methods

 add_stats(stat, ...) Add one or more statistics to compute. Add all possible statistics. Clear the list of statistics to compute. configure([struct, multi, multi_hits, maf, ...]) Configure the instance. List of available statistics. process_align(align[, positions, max_missing]) Analyze an alignment. process_freq(frq[, position]) Analyze already computed frequencies. process_site(site[, position]) Analyze a site. process_sites(sites[, positions]) Analyze a list of sites. Reset all currently computed statistics. Get computed statistics. set_structure(struct) Set the structure object.

Add one or more statistics to compute. Every statistic identifier must be among the list of available statistics, regardless of what data is to be analyzed. If statistics cannot be computed, they will be returned as None.

all_stats()[source]

Add all possible statistics. Those who cannot be computed will be reported as None. Also reset all currently computed statistics (if any).

clear_stats()[source]

Clear the list of statistics to compute.

configure(struct=None, multi=False, multi_hits=False, maf=0.0, LD_min_n=2, LD_max_maj=1.0, LD_multiallelic=0, LD_min_freq=0, Rmin_oriented=False)[source]

Configure the instance. The values provided for parameters will affect all subsequent analyses. Calling this method resets all statistics.

Parameters:
• struct – a Structure instance describing the structure of objects that will be passed.

• multi – process several alignments, set of sites, or sites in a row and only yield statistics when results() is called (each of the method considered will return None).

• multi_hits – allow multiple mutations at the same site (some statistics, like Rmin or Wall’s B and Q, never consider multiple mutations regardless of this option).

• maf – minimum minor allele frequency (sites for which the relative minority allele frequency is lower than this will be excluded).

• LD_min_n – Minimal number of non-missing samples. Allows to specify a more stringent filter than max_missing. Only considered for calculating Rozas et al.’s and Kelly’s statistics (the most stringent of the two criteria applies).

• LD_max_maj – Maximal relative frequency of the main allele. Only considered for calculating Rozas et al.’s and Kelly’s statistics.

• LD_multiallelic – One of 0 (ignore them), 1 (use main allele only), and 2 (use all possible pairs of alleles). Defines what is done for pairs of sites of which one or both have more than two alleles (while computing linkage disequilibrium). In case of option 2, a filter can be applied with option LD_min_freq. Only considered for calculating Rozas et al.’s and Kelly’s statistics.

• LD_min_freq – Only considered if option 2 is used for LD_multiallelic. Only consider alleles that are in absolute frequency equal to or larger than the given value. Only considered for calculating Rozas et al.’s and Kelly’s statistics

• Rmin_oriented – Only for computing Rmin: use only orientable sites.

list_stats()[source]

List of available statistics. Returns a list of tuples giving, for each available statistic, its code and a short description.

process_align(align, positions=None, max_missing=0.0)[source]

Analyze an alignment.

Parameters:
• align – an Align instance.

• positions – a list, or other sequence of positions for all sites, or None. If None, use the index of each site. Otherwise, must be a sequences of integer values (length matching the number of sites).

• max_missing – Maximum proportion of missing data (excluding outgroup). The default is to exclude all sites with missing data. Missing data include all ambiguity characters and alignment gaps. Sites not passing this threshold are ignored for all analyses.

Returns:

A dictionary of statistics, unless multi is set.

process_freq(frq, position=None)[source]

Parameters:
• frq – a Freq instance.

• position – position of the site. Must be an integer value. By default, use the site loading index.

Returns:

A dictionary of statistics, unless multi is set.

process_site(site, position=None)[source]

Analyze a site.

Parameters:
• site – a Site instance.

• position – position of the site. Must be an integer value. By default, use the site index.

Returns:

A dictionary of statistics, unless multi is set.

process_sites(sites, positions=None)[source]

Analyze a list of sites.

Parameters:
• sites – a list, or other sequence of Site instance. Or a sliding window as object of the VcfWindow with a number of sites greater than 0.

• positions – a list, or other sequence of positions for all sites, or None. If None, use the index of each site. Otherwise, must be a sequences of integer values (length matching the length of sites).

Returns:

A dictionary of statistics, unless multi is set.

reset()[source]

Reset all currently computed statistics. Keep the list of statistics to compute.

results()[source]

Get computed statistics. Return the value of statistics for all sites since the last call to this method, to reset(), or any addition of statistics, or the object creation, whichever is most recent. For statistics that can not be computed, None is returned.

set_structure(struct)[source]

Set the structure object. Reset statistics but don’t change the list of statistics to compute and don’t change over parameter values.

class egglib.stats.EHH[source]

Extended Haplotype Homozygosity statistics. Statistics can be computed for unphased genotypic data.

The usage of this class is to: first, set the core haplotypes by passing a Site to set_core(), and then load repetitively distant sites (always with increasing distance from the core), through load_distant(), until the list of sites to process is exhausted or one of the thresholds has been reached.

In order to process distant sites using the same core region to the opposite direction, or to use a different core region, it is always required to call set_core() again (this is the only way to reset the instance).

After at least one distant site site is loaded, EHH statistics can be accessed using there accessors. EHH statistics fell into four categories:

1. Raw EHH statistics, provided for each loaded distant sites and separately for each core haplotype. See the methods: get_EHH(), get_EHHc(), and get_rEHH().

Reference: Sabeti P.C., D.E. Reich, J.M. Higgins, H.Z.P. Levine, D.J. Richter, S.F. Schaffner, S.B. Gabriel, J.V. Platko, N.J. Patterson, G.J. McDonald, H.C. Ackerman, S.J. Campbell, D. Altshuler, R. Cooper, D. Kwiatkowski, R. Ward & E.S. Lander. 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832-837.

2. Integrated EHH statistics, computed separately for each core haplotype and incremented at each provided sites until the EHH value reaches the thresholds provided as the EHH_thr and EHHc_thr arguments to set_core(). See the methods get_iHH(), get_iHHc(), and get_iHS(). The methods done_EHH() and done_EHHc() allow to check whether the threshold has been reached for all genotypes.

Reference: Voight B.F., S. Kudaravalli, X. Wen & J.K. Pritchard. 2006. A map of recent positive selection in the human genome. PLoS Biol 4: e772.

3. Whole-site EHHS statistic and its integrated statistic iES, which is incremented while the EHHS value is larger than or equal to the threshold provided as the EHHS_thr argument to set_core(). See the methods get_EHHS() and get_iES(). If data are unphased, a specific EHHS estimate based on homozygosity is provided (EHHG, with its iEG integrated counter-part).

Reference: Tang K., K.R. Thornton & M. Stoneking. 2007. A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol. 5: e171.

4. EHH, EHHS, and EHHG decay statistics, that give the minimal distance at which, respectively, EHH starts to be smaller than the threshold provided as the EHH_thr argument to set_core(), EHHS starts to be smaller than the threshold provided as the EHHS_thr argument to set_core(), and EHHG starts to be smaller than the threshold provided as the EHHG_thr argument to set_core(). EHH decay is computed separately for each core haplotype. See the methods get_dEHH(), get_dEHHS(), and get_dEHHG(). The values are not available until the respective threshold is reached (None is returned). The method done_dEHH() allows to check whether the threshold for the EHH decay has been reached for all core haplotypes. The maximum and average value of the EHH decay across all core haplotypes can be accessed with get_dEHH_max() and get_dEHH_mean(), respectively.

Reference: Ramírez-Soriano A., S.E. Ramos-Onsins, J. Rozas, F. Calafell & A. Navarro. 2008. Statistical power analysis of neutrality tests under demographic expansions, contractions and bottlenecks with recombination. Genetics 179 : 555-567.

In all cases, None is returned when the value is not available (no data loaded, division by zero in the cases or ratios, or threshold not reached in the case of decay statistics).

The thresholds must all be within the range between 0 and 1.

Attributes

 cur_haplotypes Current number of haplotypes. nsam Current number of non-missing samples. num_haplotypes Number of core haplotypes.

Methods

 Completed EHH calculation. Completed EHHc calculation. Get an EHH value. Get EHHG value. Get EHHS value. Get an EHHc value. EHH decay. EHHG decay. EHHS decay. Maximum EHH decay. Average EHH decay. EHHc decay. Get iEG value. Get iES value. Get an iHH value. Get an iHHc value. Get an iHS value. Get a rEHH value. load_distant(site) Process a distant site. nsam_core(hap) Number of samples for a core haplotype. nsam_hap(hap) Number of samples for a haplotype. set_core(site[, struct, min_freq, min_sam, ...]) Specify the core site.
property cur_haplotypes

Current number of haplotypes. None if core has not been set, equal to num_haplotypes if not distant site has been loaded.

done_EHH()[source]

Completed EHH calculation. Return True if the values of iHH for all core haplotypes have completed integrating (and all dEHH values have been evaluated).

done_EHHc()[source]

Completed EHHc calculation. Return True if the values of iHHc for all core haplotypes have completed integrating (and all dEHH values have been evaluated).

get_EHH(i)[source]

Get an EHH value. Get the EHH value for the last processed distant site for core haplotype i. Return None if the value cannot be computed (no available samples).

get_EHHG()[source]

Get EHHG value. Get the EHHS (computed with genotypes) value for the last processed distant site. Return None if the value cannot be computed (no available samples, no indiv_struct option used).

get_EHHS()[source]

Get EHHS value. Get the EHHS value for the last processed distant site. Return None if the value cannot be computed (no available samples).

get_EHHc(i)[source]

Get an EHHc value. Get the EHHc value for the last processed distant site for core haplotype i. Return None if the value cannot be computed (no available samples).

get_dEHH(i)[source]

EHH decay. Get the EHH decay distance for core haplotype i. Return None if the EHH threshold has not been reached.

get_dEHHG()[source]

EHHG decay. Get the EHHS (computed with genotypes) decay distance. Return None if the EHHG threshold has not been reached.

get_dEHHS()[source]

EHHS decay. Get the EHHS decay distance. Return None if the EHHS threshold has not been reached.

get_dEHH_max()[source]

Maximum EHH decay. Get the maximum EHH decay distance across core haplotypes. Return None if the EHH threshold has not been reached for at least one of the core haplotypes.

get_dEHH_mean()[source]

Average EHH decay. Get the average EHH decay distance across core haplotypes. Return None if the EHH threshold has not been reached for at least one of the core haplotypes.

get_dEHHc(i)[source]

EHHc decay. Get the EHHc decay distance for core haplotype i. Return None if the EHHc threshold has not been reached.

get_iEG()[source]

Get iEG value. Get the iES (computed with genotypes) value for the last processed distant site. Return None if the value cannot be computed (no available samples or indiv_struct option was not used).

get_iES()[source]

Get iES value. Get the iES value for the last processed distant site. Return None if the value cannot be computed (no available samples).

get_iHH(i)[source]

Get an iHH value. Get the iHH value for the last processed distant site for core haplotype i . Return None if the value cannot be computed (no available samples).

get_iHHc(i)[source]

Get an iHHc value. Get the iHHc value for the last processed distant site for core haplotype i. Return None if the value cannot be computed (no available samples).

get_iHS(i)[source]

Get an iHS value. Get the iHS value for the last processed distant site for core haplotype i. Return None if the ratio cannot be computed (no available sample or division by zero).

get_rEHH(i)[source]

Get a rEHH value. Get the rEHH value for the last processed distant site for core haplotype i. Return None if the ratio cannot be computed (no available sample or division by zero).

Process a distant site. The core site must have been specified.

Parameters:

site – a Site instance containing data for the distant site. It must be consistent with the core site (same list of samples, and in the same order). Missing data are supported.

property nsam

Current number of non-missing samples.

nsam_core(hap)[source]

Number of samples for a core haplotype. Number of non-missing samples for one of the core haplotypes.

nsam_hap(hap)[source]

Number of samples for a haplotype. Number of non-missing samples for one of the current haplotypes.

property num_haplotypes

Number of core haplotypes. None if core has not been set.

set_core(site, struct=None, min_freq=None, min_sam=2, EHH_thr=None, EHHc_thr=None, EHHS_thr=None, EHHG_thr=None, crop_EHHS=False)[source]

Specify the core site. If the instance already contains data, it will be reset.

Parameters:
• site – a Site instance. It is assumed that data are phased (samples should be phased; but it is possible to load genotypes whose phase is unknown, as long as the genotypes are phased at the individual level: see indiv_struct).

• struct – This argument must be set to a non-None value if genotypic data are provided and if unphased versions of statistics should be computed. If not None, this argument should either be a Structure object mapping samples to individuals (other structure levels are not considered), or an integer specifying the ploidy (in the latter case, samples of any individuals are supposed to be consecutive in the site. It is still required that individuals are phased (that is, that they are entered in the same order in all distant sites). The integer 0 is accepted as a synonym of None (data will be treated as haploid).

• min_freq – minimal absolute frequency for haplotypes (haplotypes with lower frequencies are ignored). By default, all haplotypes are considered.

• min_sam – minimal absolute number of non-missing samples to compute statistics or stop incrementing them (for statistics expressed per core haplotype, number of samples for this haplotype).

• EHH_thr – threshold determining when iHH should stop integrating and dEHH must be evaluated. By default (None), iHH is permanently incremented and dEHH is not evaluated at all.

• EHHc_thr – threshold determining when iHH should stop integrating and dEHH must be evaluated. By default (if None), use the same value as for EHH_thr.

• EHHS_thr – threshold determining when iES should stop integrating and dEHHS must be evaluated. By default (None), iES is permanently incremented and dEHHS is not evaluated at all.

• EHHG_thr – threshold determining when iEG (iES computed for unphased data) should stop integrating and dEHHG (equivalent to dEHHS) must be evaluated. Must be None if genotypes is True. By default (None), iEG is permanently incremented and dEHHG is not evaluated at all. By default: equal to EHHS_thr.

• crop_EHHS – if True, set values of EHHS that are below the threshold to 0 to emulate the behaviour of the R package rehh (also affects iES).

egglib.stats.pairwise_LD(locus1, locus2, struct=None, multiple_policy='main', min_freq=0)[source]

Linkage disequilibium between a pair of loci.

Site instances must be used to describe loci. Only the order of samples is considered and both loci must have the same number of samples.

Parameters:
• locus1 – A Site instance.

• locus2 – A Site instance.

• struct – A Structure object used to describe the individual level. If specified, genotypes are automatically described and genotypic linkage is processed. By default, consider all samples as originating from haploid individuals.

• multiple_policy – Determine what is done if either input locus has more than two alleles. Possible values are, "forbid" (raise an exception if this occurs), "main" (take the most frequent allele of each locus) and "average" ( compute the unweighted average over all possible pair of alleles). More options might be added in future versions. This option is ignored if both loci have less than three alleles. If "main" and there are several equally most frequent alleles, the first-occurring one is used (arbitrarily).

• min_freq – Only used if at least one site has more than two alleles and multiple_policy is set to average. Set the minimum absolute frequency to consider an allele.

Returns:

A dictionary of linkage disequilibrium statistics. In case statistics cannot be computed (either site fixed, or less than two samples with non-missing data at both sites), computed values are replaced by None. n gives the number of pairs of alleles considered (0 if statistics are not computed at all).

egglib.stats.matrix_LD(align, stats, struct=None, multiple_policy='main', min_freq=0, min_n=2, max_maj=1.0, positions=None)[source]

Linkage disequilibrium statistics between all pairs of sites. The computed statistics are selected by an argument of this function. Return a matrix (as a nested list) of the requested statistics. In all cases, all pairs of sites are present in the returned matrices. If statistics cannot be computed, they are replaced by None.

The available statistics are:

• d – Distance between sites of the pairs.

• D – Standard linkage disequilibrium.

• Dp – Lewontin’s D’.

• r – Correlation coefficient.

• rsq – Equivalent to r2.

Parameters:
• align – An Align instance.

• stats – Requested statistic or statistics (see list of available statistics above, as a single string or as a list of one or more of these statistics (in any order).

• struct – A Structure object used to describe the individual level. If specified, genotypes are automatically described and genotypic linkage is processed. By default, consider all samples as originating from haploid individuals.

• multiple_policy – Specify what is done for pairs of sites for which at least one locus has only one allele. See pairwise_LD() for further description.

• min_freq – Only used if at least one site has more than two alleles and depending on the value of multiple_policy. See pairwise_LD() for further description.

• min_n – Minimum number of samples used (this value must always be larger than 1). Sites not fulfilling this criterion will be dropped.

• max_maj – Maximum relative frequency of the majority allele. Sites not fulfilling this criterion will be dropped.

• positions – A sequence of positions, whose length must match the number of sites of the provided alignment. Used in the return value to describe the used sites, and, if requested, to compute the distance between sites. By default, the position of sites in the original alignment is used.

Returns:

Returns a tuple with two items: first is the list of positions of sites used in the matrix (a subset of the sites of the provided alignment), with positions provided by the corresponding argument (by default, the index of sites); second is the matrix, as the nested lower half matrix. The matrix contains items for all i and j indexes with 0 <= j <= i < n where n is the number of retained sites. The content of the matrix is represented by a single value (if a single statistic has been requested) or as a list of 1 or more values (if a list of 1 or more, accordingly, statistics have been requested), or None for the diagonal or if the pairwise comparison was dropped for any reason.

class egglib.stats.ProbaMisoriented(align=None, struct=None)[source]

Error rate of polymorphism orientation using an outgroup. Only sites that are either variable within the ingroup or have a fixed difference with respect to the outgroup are considered. Sites with more than two different alleles in the ingroup, or more than one allele in the outgroup, are ignored.

This function is an implementation of the method mentioned in Baudry and Depaulis (2003), allowing to estimate the probability that a site oriented using the provided outgroup have be misoriented due to a homoplasic mutation in the branch leading to the outgroup. Note that this estimation neglects the probability of shared polymorphism.

Reference: Baudry, E. & F. Depaulis. 2003. Effect of misoriented sites on neutrality tests with outgroup. Genetics  165: 1619-1622.

Parameters:

If the instance is created with an alignment as constructor argument, then the statistics are computed. The method load_align() does the same from an existing ProbaMisoriented instance. Otherwise, individual sites can be loaded with load_site(), and then the statistics can be computed using compute() (the latter is preferable if generate Freq instances for an other use).

New in version 3.0.0.

Attributes

 D Number of loaded sites with a fixed difference to the outgroup. S Number of loaded polymorphic sites. TiTv Transition and transversion rate ratio. pM Probability of misorientation.

Methods

 Compute pM and TiTv statistics. load_align(align, struct) Load all sites of align that meet criteria. load_site(freq) Load a single site. Clear all loaded or computed data.
property D

Number of loaded sites with a fixed difference to the outgroup.

property S

Number of loaded polymorphic sites. Only the ingroup is considered to classify a site as polymorphic.

property TiTv

Transition and transversion rate ratio. None if the value cannot be computed (no loaded data or null transversion rate). Requires that compute() has been called.

compute()[source]

Compute pM and TiTv statistics. Requires that sites have been loaded using load_site(). This method does not reset the instance.

Load all sites of align that meet criteria. If there are previously loaded data, they are discarded. This method computes statistics. Data are required to be DNA sequences.

Parameters:

Load a single site. If there are previously loaded data, they are retained. To actually compute the misorientation probability, the user must call compute().

Parameters:

freq – a Freq instance.

Warning

The origin of data must be DNA sequences. This is currently not checked here.

property pM

Probability of misorientation. None if the value cannot be computed (no loaded data, no valid polymorphism, null transversion rate). Requires that compute() has been called.

reset()[source]

Clear all loaded or computed data.

egglib.stats.haplotypes_from_align(aln, max_missing=0.0, impute_threshold=0, struct=None, dest=None, multiple=False)[source]

Identification of haplotypes from an alignment. Identify haplotypes an Align instance and return data as a single Site instance containing one sample for each sample of the original data. Alleles in the returned site are representing all identified haplotypes (or missing data when the haplotypes could not be derived).

Note

There must be at least one site with at least two alleles (overall, including the outgroup), otherwise the produced site only contains missing data.

Parameters:
• aln – an Align instance.

• impute_threshold – by default, all samples with a least one occurrence of missing data will be treated as missing data. If this argument is more than 0, the provided value will be used as maximum number of missing data. All samples with as many or less missing data will be processed to determine which extant of haplotype they might belong (to which they are identical save for missing data). If there is only one such haplotype, the corresponding samples will be treated as a repetition of this haplotype. This option will never allow detecting new haplotypes. Only small values of this option make sense.

• struct – a Structure instance defining the samples to process. The population and cluster structures are not used. If the ploidy is larger than 1, the individuals are used, and sites are assumed to be phased.

• max_missing – maximum proportion of missing data in the ingroup to process a site.

• dest – a Site instance that will be reset and in which data will be placed. If specified, this function returns nothing. By default, the function returns a new Site instance.

• multiple – allow sites with more than two alleles in the ingroup.

Returns:

A Site instance, (if dest is None) or None (otherwise).

Note

If an empty list of sites is provided and no structure is provided, a ValueError is raised. To support smoothly possibly empty lists, the used is prompted to provide a structure.

egglib.stats.haplotypes_from_sites(sites, impute_threshold=0, struct=None, dest=None, multiple=False)[source]

Identification of haplotypes from a list of sites. Similar to haplotypes_from_align() but takes a list of Site instances. No option max_missing is available; all sites are always considered.

class egglib.stats.CodingDiversity(*args, **kwargs)[source]

Detection of synonymous and non-synonymous sites. This class processes alignments with a reading frame specification in order to detect synonymous and non-synonymous variable positions. It provides basic statistics, but it can also filter data to let the user compute all other statistics on synonymous-only, or non-synonymous-only variation (e.g. $$\pi$$ or D).

The constructor takes optional arguments. By default, build an empty instance. If arguments are passed, they must match the signature of process() that will be called.

Note that codons with missing data are never considered, even if the resolution of all possibilities consistent translate to the same nucleotide. This applies to stop and start codons. Alternative start codons are not considered either.

Attributes

 num_codons_eff Number of analysed codon sites. num_codons_stop Number of codon sites with at least one codon stop. num_codons_tot Total number of considered codon sites. num_multiple_alleles Number of polymorphic coding sites with more than two alleles. num_multiple_hits Number of polymorphic codons for which more than one position is changed. num_pol_NS Number of polymorphic codon sites with only non-synonymous variation. num_pol_S Number of polymorphic codon sites with only synonymous variation. num_pol_single Number of polymorphic coding sites with only one mutation. num_sites_NS Estimated number of non-synonymous sites. num_sites_S Estimated number of synonymous sites. positions_NS List of positions of sites with only non-synonymous variation positions_S List of positions of sites with only synonymous variation sites_NS List of sites with only non-synonymous variation sites_S List of sites with only synonymous variation

Methods

 process(align[, code, struct, max_missing, ...]) Process an alignment.
property num_codons_eff

Number of analysed codon sites. Like num_codons_tot but excluding sites rejected because of missing data.

property num_codons_stop

Number of codon sites with at least one codon stop.

property num_codons_tot

Total number of considered codon sites. Only complete codons have been considered, but this value includes codons that have been rejected because of missing data.

property num_multiple_alleles

Number of polymorphic coding sites with more than two alleles. These sites are included only if multiple_alleles is True except those who mix synonymous and non-synonymous changes (they can be rejected if there are more than two alleles in total as well).

property num_multiple_hits

Number of polymorphic codons for which more than one position is changed. These sites are included only if multiple_hits is True and depdenting on the total number of alleles.

property num_pol_NS

Number of polymorphic codon sites with only non-synonymous variation.

property num_pol_S

Number of polymorphic codon sites with only synonymous variation.

property num_pol_single

Number of polymorphic coding sites with only one mutation. All these sites are always included.

property num_sites_NS

Estimated number of non-synonymous sites. Note that the total number of sites per codon is always 3.

The numbers of non-synonymous and synonymous sites are estimated using the method of Nei & Gojobori (Mol. Biol. Evol. 1986 3:418-426).

property num_sites_S

Estimated number of synonymous sites. Note that the total number of sites per codon is always 3.

property positions_NS

List of positions of sites with only non-synonymous variation

property positions_S

List of positions of sites with only synonymous variation

process(align, code=1, struct=None, max_missing=0.0, skipstop=True, raise_stop=False, multiple_alleles=False, multiple_hits=False)[source]

Process an alignment. It this instance already had data in memory, they will all be erased.

Parameters:
• align – an Align instance containing the coding sequence to process. The alphabet must be codons.

• code – genetic code identifier (see here). Required to be an integer among the valid values. The default value is the standard genetic code.

• struct – a class:Structure object specifying samples to include and outgroup samples to skip for analysis (outgroup samples are ignored to determine if site is polymorphic, coding, or non-coding, but included in generated sites). By default, use all samples.

• max_missing – maximum proportion of missing data to allow (including stop codons if skipstop if true). By default, all codon sites with any missing data are excluded. Outgroup is not considered.

• skipstop – if True, stop codons are treated as missing data and skipped. If so, potential mutations to stop codons are not taken into account when estimating the number of non-synonymous sites. Warning (this may be counter-intuitive): it actually assumes that stop codons are not biologically plausible and considers them as missing data. On the other hand, if skipstop is False, it considers stop codons as if they were valid amino acids. This option has no effect if raise_stop is True.

• raise_stop – raise a ValueError if a stop codon is met. If True, skipstop has no effect. Outgroup is not considered.

• multiple_alleles – include coding sites with more than two alleles (regardless of whether mutations hit the same position within the triplet). If there are more than two different codons, they must either encode for the same amino acid or all encode for different amino acids (otherwise the site is rejected). If more than one of the three codon position are mutated, the option multiple_hits is considered.

• multiple_hits – include coding sites for which more than one of the three positions has changed (regardless of the number of alleles). If there are more than two alleles, the option multiple_alleles is also considered.

property sites_NS

List of sites with only non-synonymous variation

property sites_S

List of sites with only synonymous variation

egglib.stats.paralog_pi(align, struct_p, struct_i, max_missing=0)[source]

Compute diversity statistics for a gene family. Based on Innan (Genetics 2003 163:803-810). An estimate of genetic diversity is provided for every paralog and for every pair of paralogs, provided that enough non-missing data is available (at least 2 samples are required). Note that sites with more than two alleles are always considered.

Parameters:
• align – an Align containing the sequence of all available paralog for all samples. The outgroup is ignored.

• struct_p – a Structure providing the organisation of sequences in paralogs. This structure must have a ploidy of 1 (no individual structure). Clusters, if defined, are ignored. The sequences of all individuals for a given paralog should be grouped together in that structure. There might be a different number of sequences per paralog due to missing data.

• struct_i – a Structure providing the organisation of sequences in individuals. This structure must have a ploidy of 1 (no individual structure). Clusters, if defined, are ignored. The sequences of all paralogs for a given individual should be grouped together in that structure. There might be a different number of sequences per individual due to missing data.

• max_missing – maximum proportion of missing data (if there are more missing data at a site, the site is ignored altogether).

Returns:

A new ParalogPi instance which provides methods to access the number of used sites and the diversity for each paralog/paralog pair.

class egglib.stats.ParalogPi[source]

Compute diversity statistics for a gene family. See paralog_pi() for more details. This class can be used directly (1) to analyse data with more efficiency (by reusing the same instance) or (2) to combine data from different alignments, or (3) for pass individual sites. Do first call setup().

Methods

 Pib(i, j) Between-paralog diversity for paralogs i and j. Within-paralog diversity for paralog i. num_sites([i[, j]]) Number of sites with data. Process an alignment. process_site(site) Process a site. setup(struct_p, struct_i[, max_missing]) Specify the structure in paralog and individuals.
Pib(i, j)[source]

Between-paralog diversity for paralogs i and j.

Piw(i)[source]

Within-paralog diversity for paralog i.

num_sites([i[, j]])[source]

Number of sites with data. Number of sites: with any data (without arguments), with data for paralog i (if only i specified), with data for the pair of paralogs i and j (if both specified). This counts the number of sites which have not been excluded based on the num_missing argument, and which have at least one pair of samples for the considered sample.

process_align(aln)[source]

Process an alignment. The alignment must match the structure passed to setup(). Diversity estimates are incremented (no reset).

Parameters:

aln – an Align instance.

process_site(site)[source]

Process a site. The site must match the structure passed to setup(). Diversity estimates are incremented (no reset).

Parameters:

site – a Site instance.

setup(struct_p, struct_i, max_missing=0)[source]

Specify the structure in paralog and individuals. The arguments are as as described for paralog_pi(). Only this method resets the instance.