Diversity statistics¶
This module contains tools to analyse diversity based on population genetics statistics.
The statistics are defined in a dedicated section.
Customizable and efficient analysis of diversity data. |
|
Extended Haplotype Homozygosity statistics. |
|
|
Linkage disequilibium between a pair of loci. |
|
Linkage disequilibrium statistics between all pairs of sites. |
|
Error rate of polymorphism orientation using an outgroup. |
|
Identification of haplotypes from an alignment. |
|
Identification of haplotypes from a list of sites. |
|
Detection of synonymous and non-synonymous sites. |
|
Compute diversity statistics for a gene family. |
Compute diversity statistics for a gene family. |
- class egglib.stats.ComputeStats(...)[source]¶
Customizable and efficient analysis of diversity data. This class is designed to minimize redundancy of underlying analyses so it is best to compute as many statistics as possible with a single instance. It also takes advantage of the object reuse policy, improving the efficiency of analyses when several (and especially many) datasets are examined in a row by the same instance of
ComputeStats
.The constructor takes arguments that are automatically passed to the
configure()
method.Statistics to compute are set using the method
add_stats()
which allows specifying several statistics at once and which can also be called several times to add more statistics to compute.Methods
add_stats
(stat, ...)Add one or more statistics to compute.
Add all possible statistics.
Clear the list of statistics to compute.
configure
([struct, multi, multi_hits, maf, ...])Configure the instance.
List of available statistics.
process_align
(align[, positions, max_missing])Analyze an alignment.
process_freq
(frq[, position])Analyze already computed frequencies.
process_site
(site[, position])Analyze a site.
process_sites
(sites[, positions])Analyze a list of sites.
reset
()Reset all currently computed statistics.
results
()Get computed statistics.
set_structure
(struct)Set the structure object.
- add_stats(stat, ...)[source]¶
Add one or more statistics to compute. Every statistic identifier must be among the list of available statistics, regardless of what data is to be analyzed. If statistics cannot be computed, they will be returned as
None
.
- all_stats()[source]¶
Add all possible statistics. Those who cannot be computed will be reported as
None
. Also reset all currently computed statistics (if any).
- configure(struct=None, multi=False, multi_hits=False, maf=0.0, LD_min_n=2, LD_max_maj=1.0, LD_multiallelic=0, LD_min_freq=0, Rmin_oriented=False, triconfig_min=2)[source]¶
Configure the instance. The values provided for parameters will affect all subsequent analyses. Calling this method resets all statistics.
- Parameters:
struct – a
Structure
instance describing the structure of objects that will be passed.multi – process several alignments, set of sites, or sites in a row and only yield statistics when
results()
is called (each of the method considered will returnNone
).multi_hits – allow multiple mutations at the same site (some statistics, like Rmin or Wall’s B and Q, never consider multiple mutations regardless of this option).
maf – minimum minor allele frequency (sites for which the relative minority allele frequency is lower than this will be excluded).
LD_min_n – Minimal number of non-missing samples. Allows to specify a more stringent filter than max_missing. Only considered for calculating Rozas et al.’s and Kelly’s statistics (the most stringent of the two criteria applies).
LD_max_maj – Maximal relative frequency of the main allele. Only considered for calculating Rozas et al.’s and Kelly’s statistics.
LD_multiallelic – One of 0 (ignore them), 1 (use main allele only), and 2 (use all possible pairs of alleles). Defines what is done for pairs of sites of which one or both have more than two alleles (while computing linkage disequilibrium). In case of option 2, a filter can be applied with option LD_min_freq. Only considered for calculating Rozas et al.’s and Kelly’s statistics.
LD_min_freq – Only considered if option 2 is used for LD_multiallelic. Only consider alleles that are in absolute frequency equal to or larger than the given value. Only considered for calculating Rozas et al.’s and Kelly’s statistics
Rmin_oriented – Only for computing Rmin: use only orientable sites.
triconfig_min – Only for site configurations with three populations: minimal number of available sample per population.
- list_stats()[source]¶
List of available statistics. Returns a list of tuples giving, for each available statistic, its code and a short description.
- process_align(align, positions=None, max_missing=0.0)[source]¶
Analyze an alignment.
- Parameters:
align – an
Align
instance.positions – a list, or other sequence of positions for all sites, or
None
. IfNone
, use the index of each site. Otherwise, must be a sequences of integer values (length matching the number of sites).max_missing –
Maximum relative proportion of missing data (excluding outgroup). The default is to exclude all sites with missing data. Missing data include all ambiguity characters and alignment gaps. Sites not passing this threshold are ignored for all analyses.
Note
Here, max_missing is a relative proportion to allow using the same value for different alignments that might not have the same number of samples (to avoid reevaluating max_missing if the user wants the same maximum rate of missing data). In other functions, max_missing is the maximum number of missing and is an integer.
- Returns:
A dictionary of statistics, unless multi is set.
- process_freq(frq, position=None)[source]¶
Analyze already computed frequencies.
- Parameters:
frq – a
Freq
instance.position – position of the site. Must be an integer value. By default, use the site loading index.
- Returns:
A dictionary of statistics, unless multi is set.
- process_site(site, position=None)[source]¶
Analyze a site.
- Parameters:
site – a
Site
instance.position – position of the site. Must be an integer value. By default, use the site index.
- Returns:
A dictionary of statistics, unless multi is set.
- process_sites(sites, positions=None)[source]¶
Analyze a list of sites.
- Parameters:
sites – a list, or other sequence of
Site
instance. (VcfWindow
instances with a number of sites greater than 0 are also supported.)positions – a list, or other sequence of positions for all sites, or
None
. IfNone
, use the index of each site. Otherwise, must be a sequences of integer values (length matching the length of sites).
- Returns:
A dictionary of statistics, unless multi is set.
- class egglib.stats.EHH[source]¶
Extended Haplotype Homozygosity statistics. Statistics can be computed for unphased genotypic data.
The usage of this class is to: to first set the core haplotypes by passing a
Site
toset_core()
and then to load repetitively distant sites (always with increasing distance from the core), throughload_distant()
, until the list of sites to process is exhausted or one of the thresholds has been reached.In order to process distant sites using the same core region to the opposite direction, or to use a different core region, it is always required to call
set_core()
again (this is the only way to reset the instance).After at least one distant site site is loaded, EHH statistics can be accessed using there accessors. EHH statistics fell into four categories:
Raw EHH statistics, provided for each loaded distant sites and separately for each core haplotype. See the methods:
get_EHH()
,get_EHHc()
, andget_rEHH()
.Reference
Sabeti P.C., D.E. Reich, J.M. Higgins, H.Z.P. Levine, D.J. Richter, S.F. Schaffner, S.B. Gabriel, J.V. Platko, N.J. Patterson, G.J. McDonald, H.C. Ackerman, S.J. Campbell, D. Altshuler, R. Cooper, D. Kwiatkowski, R. Ward & E.S. Lander. 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832-837.
Integrated EHH statistics, computed separately for each core haplotype and incremented at each provided sites until the EHH value reaches the thresholds provided as the EHH_thr and EHHc_thr arguments to
set_core()
. See the methodsget_iHH()
,get_iHHc()
, andget_iHS()
. The methodsdone_EHH()
anddone_EHHc()
allow to check whether the threshold has been reached for all genotypes.Reference
Voight B.F., S. Kudaravalli, X. Wen & J.K. Pritchard. 2006. A map of recent positive selection in the human genome. PLoS Biol 4: e772.
Whole-site EHHS statistic and its integrated version iES, which is incremented as long that the EHHS value is larger than or equal to the threshold provided as the EHHS_thr argument to
set_core()
. See the methodsget_EHHS()
andget_iES()
. If data are unphased, a specific EHHS estimate based on homozygosity is provided (EHHG, with its iEG integrated counter-part).Reference
Tang K., K.R. Thornton & M. Stoneking. 2007. A new approach for using genome scans to detect recent positive selection in the human genome. PLoS Biol. 5: e171.
EHH, EHHS, and EHHG decay statistics, that give the minimal distance at which, respectively, EHH starts to be smaller than the threshold provided as the EHH_thr argument to
set_core()
, EHHS starts to be smaller than the threshold provided as the EHHS_thr argument toset_core()
, and EHHG starts to be smaller than the threshold provided as the EHHG_thr argument toset_core()
. EHH decay is computed separately for each core haplotype. See the methodsget_dEHH()
,get_dEHHS()
, andget_dEHHG()
. The values are not available until the respective threshold is reached (None
is returned). The methoddone_EHH()
allows to check whether the threshold for the EHH decay has been reached for all core haplotypes. The maximum and average value of the EHH decay across all core haplotypes can be accessed withget_dEHH_max()
andget_dEHH_mean()
, respectively.Reference
Ramírez-Soriano A., S.E. Ramos-Onsins, J. Rozas, F. Calafell & A. Navarro. 2008. Statistical power analysis of neutrality tests under demographic expansions, contractions and bottlenecks with recombination. Genetics 179 : 555-567.
In all cases,
None
is returned when the value is not available (no data loaded, division by zero in the case of ratios, or threshold not reached in the case of decay statistics).The thresholds must all be within the range between 0 and 1.
Attributes
Current number of haplotypes.
Current number of non-missing samples.
Number of core haplotypes.
Methods
done_EHH
()Completed EHH calculation.
Completed EHHc calculation.
get_EHH
(i)Get an EHH value.
get_EHHG
()Get EHHG value.
get_EHHS
()Get EHHS value.
get_EHHc
(i)Get an EHHc value.
get_dEHH
(i)EHH decay.
EHHG decay.
EHHS decay.
Maximum EHH decay.
Average EHH decay.
get_dEHHc
(i)EHHc decay.
get_iEG
()Get iEG value.
get_iES
()Get iES value.
get_iHH
(i)Get an iHH value.
get_iHHc
(i)Get an iHHc value.
get_iHS
(i)Get an iHS value.
get_rEHH
(i)Get a rEHH value.
load_distant
(site)Process a distant site.
nsam_core
(hap)Number of samples for a core haplotype.
nsam_hap
(hap)Number of samples for a haplotype.
set_core
(site[, struct, min_freq, min_sam, ...])Specify the core site.
- property cur_haplotypes¶
Current number of haplotypes.
None
if core has not been set, equal tonum_haplotypes
if no distant site has been loaded.
- done_EHH()[source]¶
Completed EHH calculation. Return
True
if the values of iHH for all core haplotypes have finished integrating (and all dEHH values have been evaluated).
- done_EHHc()[source]¶
Completed EHHc calculation. Return
True
if the values of iHHc for all core haplotypes have finished integrating (and all dEHH values have been evaluated).
- get_EHH(i)[source]¶
Get an EHH value. Get the EHH value for the last processed distant site for core haplotype i. Return
None
if the value cannot be computed (no available samples).
- get_EHHG()[source]¶
Get EHHG value. Get the EHHS (computed with genotypes) value for the last processed distant site. Return
None
if the value cannot be computed (no available samples, no indiv_struct option used).
- get_EHHS()[source]¶
Get EHHS value. Get the EHHS value for the last processed distant site. Return
None
if the value cannot be computed (no available samples).
- get_EHHc(i)[source]¶
Get an EHHc value. Get the EHHc value for the last processed distant site for core haplotype i. Return
None
if the value cannot be computed (no available samples).
- get_dEHH(i)[source]¶
EHH decay. Get the EHH decay distance for core haplotype i. Return
None
if the EHH threshold has not been reached.
- get_dEHHG()[source]¶
EHHG decay. Get the EHHS (computed with genotypes) decay distance. Return
None
if the EHHG threshold has not been reached.
- get_dEHHS()[source]¶
EHHS decay. Get the EHHS decay distance. Return
None
if the EHHS threshold has not been reached.
- get_dEHH_max()[source]¶
Maximum EHH decay. Get the maximum EHH decay distance across core haplotypes. Return
None
if the EHH threshold has not been reached for at least one of the core haplotypes.
- get_dEHH_mean()[source]¶
Average EHH decay. Get the average EHH decay distance across core haplotypes. Return
None
if the EHH threshold has not been reached for at least one of the core haplotypes.
- get_dEHHc(i)[source]¶
EHHc decay. Get the EHHc decay distance for core haplotype i. Return
None
if the EHHc threshold has not been reached.
- get_iEG()[source]¶
Get iEG value. Get the iES (computed with genotypes) value for the last processed distant site. Return
None
if the value cannot be computed (no available samples or indiv_struct option was not used).
- get_iES()[source]¶
Get iES value. Get the iES value for the last processed distant site. Return
None
if the value cannot be computed (no available samples).
- get_iHH(i)[source]¶
Get an iHH value. Get the iHH value for the last processed distant site for core haplotype i . Return
None
if the value cannot be computed (no available samples).
- get_iHHc(i)[source]¶
Get an iHHc value. Get the iHHc value for the last processed distant site for core haplotype i. Return
None
if the value cannot be computed (no available samples).
- get_iHS(i)[source]¶
Get an iHS value. Get the iHS value for the last processed distant site for core haplotype i. Return
None
if the ratio cannot be computed (no available sample or division by zero).
- get_rEHH(i)[source]¶
Get a rEHH value. Get the rEHH value for the last processed distant site for core haplotype i. Return
None
if the ratio cannot be computed (no available sample or division by zero).
- load_distant(site)[source]¶
Process a distant site. The core site must have been specified. The site must have a specified
position
.- Parameters:
site – a
Site
instance containing data for the distant site. It must be consistent with the core site (same list of samples, and in the same order). Missing data are supported.
- property nsam¶
Current number of non-missing samples.
- nsam_core(hap)[source]¶
Number of samples for a core haplotype. Number of non-missing samples for one of the core haplotypes.
- nsam_hap(hap)[source]¶
Number of samples for a haplotype. Number of non-missing samples for one of the current haplotypes.
- property num_haplotypes¶
Number of core haplotypes.
None
if core has not been set.
- set_core(site, struct=None, min_freq=None, min_sam=2, EHH_thr=None, EHHc_thr=None, EHHS_thr=None, EHHG_thr=None, crop_EHHS=False)[source]¶
Specify the core site. If the instance already contains data, it will be reset.
- Parameters:
site – a
Site
instance. The site must have a specifiedposition
. It is assumed that data are phased (samples should be phased; but it is possible to load genotypes whose phase is unknown, as long as the genotypes are phased at the individual level: see indiv_struct).struct – This argument must be set to a non-
None
value if genotypic data are provided and if unphased versions of statistics should be computed. If notNone
, this argument should either be aStructure
object mapping samples to individuals (other structure levels are not considered), or an integer specifying the ploidy (in the latter case, samples of any individuals are supposed to be consecutive in the site. It is still required that individuals are phased (that is, that they are entered in the same order in all distant sites). The integer 0 is accepted as a synonym ofNone
(data will be treated as haploid).min_freq – minimal absolute frequency for haplotypes (haplotypes with lower frequencies are ignored). By default, all haplotypes are considered.
min_sam – minimal absolute number of non-missing samples to compute statistics or stop incrementing them (for statistics expressed per core haplotype, number of samples for this haplotype).
EHH_thr – threshold determining when iHH should stop integrating and dEHH must be evaluated. By default (
None
), iHH is permanently incremented and dEHH is not evaluated at all.EHHc_thr – threshold determining when iHH should stop integrating and dEHH must be evaluated. By default (if
None
), use the same value as for EHH_thr.EHHS_thr – threshold determining when iES should stop integrating and dEHHS must be evaluated. By default (
None
), iES is permanently incremented and dEHHS is not evaluated at all.EHHG_thr – threshold determining when iEG (iES computed for unphased data) should stop integrating and dEHHG (equivalent to dEHHS) must be evaluated. Must be
None
if genotypes isTrue
. By default (None
), iEG is permanently incremented and dEHHG is not evaluated at all. By default: equal to EHHS_thr.crop_EHHS – if True, set values of EHHS that are below the threshold to 0 to emulate the behaviour of the R package rehh (also affects iES).
- egglib.stats.pairwise_LD(locus1, locus2, struct=None, multiple_policy='main', min_freq=0)[source]¶
Linkage disequilibium between a pair of loci.
Site
instances must be used to describe loci. Only the order of samples is considered and both loci must have the same number of samples.- Parameters:
locus1 – A
Site
instance.locus2 – A
Site
instance.struct – A
Structure
object used to describe the individual level. If specified, genotypes are automatically described and genotypic linkage is processed. By default, consider all samples as originating from haploid individuals.multiple_policy – Determine what is done if either input locus has more than two alleles. Possible values are,
"forbid"
(raise an exception if this occurs),"main"
(take the most frequent allele of each locus) and"average"
( compute the unweighted average over all possible pair of alleles). More options might be added in future versions. This option is ignored if both loci have less than three alleles. If"main"
and there are several equally most frequent alleles, the first-occurring one is used (arbitrarily).min_freq – Only used if at least one site has more than two alleles and multiple_policy is set to average. Set the minimum absolute frequency to consider an allele.
- Returns:
A
dict
of linkage disequilibrium statistics. In case statistics cannot be computed (either site fixed, or less than two samples with non-missing data at both sites), computed values are replaced byNone
.n
gives the number of pairs of alleles considered (0 if statistics are not computed at all).
- egglib.stats.matrix_LD(align, stats, struct=None, multiple_policy='main', min_freq=0, min_n=2, max_maj=1.0, positions=None)[source]¶
Linkage disequilibrium statistics between all pairs of sites. The computed statistics are selected by an argument of this function. Return a matrix (as a nested
list
) of the requested statistics. In all cases, all pairs of sites are present in the returned matrices. If statistics cannot be computed, they are replaced byNone
.The available statistics are:
d
– Distance between sites of the pairs.D
– Standard linkage disequilibrium.Dp
– Lewontin’s D’.r
– Correlation coefficient.rsq
– Equivalent to r2.
- Parameters:
align – An
Align
instance.stats – Requested statistic or statistics (see list of available statistics above, as a single string or as a list of one or more of these statistics (in any order).
struct – A
Structure
object used to describe the individual level. If specified, genotypes are automatically described and genotypic linkage is processed. By default, consider all samples as originating from haploid individuals.multiple_policy – Specify what is done for pairs of sites for which at least one locus has only one allele. See
pairwise_LD()
for further description.min_freq – Only used if at least one site has more than two alleles and depending on the value of multiple_policy. See
pairwise_LD()
for further description.min_n – Minimum number of samples used (this value must always be larger than 1). Sites not fulfilling this criterion will be dropped.
max_maj – Maximum relative frequency of the majority allele. Sites not fulfilling this criterion will be dropped.
positions – A sequence of positions, whose length must match the number of sites of the provided alignment. Used in the return value to describe the used sites, and, if requested, to compute the distance between sites. By default, the position of sites in the original alignment is used.
- Returns:
A
tuple
with two items: first is thelist
of positions of sites used in the matrix (a subset of the sites of the provided alignment), with positions provided by the corresponding argument (by default, the index of sites); second is the matrix, as the nested lower half matrix. The matrix contains items for all i and j indexes with 0 <= j <= i < n where n is the number of retained sites. The content of the matrix is represented by a single value (if a single statistic has been requested) or as a list of 1 or more values (if a list of 1 or more, accordingly, statistics have been requested), orNone
for the diagonal or if the pairwise comparison was dropped for any reason.
- class egglib.stats.ProbaMisoriented(align=None, struct=None)[source]¶
Error rate of polymorphism orientation using an outgroup.
Only sites that are either variable within the ingroup or have a fixed difference with respect to the outgroup are considered. Sites with more than two different alleles in the ingroup, or more than one allele in the outgroup, are ignored.
This function is an implementation of the method mentioned in Baudry and Depaulis (2003), allowing to estimate the probability that a site oriented using the provided outgroup have be misoriented due to a homoplasic mutation in the branch leading to the outgroup. Note that this estimation neglects the probability of shared polymorphism.
If the instance is created with an alignment as constructor argument, then the statistics are computed. The method
load_align()
does the same from an existingProbaMisoriented
instance. Otherwise, individual sites can be loaded withload_site()
, and then the statistics can be computed usingcompute()
(the latter is preferable if generateFreq
instances for an other use).Reference
Baudry, E. & F. Depaulis. 2003. Effect of misoriented sites on neutrality tests with outgroup. Genetics 165: 1619-1622.
New in version 3.0.0.
Attributes
Number of loaded sites with a fixed difference to the outgroup.
Number of loaded polymorphic sites.
Transition and transversion rate ratio.
Probability of misorientation.
Methods
compute
()load_align
(align, struct)Load all sites of align that meet criteria.
load_site
(freq)Load a single site.
reset
()Clear all loaded or computed data.
- property D¶
Number of loaded sites with a fixed difference to the outgroup.
- property S¶
Number of loaded polymorphic sites. Only the ingroup is considered to classify a site as polymorphic.
- property TiTv¶
Transition and transversion rate ratio.
None
if the value cannot be computed (no loaded data or null transversion rate). Requires thatcompute()
has been called.
- compute()[source]¶
Compute
pM
andTiTv
statistics. Requires that sites have been loaded usingload_site()
. This method does not reset the instance.
- load_align(align, struct)[source]¶
Load all sites of align that meet criteria. If there are previously loaded data, they are discarded. This method computes statistics. Data are required to be DNA sequences.
- load_site(freq)[source]¶
Load a single site. If there are previously loaded data, they are retained. To actually compute the misorientation probability, the user must call
compute()
.- Parameters:
freq – a
Freq
instance.
Warning
The origin of data must be DNA sequences. This is currently not checked here.
- egglib.stats.haplotypes_from_align(aln, max_missing=0.0, impute_threshold=0, struct=None, dest=None, multiple=False)[source]¶
Identification of haplotypes from an alignment. Identify haplotypes an
Align
instance and return data as a singleSite
instance containing one sample for each sample of the original data. Alleles in the returned site are representing all identified haplotypes (or missing data when the haplotypes could not be derived).Note
There must be at least one site with at least two alleles (overall, including the outgroup), otherwise the produced site only contains missing data.
- Parameters:
aln – an
Align
instance.impute_threshold – by default, all samples with a least one occurrence of missing data will be treated as missing data. If this argument is more than 0, the provided value will be used as maximum number of missing data. All samples with as many or less missing data will be processed to determine which extant of haplotype they might belong (to which they are identical save for missing data). If there is only one such haplotype, the corresponding samples will be treated as a repetition of this haplotype. This option will never allow detecting new haplotypes. Only small values of this option make sense.
struct – a
Structure
instance defining the samples to process. The population and cluster structures are not used. If the ploidy is larger than 1, the individuals are used, and sites are assumed to be phased.max_missing –
maximum relative proportion of missing data in the ingroup to process a site.
Note
Here, max_missing is a relative proportion to allow using the same value for different alignments that might not have the same number of samples (to avoid reevaluating max_missing if the user wants the same maximum rate of missing data). In other functions, max_missing is the maximum number of missing and is an integer.
dest – a
Site
instance that will be reset and in which data will be placed. If specified, this function returns nothing. By default, the function returns a newSite
instance.multiple – allow sites with more than two alleles in the ingroup.
- Returns:
A
Site
instance, (if dest isNone
) orNone
(otherwise).
Note
If an empty list of sites is provided and no structure is provided, a
ValueError
is raised. To support smoothly possibly empty lists, the used is prompted to provide a structure.
- egglib.stats.haplotypes_from_sites(sites, impute_threshold=0, struct=None, dest=None, multiple=False)[source]¶
Identification of haplotypes from a list of sites. Similar to
haplotypes_from_align()
but takes a list ofSite
instances. No option max_missing is available; all sites are always considered.
- class egglib.stats.CodingDiversity(*args, **kwargs)[source]¶
Detection of synonymous and non-synonymous sites. This class processes alignments with a reading frame specification in order to detect synonymous and non-synonymous variable positions. It provides basic statistics, but it can also filter data to let the user compute all other statistics on synonymous-only, or non-synonymous-only variation (e.g. \(\pi\) or
D
).The constructor takes optional arguments. By default, build an empty instance. If arguments are passed, they must match the signature of
process()
that will be called.Note that codons with missing data are never considered, even if the resolution of all possibilities consistent translate to the same nucleotide. This applies to stop and start codons. Alternative start codons are not considered either.
Attributes
Number of analysed codon sites.
Number of codon sites with at least one codon stop.
Total number of considered codon sites.
Number of polymorphic coding sites with more than two alleles.
Number of polymorphic codons for which more than one position is changed.
Number of polymorphic codon sites with only non-synonymous variation.
Number of polymorphic codon sites with only synonymous variation.
Number of polymorphic coding sites with only one mutation.
Estimated number of non-synonymous sites.
Estimated number of synonymous sites.
List of positions of sites with only non-synonymous variation
List of positions of sites with only synonymous variation
List of sites with only non-synonymous variation
List of sites with only synonymous variation
Methods
process
(align[, code, struct, max_missing, ...])Process an alignment.
- property num_codons_eff¶
Number of analysed codon sites. Like
num_codons_tot
but excluding sites rejected because of missing data.
- property num_codons_stop¶
Number of codon sites with at least one codon stop.
- property num_codons_tot¶
Total number of considered codon sites. Only complete codons have been considered, but this value includes codons that have been rejected because of missing data.
- property num_multiple_alleles¶
Number of polymorphic coding sites with more than two alleles. These sites are included only if multiple_alleles is
True
except those who mix synonymous and non-synonymous changes (they can be rejected if there are more than two alleles in total as well).
- property num_multiple_hits¶
Number of polymorphic codons for which more than one position is changed. These sites are included only if multiple_hits is
True
and depdenting on the total number of alleles.
- property num_pol_NS¶
Number of polymorphic codon sites with only non-synonymous variation.
- property num_pol_S¶
Number of polymorphic codon sites with only synonymous variation.
- property num_pol_single¶
Number of polymorphic coding sites with only one mutation. All these sites are always included.
- property num_sites_NS¶
Estimated number of non-synonymous sites. Note that the total number of sites per codon is always 3.
The numbers of non-synonymous and synonymous sites are estimated using the method of Nei & Gojobori (Mol. Biol. Evol. 1986 3:418-426).
- property num_sites_S¶
Estimated number of synonymous sites. Note that the total number of sites per codon is always 3.
- property positions_NS¶
List of positions of sites with only non-synonymous variation
- property positions_S¶
List of positions of sites with only synonymous variation
- process(align, code=1, struct=None, max_missing=0.0, skipstop=True, raise_stop=False, multiple_alleles=False, multiple_hits=False)[source]¶
Process an alignment. It this instance already had data in memory, they will all be erased.
- Parameters:
align – an
Align
instance containing the coding sequence to process. The alphabet must be codons.code – genetic code identifier (see here). Required to be an integer among the valid values. The default value is the standard genetic code.
struct – a class:Structure object specifying samples to include and outgroup samples to skip for analysis (outgroup samples are ignored to determine if site is polymorphic, coding, or non-coding, but included in generated sites). By default, use all samples.
max_missing –
maximum relative proportion of missing data (per codon site) to allow (including stop codons if skipstop if
true
). By default, all codon sites with any missing data are excluded. Outgroup is not considered.Note
Here, max_missing is a relative proportion to allow using the same value for different alignments that might not have the same number of samples (to avoid reevaluating max_missing if the user wants the same maximum rate of missing data). In other functions, max_missing is the maximum number of missing and is an integer.
skipstop – if
True
, stop codons are treated as missing data and skipped. If so, potential mutations to stop codons are not taken into account when estimating the number of non-synonymous sites. Warning (this may be counter-intuitive): it actually assumes that stop codons are not biologically plausible and considers them as missing data. On the other hand, if skipstop isFalse
, it considers stop codons as if they were valid amino acids. This option has no effect if raise_stop isTrue
.raise_stop – raise a
ValueError
if a stop codon is met. IfTrue
, skipstop has no effect. Outgroup is not considered.multiple_alleles – include coding sites with more than two alleles (regardless of whether mutations hit the same position within the triplet). If there are more than two different codons, they must either encode for the same amino acid or all encode for different amino acids (otherwise the site is rejected). If more than one of the three codon position are mutated, the option multiple_hits is considered.
multiple_hits – include coding sites for which more than one of the three positions has changed (regardless of the number of alleles). If there are more than two alleles, the option multiple_alleles is also considered.
- property sites_NS¶
List of sites with only non-synonymous variation
- property sites_S¶
List of sites with only synonymous variation
- egglib.stats.paralog_pi(align, struct_p, struct_i, max_missing=0.0)[source]¶
Compute diversity statistics for a gene family. Based on Innan (Genetics 2003 163:803-810). An estimate of genetic diversity is provided for every paralog and for every pair of paralogs, provided that enough non-missing data is available (at least 2 samples are required). Note that sites with more than two alleles are always considered.
- Parameters:
align – an
Align
containing the sequence of all available paralog for all samples. The outgroup is ignored.struct_p – a
Structure
providing the organisation of sequences in paralogs. This structure must have a ploidy of 1 (no individual structure). Clusters, if defined, are ignored. The sequences of all individuals for a given paralog should be grouped together in that structure. There might be a different number of sequences per paralog due to missing data.struct_i – a
Structure
providing the organisation of sequences in individuals. This structure must have a ploidy of 1 (no individual structure). Clusters, if defined, are ignored. The sequences of all paralogs for a given individual should be grouped together in that structure. There might be a different number of sequences per individual due to missing data.max_missing –
maximum relative proportion of missing data (if there are more missing data at a site, the site is ignored altogether).
Note
Here, max_missing is a relative proportion to allow using the same value for different alignments that might not have the same number of samples (to avoid reevaluating max_missing if the user wants the same maximum rate of missing data). In other functions, max_missing is the maximum number of missing and is an integer.
- Returns:
A new
ParalogPi
instance which provides methods to access the number of used sites and the diversity for each paralog/paralog pair.
- class egglib.stats.ParalogPi[source]¶
Compute diversity statistics for a gene family. See
paralog_pi()
for more details. This class can be used directly (1) to analyse data with more efficiency (by reusing the same instance) or (2) to combine data from different alignments, or (3) for pass individual sites. Do first callsetup()
.Methods
Pib
(i, j)Between-paralog diversity for paralogs i and j.
Piw
(i)Within-paralog diversity for paralog i.
num_sites
([i[, j]])Number of sites with data.
process_align
(aln)Process an alignment.
process_site
(site)Process a site.
setup
(struct_p, struct_i[, max_missing])Specify the structure in paralog and individuals.
- num_sites([i[, j]])[source]¶
Number of sites with data. Number of sites: with any data (without arguments), with data for paralog i (if only i specified), with data for the pair of paralogs i and j (if both specified). This counts the number of sites which have not been excluded based on the num_missing argument, and which have at least one pair of samples for the considered sample.
- process_align(aln)[source]¶
Process an alignment. The alignment must match the structure passed to
setup()
. Diversity estimates are incremented (no reset).- Parameters:
aln – an
Align
instance.
- process_site(site)[source]¶
Process a site. The site must match the structure passed to
setup()
. Diversity estimates are incremented (no reset).- Parameters:
site – a
Site
instance.
- setup(struct_p, struct_i, max_missing=0.0)[source]¶
Specify the structure in paralog and individuals. The arguments are as as described for
paralog_pi()
. Only this method resets the instance.