Diversity statistics

The stats module provides a number of tools to compute diversity statistics from Site or Align instances. Some statistics are applicable to individual sites, some to sets of sites, and some to phased sequence alignments. Note that the objects may equally contain nucleotide sequences, protein sequences, microsatellite alleles (encoded as allele length or not), or any arbitrary representation of allelic diversity.

The alphabets define lists of alleles and their representation, but have no influence on which statistics can be computed. It is important to note that EggLib will compute any statistic you request from your data, even if it is meaningless. Pay special attention to statistics requiring phased data, since you can easily load unphased data into objects that can be used to compute those statistics.

In many cases, statistics that cannot be computed are returned as None, but only when they are technically not computable (due to missing data or the unavailability of a specific feature such as outgroup sequences or subpopulations).
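Because None marks a statistic that could not be computed, it is worth separating it from legitimate zero values when post-processing results. The following is only a minimal sketch: it assumes the results are available as a plain dictionary mapping statistic names to values, and both the helper name split_computed and the example keys and values are hypothetical placeholders, not EggLib output.

```python
def split_computed(stats):
    """Split a results dictionary into computed values and missing names.

    `stats` is assumed to map statistic names to values, with None
    standing for "not computable". The test is `is None` rather than
    truthiness, so that a statistic equal to 0 is kept as a valid value.
    """
    computed = {name: value for name, value in stats.items() if value is not None}
    missing = sorted(name for name, value in stats.items() if value is None)
    return computed, missing


# Placeholder values, for illustration only.
example = {"stat_a": 4, "stat_b": 0.0, "stat_c": None}
computed, missing = split_computed(example)
```

Here computed keeps stat_a and stat_b (including the zero), while missing lists only stat_c.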

In the sections of this chapter, we will present statistics available in the stats module. Statistics will be grouped by families (a family of statistics being a group of statistics that require the same type of data and the same kind of information). Most of the statistics are computed by stats.ComputeStats (see this tutorial section for an introduction) and the others by other functions available in the same module.

Outgroup

Some statistics can only be computed if an outgroup is available. The outgroup should be included in the analysed dataset (Site or Align instance) and identified by means of a Structure instance. There may be more than one outgroup sample. The outgroup information is used to identify the ancestral variant (that is, the one shared with the outgroup): if the outgroup has one of the alleles present in the main sample (the ingroup), this allele is considered ancestral. If there are several outgroup samples, all of them are expected to have the same allele (unless they are missing at this position). If the outgroup has an allele not found in the ingroup, or if the outgroup contains several alleles, the site is considered not orientable and is not used for statistics requiring an outgroup. Statistics not requiring an outgroup are still computed normally.
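The orientation rule described above can be sketched in plain Python. This is an illustration of the logic only, not EggLib's implementation: the function name ancestral_allele, the use of None for missing data, and the choice to treat a site whose outgroup samples are all missing as not orientable are assumptions made for this example.

```python
def ancestral_allele(ingroup, outgroup, missing=None):
    """Return the ancestral allele of one site, or None if not orientable.

    `ingroup` and `outgroup` are sequences of alleles for a single site;
    entries equal to `missing` are ignored.
    """
    ingroup_alleles = {a for a in ingroup if a != missing}
    outgroup_alleles = {a for a in outgroup if a != missing}

    # All non-missing outgroup samples must agree on a single allele.
    # (Assumption: an entirely missing outgroup also means "not orientable".)
    if len(outgroup_alleles) != 1:
        return None

    (allele,) = outgroup_alleles

    # The outgroup allele must be one of the alleles present in the ingroup.
    if allele not in ingroup_alleles:
        return None

    return allele


site = ["A", "A", "G", "G", "A"]
oriented = ancestral_allele(site, ["G", None])     # outgroup agrees on G
conflicting = ancestral_allele(site, ["G", "A"])   # two outgroup alleles
foreign = ancestral_allele(site, ["T"])            # allele absent from ingroup
```

In this sketch only the first call returns an allele; the other two sites would be skipped by statistics requiring an outgroup.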

Population structure

Many statistics require that several populations are present, some require that an individual structure is defined, and one statistic (FisctWC) requires clusters of populations in addition to populations and individuals. Like the outgroup, the structure of samples is described by Structure instances (see here for an introduction). If the appropriate level of structure is not defined in the Structure provided to the class or function computing statistics (or if no Structure is provided), the statistics concerned will be None.

Here is the list of families of statistics that are described in the following sections: