EggLib version 2.1 is archived.

Table Of Contents

Previous topic

Download and install

Next topic


This Page


This section provides primers on the usage of the Python module and of the executable commands. In the following, representative examples of the most common uses of Egglib will be addressed, assuming that the user aims at developing simple scripts (that is, without extensive error checking).

Import/export of sequence data

Egglib relies heavily on two classes: sequence sets (that are not considered as aligned), Container, and sequence alignments (for which all sequences must have the same lengths), Align. Many operations are available as methods of these two classes (including polymorphism analysis).


There is no sequence validity check at the level of storage classes. The parser only skips all white spaces and imports characters as upper cases, but it accepts all characters, and all characters are allowed when setting sequences within Python. Checking is performed at latter stages.

Import from FASTA-formatted files or strings

Egglib is based primarily on Pearson’s FASTA format for data input, due to its simplicity and the fact that it can be generated by many other software. In Python, the constructor of both Container and Align includes a parser that can take a file name. Creating a sequence set/alignment objects is then concise. Assuming the file data1.fas contain a sequence alignment formatted as follows:


The alignment can be imported as follows:

>>> import egglib
>>> align = egglib.Align("data1.fas")

The constructor of Align requires that all sequences have the same length. For non-aligned data (sequence sets), use the Container class instead. It is also possible to create an instance from a fasta-formatted string loaded in memory, as in the example below (this let you modify the string, if needed):

>>> f = open("data1.fas")
>>> fasta =
>>> f.close()
>>> align = egglib.Align(string=fasta)

Import population information as group label

Both Container and Align can import group labels, which are special tags present in the sequence names. This feature is disabled by default (because such tags are not defined in the FASTA format). Note that group labels are always present in Container and Align instances and that they are initialized to the default value 0 if the option described hereafter is not activated.

You can activate group label detection by using the flag option groups of the constructors. In such case, do avoid to use @ anywhere in sequence names. If names in your imported file or string contains labels such as @0, @1, @42 (@ followed by an integer, either at the end of the string or followed by a space), they will be recognized. Assume that data2.fas contains such labels, as reproduced below (sequences have been omitted):

>AY446410 @18
>AY446411@4 @6
>AY446413@7 @X
>AY446414@7 @0
>AY446415@42 POP42

Some labels have been deliberately placed in strange positions or repeated. Now you can import the data and check how the names have been processed (the iterator will be addressed in the next section):

>>> align = egglib.Align("data2.fas", groups=True)
>>> for name,sequence,group in align:
>>>     print 'name:',name, 'group:',group

results in:

name: AY446407 group: 0
name: AY446408 group: 0
name: AY446409 group: 1
name: AY446410  group: 18
name: AY446411 group: 46
name: AY446412 group: 0
name: AY446413 group: 7
name: AY446414 group: 1
name: AY446415POP42 group: 42
name: AY446416 group: 8

Note first that the names have been modified: the group index tags have been removed from the name and, in the case of malformed, repeated tags or tags not placed at the end of the sequence, it will not be possible to generate the exact same file anymore (the last sequence, for example, will be AY446416@8. Note also that the group labels are really labels (not all values need to be used, and the order is not relevant). Also, importantly, group labels are always integers.

  • AY446407: This has been interpreted as AY446407@0 (the value 0 is the default).
  • AY446408@0: The group label is 0.
  • AY446409@1: The group label is 1.
  • AY446410 @18: The group label is 18. Note that many values have been skipped. It will never be assumed that populations 2 through 17 exist. Note also that there is a space between the name itself and the tag, and that the space is still present in the sequence name: the name really is "AY446410 ". This is error-prone and should be avoided.
  • AY446411@4 @6: Two labels have been entered, and they have been concatenated to form the integer 46. This syntax is not recommended.
  • AY446412@X: The label is not integer: the data was lost.
  • AY446413@7 @X: One of the labels is not an integer, this was lost (avoid using repeated tags).
  • AY446414@7 @0: again a repeated tag. Note that, in the previous example, @X really was ignored (it was not changed into 0).
  • AY446415@42 POP42: Additional information is added after the group label. One the tag and the space after it were removed. This syntax is not recommended.
  • AY446416@8POP8: the space after the group label tag was omitted. Data has been lost.


When using groups=True when importing fasta data, always place group label tags at the end of the name without space. It will prevent lots of errors and allow consistent input/output operations (the input name can be generated again when exporting), which is the preferable behaviour.


The group label 999 (entered as @999 in input file) has a special meaning: it identifies an outgroup sequence. It is possible to enter several outgroup sequences.

Conversion to Container and Align types

The create() method can be called on either Container or Align classes to create corresponding instances. create() expects as argument an object of any iterable type, provided that the iterations return either (name, sequence) or (name, sequence, group) tuples. Note that both Container and Align are compatible. Therefore, create() can be used to perform a deep copy of sequence sets/alignments or convert one type into another:

>>> copy = egglib.Align.create(align)
>>> container = egglib.Container.create(align)

Convert from Container to Align is unlikely to be needed. Note that conversion from Container to Align is only possible when all sequences have the same length, which can be fixed either using alignment utilities or using the equalize() method.

It is also possible to use create() to create an instance from an ad hoc object (provided that iterations returns two- or three-item tuples:

>>> align = egglib.Align.create(sequences)

Remember that sequences must have the same length to be imported as an Align:

>>> container = egglib.Container.create(sequences)
>>> print 'Container created, length:', len(container)
>>> align = egglib.Align.create(sequences)
>>> print 'Align created, length:', len(align)

results in:

Container created, length: 2
Traceback (most recent call last):
    ... traceback omitted ...
ValueError: Sequence doesn't match the alignment length

Exports sequence data

Default format for sequence data is also Pearson’s FASTA format. FASTA strings are automatically generated using the print statement (the align object used here is the same as the last but one example):

>>> print align

results in:


This doesn’t allow to export group labels. To do so, use the str() method, which take a exportGroupLabels option (and is available on both Container and Align):

>>> string = align.str(exportGroupLabels=True)
>>> print string

results in:


To export data directly to a file, use the write() method (also available on both Container and Align). This method calls underlying C++ code and, on large datasets, is far more efficient than generating the full string and placing it to a file. The below example generates a file containing the string of the last example:

>>> align.write("data3.fas", True)

In particular in the case of Align instances, output in alternative formats can be performed using nexus(), phylip() and phyml() methods. Conversely, a few alternative alignment formats are available for input in the tools module (see the section “Specific data formats”).

Manipulation of sequence sets and alignments


Iteration is frequently useful, in particular for sequence data. When you iterate over a Container or a Align instance, each iteration step yields a SequenceItem object. You should never need to instanciate a SequenceItem yourself, and using it should be as simple as accessing it three attributes name, sequence and group. This class is designed to allow access/modify the instance without necessarily copy the sequence data. In the current implementation, the sequence data will still be extracted if sequence is accessed, however. Note that SequenceItem instances keep a record of their parent Container or Align instance and that, preferably, they should not be stored outside the loop.

Assume you have a FASTA file data4.fas with an alternative way of coding population membership (underscore instead of @):


Here is how to import these labels and modify the underlying instance:

>>> align = egglib.Align("data4.fas")
>>> for i in align:
...    root, label ='_')
...    label = int(label)
... = root
... = label
>>> align.write("data5.fas", True)

This generates the following file:


The syntax for name, sequence, group in align is supported but has two flaws: 1) it will cause a performance overhead in case sequences are long and/or numerous and that they don’t need to be accessed, and 2) this doesn’t allow to modify the underlying instances, since name, sequence and group variables are strings and integer, and therefore not modifiable.


When setting the sequence attribute of a SequenceItem, the new sequence must match the length of the alignment.

Sequence data edition

It is possible to access (read or modify) full data at a given position using subscript indexing ([]). The operator takes or returns a three-item tuple:

>>> print align[0]
>>> print align[0]
>>> print align[0]

results in:


Note that, if the group label is omitted, the value of 0 is assumed and replaces the previous value.

It is possible to use negative indices, as in align[-1]. The del statement is allowed for removing a full entry (name, sequence and group) as in del align[2]. To remove a sequence based on its name, use remove().

It is never allowed to access indices equal to or larger than the current length of the sequence set or alignment, including to add a new sequence (see below). The methods set() and get() allow to access a single character at a specific position. Note also that the overloaded methods name(), sequence() and group() allow to access/modify only one of these attribute directly at a given position. sequenceByName() allows to get a sequence string using the name and not the index.

It is possible to add a stretch a sequence to a sequence at a given position, on Container only, using the method appendSequence().

The method append() allows to add a new sequence at the end of the sequence set or alignment, and the method addSequences() allows to do it repeatively.

On a Align instance, the methods column() and removePosition() allow to (respectively) access and remove a full column.

Finally, the code below allow to extract the first three sequences of an alignment, and only the first 10 and the last 10 positions:

>>> part = align.slice(0, 3)
>>> part = part.extract(range(10)+range(-10,-1))
>>> print part

results in:


Manipulation of other data types (trees, annotated sequences, SSR)

Phylogenetic trees are implemented as the class Tree and GenBank flatfile data as the class GenBank. Both can also be instanciated from files and strings. Similarly to sequence sets and alignments, iteration over instances of these two classes yield ad hoc class instances allowing to manage, respectively, a node of the tree and a feature. Available methods on trees allow to edit the them (such as rooting) and explore them (such as identifying monophyletic groups), while methods on GenBank flatfile data allow to access and edit annotation information and extract data (such as coding sequence strings). The SSR class is designed to import, export and compute statistics on microsatellite data.

Analysis of polymorphism

Function to compute polymorphism are available as methods of the Align and SSR classes:

  • polymorphism(): compute standard statistics, neutrality indices, and differentiation statistics (if applicable).
  • polymorphismBPP(): compute statistics available through the Bio++ library (including statistics based on nonsynonymous/synonymous variation, if applicable). Bio++ must have been available when installing EggLib to allow using this function.
  • stats(): compute microsatellite statistics.
  • Fstats(): compute F-statistics (based on Weir and Cockerham formulas) for microsatellite.

Note that the first three methods return dictionaries and that, in the case of the first two methods, the content of these dictionaries will depend on both the value of options (some statistics can be skipped) and on the content of the instance (differentiation statistics are not computed if only one population is present). Please refer to the documentation of these methods for more details.

Coalescent simulations

The structure of the simul module (see its description) is designed as a compromise between flexibility (make available all features of the underlying C++ library), usability (not being too clumsy with a large number of options in a single function) and simplicity (reduce the number of required steps). The below example shows how to perform a given number of simulations of a moderately complex model (two populations that of which one is the result of a founding event from the first one, and a standard model of mutation with an infinite number of sites). Note that many more elaborate scenarios and mutation models are available.

>>> # sets parameters
>>> sample1 = 40   # number of sample in pop 1
>>> sample2 = 40   # number of sample in pop 2
>>> date1 = 0.20   # end of bottleneck
>>> date2 = 0.25   # split date/start of bottleneck
>>> strength = 0.1 # strength of the bottleneck
>>> migr = 0.0     # migration rate (default is 0.1)
>>> theta = 2.     # theta for gene
>>> nrepets = 1000 # number of simulations
>>> # sets the two parameter classes
>>> paramSet = egglib.simul.CoalesceParamSet([sample1, sample2], M=migr)
>>> paramSet.changeSinglePopulationSize(date1, 1, strength)
>>> paramSet.populationFusion(date2, 0, 1)
>>> mutator = egglib.simul.CoalesceFiniteAlleleMutator(theta=2.)
>>> # performs simulations
>>> aligns = egglib.simul.coalesce(paramSet, mutator, nrepets)
>>> # computes statistics
>>> pol_data = map(egglib.Align.polymorphism, aligns)
>>> D = [i['D'] for i in pol_data]
>>> Fst = [i['Fst'] for i in pol_data]
>>> print 'average D:', sum(D)/nrepets
>>> print 'average Fst:', sum(Fst)/nrepets

Note that is is not necessary to formally end the bottleneck by paramSet.changeSinglePopulationSize(date0, 1, 1.) since this population is removed at the same date.

Here’s the result of a particular run:

average D: -0.557003767734
average Fst: 0.373726230187

Interactive usage and Approximate Bayesian Computation

In this section we describe the command line tools available through EggLib. A number of short programs have been written under the form of subclasses of BaseCommand in the module utils. The egglib script (which is installed at the same tyme as the EggLib Python module), lets you run automatically these programs. In this section we present some generalities on the general usage of the egglib script, and then provide an example of usage with a tutorial for ABC analyses using such commands. Note that all commands presented below are system commands (entered through a command terminal) and not Python code.

The egglib script

The general usage of the egglib script is:

$ egglib <command> <options>

where command is the name of any of the available programs. An alternative syntax is available as python -m egglib.utils <command> <options>

By typing

$ egglib

only, a list of all commands will be displayed. By typing the command name only (without options), you will be provided with the manual page of this command. For example in the case of one of the simplest commands, infos:

$ egglib infos

Concerning options, they are variables among commands (the command’s manual page always provide the list and documentation). It is important to note that, even if the syntax is standardized over different commands, they really are different programs with different syntaxes (see however the quiet and debug boolean flags below. Options can be of two types:

  • Keyword arguments. They are keyword-identified values, entered in the syntax keyword=value where keyword is one of the available options and value the value you wish to enter. They are usually some restrictions on the format of the value and some options are flexible. In some cases (especially for values that contain characters that can be mistaken with command interpreter modifiers such a spaces and semicolons) it can be necessary to wrap the option in quotes. Some keyword arguments have a default value, and some have not. The latter must be entered to run the command. The input option is found in most commands, and frequently expects a file name as in input=file (allowing sometimes to enter multiple file names as in input=file1,file2,file3), but in a few cases it expects the name of a directory containing files to process.

  • Boolean flags. They are entered as a single token and they activate a specified feature. They are never required and are always off by default (but note that some flags have a negative effect, such a no_check in command cprimers). Two flags are available automatically for all commands:

    • quiet: If this flag is activated, the command will run without console output. Note that the flag has not effect on commands that don’t generate console output anyway, as well as commands whose only task is to generate console output (this is the case for the infos command, for example).
    • debug: By default, all errors are intercepted by the egglib script and summarized in a one-line error statement. The flag debug allows to output the full error traceback (this is useful for debugging and reporting errors).

Options (keyword arguments as well as boolean flags) can be entered in any order, but cannot be repeated. It is always required to enter at least one keyword argument (sometimes, several must be entered).

Here is the end of the example with the infos command. Here is how to get information on the data4.fas:

$ egglib infos input=data4.fas
 ... 6 sequences
 ... alignment
 ... length: 38

And how to get information on several files (data2.txt contains names but no sequences):

$ egglib infos input=data1.fas,data2.fas,data3.fas,data4.fas
 ... 10 sequences
 ... alignment
 ... length: 160
 ... 10 sequences
 ... alignment
 ... length: 0
 ... 6 sequences
 ... alignment
 ... length: 38
 ... 6 sequences
 ... alignment
 ... length: 38

The default error statement (the file data6.fas doesn’t exist):

$ egglib infos input=data6.fas
An error occurred: [ValueError] cannot open data6.fas

And how to get more information:

$ egglib infos input=data6.fas debug
An error occurred: [ValueError] cannot open data6.fas
Traceback (most recent call last):
  File "/usr/bin/egglib", line 67, in <module>*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/egglib/", line 209, in run
    self._run(*fargs, **kwargs)
  File "/usr/lib64/python2.7/site-packages/egglib/", line 3386, in _run
    container = data.Container(fname)
  File "/usr/lib64/python2.7/site-packages/egglib/", line 767, in __init__
    if not os.path.isfile(fname): raise ValueError, 'cannot open %s' %fname
ValueError: cannot open data6.fas

(Of course in that case the source of the error can be identified with the message alone. In that case, the exception has been thrown by the Align constructor when the file was not found.)

ABC sampling

In this example we will consider a simulated dataset of 10 independent alignments of 1000 nucleotides, simulated under a model containing 2 isolated populations of equal size and connected with a bidirectional migration rate (4Nm) of 0.5, with a mutation rate (4Nu) of 3.5 for gene (that is 0.0035 per site). 20 sequences were sampled in each population. We will see if it is possible to reject the standard neutral model (SNM) in favour of the (real) island model with, in that case, two demes (IM) and if it is possible to correctly estimate the parameters.

We will first sample random parameters in uniform priors under these two alternative models (SNM and IM) and perform one simulation for each sample (one simulation contains the 10 loci). The command to do so is:

egglib abc_sample dir=fas params=SNM.txt data=SNM.out model=SNM prior="%U(0;0.01)" stats=SFS:4

The following arguments were entered:

  • dir: The name of a directory containing fasta files to analyze.
  • params: The name of the run parameter output file.
  • data: The name of data output file.
  • model: The label of the demographic model, here the Standard Neutral Model.
  • prior: The definition of the prior. Here, the prior is defined as a string (starting with a % character to indicate that the argument is not a file name), and specifies a uniform distribution bound between 0 and 0.01 (the SNM model has only one parameter).
  • stats: The set of summary statistics, here the Site Frequency Spectrum. This particular set of summary statistics expects an argument (the number of classes in the spectrum), which is passed after the : character.

The following parameters were left to the default values:

  • ext (fas): This argument allows to set the extension of fasta files to process.
  • post (10000): The number of point to sample for posterior estimation.
  • step (100): The rate of refreshing of the progress information.
  • seeds (automatic): Random number generator seeds.
  • restart: This option should be used to resume an interrupted. It takes a run parameter file (the file named by the params option) and will attempt to finish the run. No other options should be entered along with restart.

ABC rejection-regression procedure

The abc_fit command performs the steps of the standard rejection-regression procedure described in Beaumont et al. 2002 Genetics using the underlying C++ class ABC:

egglib abc_fit input=SNM.txt tolerance=0.1 transform=tan

Here we filled all arguments. The input file is SNM.txt which actually contains only the values of the run parameters. The actual data file is given as one of the parameters and must be found in the same directory. The tolerance gives the proportion of points to select in the region closest to the observed statistics. This defines the local region and, in practice, should be more stringent than in this example.

The fitting procedure can be applied to data from a unique model only.

The transformation of parameter values (here, tangential transformation) is useful in particular to prevent parameters from exceeding prior bounds. The output file will contain only parameter values, selected and corrected, and will be considered as a posterior distribution.

Comparing ABC models

The rejection-based method of Fagundes et al. 2007 PNAS allows to compare alternative model without relying on Bayes Factors. It is based on the rejection step of the rejection-regression procedure, which doesn’t use the parameter values and therefore allows to merge several models. The model probabilities are estimated as the proportion of points originating from each model in a pre-defined number of accepted points:

egglib abc_compare input=SNM.txt,PEM.txt tolerance=0.1 output=compare.txt

The different sample files are passed in the input argument. The tolerance argument is the same as for the fitting procedure, but note that here it applies to the sum of points over all the passed models. The output file will contain the accepted points, along with their weigths. The model probabilities will be displayed in the standard output (the terminal).

Analyzing posterior distribution

A set of commands are designed to help analyzing posteriors, after they have been fitted.

  • abc_bin discretizes posteriors into a regular grid over all parameter dimensions.
  • abc_statsmarg provide statistics for individual parameters independently, from a fitted posterior.
  • abc_statsdisc (for models with more than one parameter), analyzes a discretized posterior (and, in particular, identifies the maximum density point of the distribution).
  • abc_plot1D and abc_plot2D display graphical representations parameters (respectively, a single parameter and the covariation of two parameters). They require that matplotlib is available and they use discretized posteriors.
  • abc_psimuls, finally, allows to use posteriors to generate new simulations for testing neutrality of loci or assess goodness-of-fit of models.
Hosted by  Get seqlib at Fast, secure and Free Open Source software downloads