.. _manual:

======
Manual
======

This section provides primers on the usage of the Python module and of
the executable commands. In the following, representative examples of
the most common uses of Egglib will be addressed, assuming that the user
aims at developing simple scripts (that is, without extensive error
checking).

Import/export of sequence data
==============================

Egglib relies heavily on two classes: sequence sets (that are not
considered as aligned), :class:`~egglib.Container`, and sequence
alignments (for which all sequences must have the same lengths),
:class:`~egglib.Align`. Many operations are available as methods of these
two classes (including polymorphism analysis).

.. note:: There is no sequence validity check at the level of storage
          classes. The parser only skips all white spaces and imports
          characters as upper cases, but it accepts all characters, and
          all characters are allowed when setting sequences within
          Python. Checking is performed at latter stages.

--------------------------------------------
Import from FASTA-formatted files or strings
--------------------------------------------

Egglib is based primarily on Pearson's FASTA format for data input, due
to its simplicity and the fact that it can be generated by many other
software. In Python, the constructor of both :class:`~egglib.Container`
and :class:`~egglib.Align` includes a parser that can take a file name.
Creating a sequence set/alignment objects is then concise. Assuming the
file ``data1.fas`` contain a sequence alignment formatted as follows:

::

    >AY446407
    GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA
    GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA--
    --------CACACACACGTGGCACTTTCTTCCTGCGGCAC
    >AY446408
    GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA
    GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA--
    --------CACACACACGTGGCACTTTCTTCCTGCGGCAC
    >AY446409
    GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA
    GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA--
    --------CACACACACGTGGCACTTTCTTCCTGCGGCAC
    >AY446410
    GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA
    GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA--
    --------CACACACACGTGGCACTTTCTTCCTGCGGCAC
    >AY446411
    GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA
    GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA--
    --------CACACACACGTGGCACTTTCTTCCTGCGGCAC
    >AY446412
    GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA
    GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA--
    ----CACACACACACACGTGGCACTTTCTTCCTGCGGCAC
    >AY446413
    GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA
    GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA--
    ----CACACACACACACGTGGCACTTTCTTCCTGCGGCAC
    >AY446414
    GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA
    GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA--
    ----CACACACACACACGTGGCACTTTCTTCCTGCGGCAC
    >AY446415
    GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA
    GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA--
    ----CACACACACACACGTGGCACTTTCTTCCTGCGGCAC
    >AY446416
    GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA
    GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACAC-
    -ACACACACACACACACGTGGCACTTTCTTCCTGCGGCAC

The alignment can be imported as follows:

    >>> import egglib
    >>> align = egglib.Align("data1.fas")
    
The constructor of :class:`~egglib.Align` requires that all
sequences have the same length. For non-aligned data (sequence sets),
use the :class:`~egglib.Container` class instead. It is also possible
to create an instance from a fasta-formatted string loaded in memory,
as in the example below (this let you modify the string, if needed):

    >>> f = open("data1.fas")
    >>> fasta = f.read()
    >>> f.close()
    >>> align = egglib.Align(string=fasta)

--------------------------------------------
Import population information as group label
--------------------------------------------

Both :class:`~egglib.Container` and :class:`~egglib.Align` can import
*group labels*, which are special tags present in the sequence names.
This feature is disabled by default (because such tags are not defined
in the FASTA format). Note that group labels are always present in
:class:`~egglib.Container` and :class:`~egglib.Align` instances and that
they are initialized to the default value ``0`` if the option described
hereafter is not activated.

You can activate group label detection by using the flag option *groups*
of the constructors. In such case, do avoid to use ``@`` anywhere in
sequence names. If names in your imported file or string contains labels
such as ``@0``, ``@1``, ``@42`` (``@`` followed by an integer, either at
the end of the string or followed by a space), they will be recognized.
Assume that ``data2.fas`` contains such labels, as reproduced below
(sequences have been omitted):

::

    >AY446407
    >AY446408@0
    >AY446409@1
    >AY446410 @18
    >AY446411@4 @6
    >AY446412@X
    >AY446413@7 @X
    >AY446414@7 @0
    >AY446415@42 POP42
    >AY446416@8POP8

Some labels have been deliberately placed in strange positions or
repeated. Now you can import the data and check how the names have been
processed (the iterator will be addressed in the next section):

>>> align = egglib.Align("data2.fas", groups=True)
>>> for name,sequence,group in align:
>>>     print 'name:',name, 'group:',group
...

results in:

::

    name: AY446407 group: 0
    name: AY446408 group: 0
    name: AY446409 group: 1
    name: AY446410  group: 18
    name: AY446411 group: 46
    name: AY446412 group: 0
    name: AY446413 group: 7
    name: AY446414 group: 1
    name: AY446415POP42 group: 42
    name: AY446416 group: 8

Note first that the names have been modified: the group index tags have
been removed from the name and, in the case of malformed, repeated tags
or tags not placed at the end of the sequence, it will not be possible
to generate the exact same file anymore (the last sequence, for example,
will be ``AY446416@8``. Note also that the group labels are really
labels (not all values need to be used, and the order is not relevant).
Also, importantly, group labels are always integers.

    * ``AY446407``: This has been interpreted as ``AY446407@0`` (the
      value ``0`` is the default).
    * ``AY446408@0``: The group label is ``0``.
    * ``AY446409@1``: The group label is ``1``.
    * ``AY446410 @18``: The group label is ``18``. Note that many values
      have been skipped. It will never be assumed that populations 2
      through 17 exist. Note also that there is a space between the
      name itself and the tag, and that the space is still present in
      the sequence name: the name really is ``"AY446410 "``. **This is
      error-prone and should be avoided.**
    * ``AY446411@4 @6``: Two labels have been entered, and they have
      been concatenated to form the integer ``46``. This syntax is not
      recommended.
    * ``AY446412@X``: The label is not integer: the data was lost.
    * ``AY446413@7 @X``: One of the labels is not an integer, this was
      lost (avoid using repeated tags).
    * ``AY446414@7 @0``: again a repeated tag. Note that, in the
      previous example, ``@X`` really was ignored (it was not changed
      into ``0``).
    * ``AY446415@42 POP42``: Additional information is added after the
      group label. One the tag and the space after it were removed.
      This syntax is not recommended.
    * ``AY446416@8POP8``: the space after the group label tag was
      omitted. Data has been lost.

.. note:: When using ``groups=True`` when importing fasta data, always
          place group label tags at the end of the name without space.
          It will prevent lots of errors and allow consistent
          input/output operations (the input name can be generated again
          when exporting), which is the preferable behaviour.

.. note:: The group label ``999`` (entered as ``@999`` in input file)
          has a special meaning: it identifies an outgroup sequence. It
          is possible to enter several outgroup sequences.


---------------------------------------
Conversion to Container and Align types
---------------------------------------

The :meth:`create` method can be called on either
:class:`~egglib.Container` or :class:`~egglib.Align` classes to create
corresponding instances. :meth:`create` expects as argument an object of
any iterable type, provided that the iterations return either
``(name, sequence)`` or ``(name, sequence, group)`` tuples. Note that
both :class:`~egglib.Container` and :class:`~egglib.Align` are
compatible. Therefore, :meth:`create` can be used to perform a deep
copy of sequence sets/alignments or convert one type into another:

>>> copy = egglib.Align.create(align)
>>> container = egglib.Container.create(align)

Convert from :class:`~egglib.Container` to :class:`~egglib.Align` is
unlikely to be needed. Note that conversion from :class:`~egglib.Container`
to :class:`~egglib.Align` is only possible when all sequences have the 
same length, which can be fixed either using alignment utilities or
using the :meth:`~egglib.Container.equalize` method.

It is also possible to use :meth:`create` to create an instance from
an ad hoc object (provided that iterations returns two- or three-item
tuples:

>>> sequences = [ ('name1', 'GAGCGTGCCGCGAGAGCGTTGCCAAGAGTGCCCGTGAT', 0),
...               ('name2', 'GAGCGTGCCGCGAGAGCGTTGCCAAGATTGCGCGTGAT', 0),
...               ('name3', 'GAGCGTGTCGCGAGAGCGTTGCCAAGAGTGCGCGTGAT', 0),
...               ('name4', 'GAGCGTGTCGCGCGCGCGTTGCCAACAGTGCCCGCGAT', 1),
...               ('name5', 'GAGCGTGTCGCGCGCGCGTTGCCAAGAGTGCCCGCGAT', 1),
...               ('name6', 'GAGCGTGTCGCGCGCGCGATGCCAAGAGTGCCCGCGAT', 1) ]
>>> align = egglib.Align.create(sequences)

Remember that sequences must have the same length to be imported as an
:class:`~egglib.Align`:

>>> sequences = [ ('name1', 'GAGCGTGCCGCGAGAGCGTTGCCAAGAGTGCCCG'),
...               ('name2', 'GAGCGTGCCGCGAGAGCGTTGCCAAGATTGCGCGTGAT') ]
>>> container = egglib.Container.create(sequences)
>>> print 'Container created, length:', len(container)
>>> align = egglib.Align.create(sequences)
>>> print 'Align created, length:', len(align)

results in:

::

    Container created, length: 2
    Traceback (most recent call last):
        ... traceback omitted ...
    ValueError: Sequence doesn't match the alignment length


---------------------
Exports sequence data
---------------------

Default format for sequence data is also Pearson's FASTA format. FASTA
strings are automatically generated using the ``print`` statement (the
``align`` object used here is the same as the last but one example):

>>> print align

results in:

::

    >name1
    GAGCGTGCCGCGAGAGCGTTGCCAAGAGTGCCCGTGAT
    >name2
    GAGCGTGCCGCGAGAGCGTTGCCAAGATTGCGCGTGAT
    >name3
    GAGCGTGTCGCGAGAGCGTTGCCAAGAGTGCGCGTGAT
    >name4
    GAGCGTGTCGCGCGCGCGTTGCCAACAGTGCCCGCGAT
    >name5
    GAGCGTGTCGCGCGCGCGTTGCCAAGAGTGCCCGCGAT
    >name6
    GAGCGTGTCGCGCGCGCGATGCCAAGAGTGCCCGCGAT

This doesn't allow to export group labels. To do so, use the
:meth:`str` method, which take a ``exportGroupLabels`` option (and is
available on both :class:`~egglib.Container` and :class:`~egglib.Align`):

>>> string = align.str(exportGroupLabels=True)
>>> print string

results in:

::

    >name1@0
    GAGCGTGCCGCGAGAGCGTTGCCAAGAGTGCCCGTGAT
    >name2@0
    GAGCGTGCCGCGAGAGCGTTGCCAAGATTGCGCGTGAT
    >name3@0
    GAGCGTGTCGCGAGAGCGTTGCCAAGAGTGCGCGTGAT
    >name4@1
    GAGCGTGTCGCGCGCGCGTTGCCAACAGTGCCCGCGAT
    >name5@1
    GAGCGTGTCGCGCGCGCGTTGCCAAGAGTGCCCGCGAT
    >name6@1
    GAGCGTGTCGCGCGCGCGATGCCAAGAGTGCCCGCGAT

To export data directly to a file, use the :meth:`write` method (also
available on both :class:`~egglib.Container` and :class:`~egglib.Align`).
This method calls underlying C++ code and, on large datasets, is far
more efficient than generating the full string and placing it to a file.
The below example generates a file containing the string of the last
example:

>>> align.write("data3.fas", True)

In particular in the case of :class:`~egglib.Align` instances, output
in alternative formats can be performed using :meth:`~egglib.Align.nexus`,
:meth:`~egglib.Align.phylip` and :meth:`~egglib.Align.phyml` methods.
Conversely, a few alternative alignment formats are available for input
in the :mod:`~egglib.tools` module (see the section "Specific data
formats").

Manipulation of sequence sets and alignments
============================================

---------
Iteration
---------

Iteration is frequently useful, in particular for sequence data. When
you iterate over a :class:`~egglib.Container` or a :class:`~egglib.Align`
instance, each iteration step yields a :class:`~egglib.SequenceItem`
object. You should never need to instanciate a :class:`~egglib.SequenceItem`
yourself, and using it should be as simple as accessing it three
attributes :attr:`name`, :attr:`sequence` and :attr:`group`. This class
is designed to allow access/modify the instance without necessarily
copy the sequence data. In the current implementation, the sequence data
will still be extracted if :attr:`~egglib.SequenceItem.sequence` is
accessed, however. Note that :class:`~egglib.SequenceItem` instances
keep a record of their parent :class:`~egglib.Container` or
:class:`~egglib.Align` instance and that, preferably, they should not be
stored outside the loop.

Assume you have a FASTA file ``data4.fas`` with an alternative way of
coding population membership (underscore instead of ``@``):

::

    >name1_1
    GAGCGTGCCGCGAGAGCGTTGCCAAGAGTGCCCGTGAT
    >name2_1
    GAGCGTGCCGCGAGAGCGTTGCCAAGATTGCGCGTGAT
    >name3_1
    GAGCGTGTCGCGAGAGCGTTGCCAAGAGTGCGCGTGAT
    >name4_2
    GAGCGTGTCGCGCGCGCGTTGCCAACAGTGCCCGCGAT
    >name5_2
    GAGCGTGTCGCGCGCGCGTTGCCAAGAGTGCCCGCGAT
    >name6_2
    GAGCGTGTCGCGCGCGCGATGCCAAGAGTGCCCGCGAT

Here is how to import these labels and modify the underlying instance:

>>> align = egglib.Align("data4.fas")
>>> for i in align:
...    root, label = i.name.split('_')
...    label = int(label)
...    i.name = root
...    i.group = label
>>> align.write("data5.fas", True)

This generates the following file:

::

    >name1@1
    GAGCGTGCCGCGAGAGCGTTGCCAAGAGTGCCCGTGAT
    >name2@1
    GAGCGTGCCGCGAGAGCGTTGCCAAGATTGCGCGTGAT
    >name3@1
    GAGCGTGTCGCGAGAGCGTTGCCAAGAGTGCGCGTGAT
    >name4@2
    GAGCGTGTCGCGCGCGCGTTGCCAACAGTGCCCGCGAT
    >name5@2
    GAGCGTGTCGCGCGCGCGTTGCCAAGAGTGCCCGCGAT
    >name6@2

The syntax ``for name, sequence, group in align`` is supported but has
two flaws: 1) it will cause a performance overhead in case sequences are
long and/or numerous and that they don't need to be accessed, and 2)
this doesn't allow to modify the underlying instances, since ``name``,
``sequence`` and ``group`` variables are strings and integer, and
therefore not modifiable.

.. note:: When setting the :attr:`~egglib.SequenceItem.sequence`
          attribute of a :class:`~egglib.SequenceItem`, the new sequence
          must match the length of the alignment.

---------------------
Sequence data edition
---------------------

It is possible to access (read or modify) full data at a given position
using subscript indexing (``[]``). The operator takes or returns a
three-item tuple:

>>> print align[0]
>>> align[0] = 'first name', 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA', 8
>>> print align[0]
>>> align[0] = 'first name', 'TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT'
>>> print align[0]

results in:

::

    ('name1', 'GAGCGTGCCGCGAGAGCGTTGCCAAGAGTGCCCGTGAT', 1)
    ('first name', 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA', 8)
    ('first name', 'TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT', 0)

Note that, if the group label is omitted, the value of ``0`` is assumed
and replaces the previous value.

It is possible to use negative indices, as in ``align[-1]``. The ``del``
statement is allowed for removing a full entry (name, sequence and group)
as in ``del align[2]``. To remove a sequence based on its name, use
:meth:`~egglib.Align.remove`.

It is never allowed to access indices equal to or larger than the
current length of the sequence set or alignment, including to add a
new sequence (see below). The methods :meth:`~egglib.Align.set` and
:meth:`~egglib.Align.get` allow
to access a single character at a specific position. Note also that
the overloaded methods :meth:`~egglib.Align.name`,
:meth:`~egglib.Align.sequence` and :meth:`~egglib.Align.group`
allow to access/modify only one of these attribute directly at a given
position. :meth:`~egglib.Align.sequenceByName` allows to get a sequence
string using the name and not the index.

It is possible to add a stretch a sequence to a sequence at a given
position, on :class:`~egglib.Container` only, using the method
:meth:`~egglib.Container.appendSequence`.

The method :meth:`~egglib.Align.append` allows to add a new sequence at
the end of the sequence set or alignment, and the method
:meth:`~egglib.Align.addSequences` allows to do it repeatively.

On a :class:`~egglib.Align` instance, the methods
:meth:`~egglib.Align.column` and :meth:`~egglib.Align.removePosition`
allow to (respectively) access and remove a full column.

Finally, the code below allow to extract the first three sequences of
an alignment, and only the first 10 and the last 10 positions:

>>> part = align.slice(0, 3)
>>> part = part.extract(range(10)+range(-10,-1))
>>> print part

results in:

::

    >name1
    GAGCGTGCCGTGCCCGTGA
    >name2
    GAGCGTGCCGTGCGCGTGA
    >name3
    GAGCGTGTCGTGCGCGTGA


Manipulation of other data types (trees, annotated sequences, SSR)
==================================================================

Phylogenetic trees are implemented as the class :class:`~egglib.Tree`
and GenBank flatfile data as the class :class:`~egglib.GenBank`. Both
can also be instanciated from files and strings. Similarly to sequence
sets and alignments, iteration over instances of these two classes
yield ad hoc class instances allowing to manage, respectively, a node of
the tree and a feature. Available methods on trees allow to edit the
them (such as rooting) and explore them (such as identifying monophyletic
groups), while methods on GenBank flatfile data allow to access and
edit annotation information and extract data (such as coding sequence
strings). The :class:`~egglib.SSR` class is designed to import, export
and compute statistics on microsatellite data.

Analysis of polymorphism
========================

Function to compute polymorphism are available as methods of the
:class:`~egglib.Align` and :class:`~egglib.SSR` classes:

    * :meth:`~egglib.Align.polymorphism`: compute standard statistics,
      neutrality indices, and differentiation statistics (if applicable).
      
    * :meth:`~egglib.Align.polymorphismBPP`: compute statistics
      available through the Bio++ library (including statistics based on
      nonsynonymous/synonymous variation, if applicable). Bio++ must
      have been available when installing EggLib to allow using this
      function.

    * :meth:`~egglib.SSR.stats`: compute microsatellite statistics.

    * :meth:`~egglib.SSR.Fstats`: compute F-statistics (based on Weir
      and Cockerham formulas) for microsatellite.
      
    Note that the first three methods return dictionaries and that, in
    the case of the first two methods, the content of these dictionaries
    will depend on both the value of options (some statistics can be
    skipped) and on the content of the instance (differentiation
    statistics are not computed if only one population is present).
    Please refer to the documentation of these methods for more details.

Coalescent simulations
======================

The structure of the :mod:`~egglib.simul` module (see its description)
is designed as a compromise between flexibility (make available all
features of the underlying C++ library), usability (not being too clumsy
with a large number of options in a single function) and simplicity
(reduce the number of required steps). The below example shows how to
perform a given number of simulations of a moderately complex model
(two populations that of which one is the result of a founding event
from the first one, and a standard model of mutation with an infinite
number of sites). Note that many more elaborate scenarios and mutation
models are available.


>>> # sets parameters
>>> sample1 = 40   # number of sample in pop 1
>>> sample2 = 40   # number of sample in pop 2
>>> date1 = 0.20   # end of bottleneck
>>> date2 = 0.25   # split date/start of bottleneck
>>> strength = 0.1 # strength of the bottleneck
>>> migr = 0.0     # migration rate (default is 0.1)
>>> theta = 2.     # theta for gene
>>> nrepets = 1000 # number of simulations
>>> 
>>> # sets the two parameter classes
>>> paramSet = egglib.simul.CoalesceParamSet([sample1, sample2], M=migr)
>>> paramSet.changeSinglePopulationSize(date1, 1, strength)
>>> paramSet.populationFusion(date2, 0, 1)
>>> mutator = egglib.simul.CoalesceFiniteAlleleMutator(theta=2.)
>>> 
>>> # performs simulations
>>> aligns = egglib.simul.coalesce(paramSet, mutator, nrepets)
>>> 
>>> # computes statistics
>>> pol_data = map(egglib.Align.polymorphism, aligns)
>>> D = [i['D'] for i in pol_data]
>>> Fst = [i['Fst'] for i in pol_data]
>>> print 'average D:', sum(D)/nrepets
>>> print 'average Fst:', sum(Fst)/nrepets

Note that is is not necessary to formally end the bottleneck by
``paramSet.changeSinglePopulationSize(date0, 1, 1.)`` since this
population is removed at the same date.

Here's the result of a particular run:

::

    average D: -0.557003767734
    average Fst: 0.373726230187


Interactive usage and Approximate Bayesian Computation
======================================================

In this section we describe the command line tools available through
EggLib. A number of short programs have been written under the form
of subclasses of :class:`~egglib.utils.BaseCommand` in the module
:mod:`~egglib.utils`. The ``egglib`` script (which is installed at the
same tyme as the EggLib Python module), lets you run automatically these
programs. In this section we present some generalities on the general
usage of the ``egglib`` script, and then provide an example of usage
with a tutorial for ABC analyses using such commands. Note that all
commands presented below are system commands (entered through a command
terminal) and not Python code.

-----------------
The egglib script
-----------------

The general usage of the ``egglib`` script is:

::

    $ egglib <command> <options>

where *command* is the name of any of the available programs. An
alternative syntax is available as ``python -m egglib.utils <command> <options>``

By typing

::

    $ egglib
    
only, a list of all commands will be displayed. By typing the command
name only (without options), you will be provided with the manual page
of this command. For example in the case of one of the simplest commands,
*infos*::

    $ egglib infos

Concerning *options*, they are variables among commands (the command's
manual page always provide the list and documentation). It is important
to note that, even if the syntax is standardized over different commands,
they really are different programs with different syntaxes (see however
the *quiet* and *debug* boolean flags below. Options can be of two types:

    * **Keyword arguments**. They are keyword-identified values, entered in
      the syntax ``keyword=value`` where *keyword* is one of the
      available options and *value* the value you wish to enter. They
      are usually some restrictions on the format of the value and some
      options are flexible. In some cases (especially for values that
      contain characters that can be mistaken with command interpreter
      modifiers such a spaces and semicolons) it can be necessary to
      wrap the option in quotes. Some keyword arguments have a default
      value, and some have not. The latter must be entered to run the
      command. The *input* option is found in most commands, and
      frequently expects a file name as in ``input=file`` (allowing
      sometimes to enter multiple file names as in
      ``input=file1,file2,file3``), but in a few cases it expects the
      name of a directory containing files to process.
      
    * **Boolean flags**. They are entered as a single token and they
      activate a specified feature. They are never required and are
      always off by default (but note that some flags have a negative
      effect, such a ``no_check`` in command ``cprimers``). Two flags
      are available automatically for all commands:
      
        * ``quiet``: If this flag is activated, the command will run
          without console output. Note that the flag has not effect on
          commands that don't generate console output anyway, as well as
          commands whose only task is to generate console output (this
          is the case for the ``infos`` command, for example).
        
        * ``debug``: By default, all errors are intercepted by the
          *egglib* script and summarized in a one-line error statement.
          The flag ``debug`` allows to output the full error traceback
          (this is useful for debugging and reporting errors).
      
Options (keyword arguments as well as boolean flags) can be entered in
any order, but cannot be repeated. It is always required to enter
at least one keyword argument (sometimes, several must be entered).

Here is the end of the example with the ``infos`` command. Here is how
to get information on the ``data4.fas``::

    $ egglib infos input=data4.fas 
    data4.fas
     ... 6 sequences
     ... alignment
     ... length: 38

And how to get information on several files (``data2.txt`` contains
names but no sequences)::

    $ egglib infos input=data1.fas,data2.fas,data3.fas,data4.fas 
    data1.fas
     ... 10 sequences
     ... alignment
     ... length: 160
    data2.fas
     ... 10 sequences
     ... alignment
     ... length: 0
    data3.fas
     ... 6 sequences
     ... alignment
     ... length: 38
    data4.fas
     ... 6 sequences
     ... alignment
     ... length: 38

The default error statement (the file ``data6.fas`` doesn't exist)::

    $ egglib infos input=data6.fas
    An error occurred: [ValueError] cannot open data6.fas

And how to get more information::

    $ egglib infos input=data6.fas debug
    An error occurred: [ValueError] cannot open data6.fas
    Traceback (most recent call last):
      File "/usr/bin/egglib", line 67, in <module>
        obj.run(*args, **kwargs)
      File "/usr/lib64/python2.7/site-packages/egglib/utils.py", line 209, in run
        self._run(*fargs, **kwargs)
      File "/usr/lib64/python2.7/site-packages/egglib/utils.py", line 3386, in _run
        container = data.Container(fname)
      File "/usr/lib64/python2.7/site-packages/egglib/data.py", line 767, in __init__
        if not os.path.isfile(fname): raise ValueError, 'cannot open %s' %fname
    ValueError: cannot open data6.fas
    
(Of course in that case the source of the error can be identified with
the message alone. In that case, the exception has been thrown by the
:class:`~egglib.Align` constructor when the file was not found.)


------------
ABC sampling
------------

In this example we will consider a simulated dataset of 10 independent
alignments of 1000 nucleotides, simulated under a model containing
2 isolated populations of equal size and connected with a bidirectional
migration rate (4Nm) of 0.5, with a mutation rate (4Nu) of 3.5 for gene
(that is 0.0035 per site). 20 sequences were sampled in each population.
We will see if it is possible to reject the standard neutral model (SNM)
in favour of the (real) island model with, in that case, two demes (IM)
and if it is possible to correctly estimate the parameters.

We will first sample random parameters in uniform priors under these two
alternative models (SNM and IM) and perform one simulation for each
sample (one simulation contains the 10 loci). The command to do so is::

    egglib abc_sample dir=fas params=SNM.txt data=SNM.out model=SNM prior="%U(0;0.01)" stats=SFS:4

The following arguments were entered:

    * *dir*: The name of a directory containing fasta files to analyze.
    * *params*: The name of the run parameter output file.
    * *data*: The name of data output file.
    * *model*: The label of the demographic model, here the Standard
      Neutral Model.
    * *prior*: The definition of the prior. Here, the prior is defined
      as a string (starting with a ``%`` character to indicate that the
      argument is not a file name), and specifies a uniform distribution
      bound between 0 and 0.01 (the SNM model has only one parameter).
    * *stats*: The set of summary statistics, here the Site Frequency
      Spectrum. This particular set of summary statistics expects an
      argument (the number of classes in the spectrum), which is passed
      after the ``:`` character.

The following parameters were left to the default values:

    * *ext* (``fas``): This argument allows to set the extension of
      fasta files to process.
    * *post* (10000): The number of point to sample for posterior
      estimation.
    * *step* (100): The rate of refreshing of the progress information.
    * *seeds* (automatic): Random number generator seeds.
    * *restart*: This option should be used to resume an interrupted. It
      takes a run parameter file (the file named by the *params* option)
      and will attempt to finish the run. No other options should be
      entered along with *restart*.

----------------------------------
ABC rejection-regression procedure
----------------------------------

The ``abc_fit`` command performs the steps of the standard
rejection-regression procedure described in Beaumont et al. 2002 Genetics
using the underlying C++ class :class:`~egglib.egglib_binding.ABC`::

    egglib abc_fit input=SNM.txt tolerance=0.1 transform=tan output=SNM.fit
    
Here we filled all arguments. The input file is ``SNM.txt`` which
actually contains only the values of the run parameters. The actual data
file is given as one of the parameters and must be found in the same
directory. The tolerance gives the proportion of points to select in
the region closest to the observed statistics. This defines the local
region and, in practice, should be more stringent than in this example.
 The fitting procedure can be applied to data from a unique model only.
The transformation of parameter values (here, tangential transformation)
is useful in particular to prevent parameters from exceeding prior bounds.
The output file will contain only parameter values, selected and corrected,
and will be considered as a posterior distribution.

--------------------
Comparing ABC models
--------------------

The rejection-based method of Fagundes et al. 2007 PNAS allows to
compare alternative model without relying on Bayes Factors. It is based
on the rejection step of the rejection-regression procedure, which
doesn't use the parameter values and therefore allows to merge several
models. The model probabilities are estimated as the proportion of
points originating from each model in a pre-defined number of accepted
points::

    egglib abc_compare input=SNM.txt,PEM.txt tolerance=0.1 output=compare.txt
    
The different sample files are passed in the *input* argument. The
*tolerance* argument is the same as for the fitting procedure, but note
that here it applies to the sum of points over all the passed models.
The output file will contain the accepted points, along with their
weigths. The model probabilities will be displayed in the standard
output (the terminal).

--------------------------------
Analyzing posterior distribution
--------------------------------

A set of commands are designed to help analyzing posteriors, after
they have been fitted.

    * ``abc_bin`` discretizes posteriors into a regular grid over all
      parameter dimensions.
      
    * ``abc_statsmarg`` provide statistics for individual parameters
      independently, from a fitted posterior.
      
    * ``abc_statsdisc`` (for models with more than one parameter),
      analyzes a discretized posterior (and, in particular, identifies
      the maximum density point of the distribution).
      
    * ``abc_plot1D`` and ``abc_plot2D`` display graphical representations
      parameters (respectively, a single parameter and the covariation
      of two parameters). They require that :mod:`matplotlib` is
      available and they use discretized posteriors.
      
    * ``abc_psimuls``, finally, allows to use posteriors to generate
      new simulations for testing neutrality of loci or assess
      goodness-of-fit of models.