.. _manual: ====== Manual ====== This section provides primers on the usage of the Python module and of the executable commands. In the following, representative examples of the most common uses of Egglib will be addressed, assuming that the user aims at developing simple scripts (that is, without extensive error checking). Import/export of sequence data ============================== Egglib relies heavily on two classes: sequence sets (that are not considered as aligned), :class:`~egglib.Container`, and sequence alignments (for which all sequences must have the same lengths), :class:`~egglib.Align`. Many operations are available as methods of these two classes (including polymorphism analysis). .. note:: There is no sequence validity check at the level of storage classes. The parser only skips all white spaces and imports characters as upper cases, but it accepts all characters, and all characters are allowed when setting sequences within Python. Checking is performed at latter stages. -------------------------------------------- Import from FASTA-formatted files or strings -------------------------------------------- Egglib is based primarily on Pearson's FASTA format for data input, due to its simplicity and the fact that it can be generated by many other software. In Python, the constructor of both :class:`~egglib.Container` and :class:`~egglib.Align` includes a parser that can take a file name. Creating a sequence set/alignment objects is then concise. Assuming the file ``data1.fas`` contain a sequence alignment formatted as follows: :: >AY446407 GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA-- --------CACACACACGTGGCACTTTCTTCCTGCGGCAC >AY446408 GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA-- --------CACACACACGTGGCACTTTCTTCCTGCGGCAC >AY446409 GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA-- --------CACACACACGTGGCACTTTCTTCCTGCGGCAC >AY446410 GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA-- --------CACACACACGTGGCACTTTCTTCCTGCGGCAC >AY446411 GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA-- --------CACACACACGTGGCACTTTCTTCCTGCGGCAC >AY446412 GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA-- ----CACACACACACACGTGGCACTTTCTTCCTGCGGCAC >AY446413 GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA-- ----CACACACACACACGTGGCACTTTCTTCCTGCGGCAC >AY446414 GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA-- ----CACACACACACACGTGGCACTTTCTTCCTGCGGCAC >AY446415 GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACA-- ----CACACACACACACGTGGCACTTTCTTCCTGCGGCAC >AY446416 GAGGCTCATCAGCCTGTTATTTACTGAAATTGAATGAAAAATGAGAGAGACGAGAAATGA GAAAAA-AATAAAATAAA-----ATAAAATAGTTCAGTTATGGATAAGCAAATACACAC- -ACACACACACACACACGTGGCACTTTCTTCCTGCGGCAC The alignment can be imported as follows: >>> import egglib >>> align = egglib.Align("data1.fas") The constructor of :class:`~egglib.Align` requires that all sequences have the same length. For non-aligned data (sequence sets), use the :class:`~egglib.Container` class instead. It is also possible to create an instance from a fasta-formatted string loaded in memory, as in the example below (this let you modify the string, if needed): >>> f = open("data1.fas") >>> fasta = f.read() >>> f.close() >>> align = egglib.Align(string=fasta) -------------------------------------------- Import population information as group label -------------------------------------------- Both :class:`~egglib.Container` and :class:`~egglib.Align` can import *group labels*, which are special tags present in the sequence names. This feature is disabled by default (because such tags are not defined in the FASTA format). Note that group labels are always present in :class:`~egglib.Container` and :class:`~egglib.Align` instances and that they are initialized to the default value ``0`` if the option described hereafter is not activated. You can activate group label detection by using the flag option *groups* of the constructors. In such case, do avoid to use ``@`` anywhere in sequence names. If names in your imported file or string contains labels such as ``@0``, ``@1``, ``@42`` (``@`` followed by an integer, either at the end of the string or followed by a space), they will be recognized. Assume that ``data2.fas`` contains such labels, as reproduced below (sequences have been omitted): :: >AY446407 >AY446408@0 >AY446409@1 >AY446410 @18 >AY446411@4 @6 >AY446412@X >AY446413@7 @X >AY446414@7 @0 >AY446415@42 POP42 >AY446416@8POP8 Some labels have been deliberately placed in strange positions or repeated. Now you can import the data and check how the names have been processed (the iterator will be addressed in the next section): >>> align = egglib.Align("data2.fas", groups=True) >>> for name,sequence,group in align: >>> print 'name:',name, 'group:',group ... results in: :: name: AY446407 group: 0 name: AY446408 group: 0 name: AY446409 group: 1 name: AY446410 group: 18 name: AY446411 group: 46 name: AY446412 group: 0 name: AY446413 group: 7 name: AY446414 group: 1 name: AY446415POP42 group: 42 name: AY446416 group: 8 Note first that the names have been modified: the group index tags have been removed from the name and, in the case of malformed, repeated tags or tags not placed at the end of the sequence, it will not be possible to generate the exact same file anymore (the last sequence, for example, will be ``AY446416@8``. Note also that the group labels are really labels (not all values need to be used, and the order is not relevant). Also, importantly, group labels are always integers. * ``AY446407``: This has been interpreted as ``AY446407@0`` (the value ``0`` is the default). * ``AY446408@0``: The group label is ``0``. * ``AY446409@1``: The group label is ``1``. * ``AY446410 @18``: The group label is ``18``. Note that many values have been skipped. It will never be assumed that populations 2 through 17 exist. Note also that there is a space between the name itself and the tag, and that the space is still present in the sequence name: the name really is ``"AY446410 "``. **This is error-prone and should be avoided.** * ``AY446411@4 @6``: Two labels have been entered, and they have been concatenated to form the integer ``46``. This syntax is not recommended. * ``AY446412@X``: The label is not integer: the data was lost. * ``AY446413@7 @X``: One of the labels is not an integer, this was lost (avoid using repeated tags). * ``AY446414@7 @0``: again a repeated tag. Note that, in the previous example, ``@X`` really was ignored (it was not changed into ``0``). * ``AY446415@42 POP42``: Additional information is added after the group label. One the tag and the space after it were removed. This syntax is not recommended. * ``AY446416@8POP8``: the space after the group label tag was omitted. Data has been lost. .. note:: When using ``groups=True`` when importing fasta data, always place group label tags at the end of the name without space. It will prevent lots of errors and allow consistent input/output operations (the input name can be generated again when exporting), which is the preferable behaviour. .. note:: The group label ``999`` (entered as ``@999`` in input file) has a special meaning: it identifies an outgroup sequence. It is possible to enter several outgroup sequences. --------------------------------------- Conversion to Container and Align types --------------------------------------- The :meth:`create` method can be called on either :class:`~egglib.Container` or :class:`~egglib.Align` classes to create corresponding instances. :meth:`create` expects as argument an object of any iterable type, provided that the iterations return either ``(name, sequence)`` or ``(name, sequence, group)`` tuples. Note that both :class:`~egglib.Container` and :class:`~egglib.Align` are compatible. Therefore, :meth:`create` can be used to perform a deep copy of sequence sets/alignments or convert one type into another: >>> copy = egglib.Align.create(align) >>> container = egglib.Container.create(align) Convert from :class:`~egglib.Container` to :class:`~egglib.Align` is unlikely to be needed. Note that conversion from :class:`~egglib.Container` to :class:`~egglib.Align` is only possible when all sequences have the same length, which can be fixed either using alignment utilities or using the :meth:`~egglib.Container.equalize` method. It is also possible to use :meth:`create` to create an instance from an ad hoc object (provided that iterations returns two- or three-item tuples: >>> sequences = [ ('name1', 'GAGCGTGCCGCGAGAGCGTTGCCAAGAGTGCCCGTGAT', 0), ... ('name2', 'GAGCGTGCCGCGAGAGCGTTGCCAAGATTGCGCGTGAT', 0), ... ('name3', 'GAGCGTGTCGCGAGAGCGTTGCCAAGAGTGCGCGTGAT', 0), ... ('name4', 'GAGCGTGTCGCGCGCGCGTTGCCAACAGTGCCCGCGAT', 1), ... ('name5', 'GAGCGTGTCGCGCGCGCGTTGCCAAGAGTGCCCGCGAT', 1), ... ('name6', 'GAGCGTGTCGCGCGCGCGATGCCAAGAGTGCCCGCGAT', 1) ] >>> align = egglib.Align.create(sequences) Remember that sequences must have the same length to be imported as an :class:`~egglib.Align`: >>> sequences = [ ('name1', 'GAGCGTGCCGCGAGAGCGTTGCCAAGAGTGCCCG'), ... ('name2', 'GAGCGTGCCGCGAGAGCGTTGCCAAGATTGCGCGTGAT') ] >>> container = egglib.Container.create(sequences) >>> print 'Container created, length:', len(container) >>> align = egglib.Align.create(sequences) >>> print 'Align created, length:', len(align) results in: :: Container created, length: 2 Traceback (most recent call last): ... traceback omitted ... ValueError: Sequence doesn't match the alignment length --------------------- Exports sequence data --------------------- Default format for sequence data is also Pearson's FASTA format. FASTA strings are automatically generated using the ``print`` statement (the ``align`` object used here is the same as the last but one example): >>> print align results in: :: >name1 GAGCGTGCCGCGAGAGCGTTGCCAAGAGTGCCCGTGAT >name2 GAGCGTGCCGCGAGAGCGTTGCCAAGATTGCGCGTGAT >name3 GAGCGTGTCGCGAGAGCGTTGCCAAGAGTGCGCGTGAT >name4 GAGCGTGTCGCGCGCGCGTTGCCAACAGTGCCCGCGAT >name5 GAGCGTGTCGCGCGCGCGTTGCCAAGAGTGCCCGCGAT >name6 GAGCGTGTCGCGCGCGCGATGCCAAGAGTGCCCGCGAT This doesn't allow to export group labels. To do so, use the :meth:`str` method, which take a ``exportGroupLabels`` option (and is available on both :class:`~egglib.Container` and :class:`~egglib.Align`): >>> string = align.str(exportGroupLabels=True) >>> print string results in: :: >name1@0 GAGCGTGCCGCGAGAGCGTTGCCAAGAGTGCCCGTGAT >name2@0 GAGCGTGCCGCGAGAGCGTTGCCAAGATTGCGCGTGAT >name3@0 GAGCGTGTCGCGAGAGCGTTGCCAAGAGTGCGCGTGAT >name4@1 GAGCGTGTCGCGCGCGCGTTGCCAACAGTGCCCGCGAT >name5@1 GAGCGTGTCGCGCGCGCGTTGCCAAGAGTGCCCGCGAT >name6@1 GAGCGTGTCGCGCGCGCGATGCCAAGAGTGCCCGCGAT To export data directly to a file, use the :meth:`write` method (also available on both :class:`~egglib.Container` and :class:`~egglib.Align`). This method calls underlying C++ code and, on large datasets, is far more efficient than generating the full string and placing it to a file. The below example generates a file containing the string of the last example: >>> align.write("data3.fas", True) In particular in the case of :class:`~egglib.Align` instances, output in alternative formats can be performed using :meth:`~egglib.Align.nexus`, :meth:`~egglib.Align.phylip` and :meth:`~egglib.Align.phyml` methods. Conversely, a few alternative alignment formats are available for input in the :mod:`~egglib.tools` module (see the section "Specific data formats"). Manipulation of sequence sets and alignments ============================================ --------- Iteration --------- Iteration is frequently useful, in particular for sequence data. When you iterate over a :class:`~egglib.Container` or a :class:`~egglib.Align` instance, each iteration step yields a :class:`~egglib.SequenceItem` object. You should never need to instanciate a :class:`~egglib.SequenceItem` yourself, and using it should be as simple as accessing it three attributes :attr:`name`, :attr:`sequence` and :attr:`group`. This class is designed to allow access/modify the instance without necessarily copy the sequence data. In the current implementation, the sequence data will still be extracted if :attr:`~egglib.SequenceItem.sequence` is accessed, however. Note that :class:`~egglib.SequenceItem` instances keep a record of their parent :class:`~egglib.Container` or :class:`~egglib.Align` instance and that, preferably, they should not be stored outside the loop. Assume you have a FASTA file ``data4.fas`` with an alternative way of coding population membership (underscore instead of ``@``): :: >name1_1 GAGCGTGCCGCGAGAGCGTTGCCAAGAGTGCCCGTGAT >name2_1 GAGCGTGCCGCGAGAGCGTTGCCAAGATTGCGCGTGAT >name3_1 GAGCGTGTCGCGAGAGCGTTGCCAAGAGTGCGCGTGAT >name4_2 GAGCGTGTCGCGCGCGCGTTGCCAACAGTGCCCGCGAT >name5_2 GAGCGTGTCGCGCGCGCGTTGCCAAGAGTGCCCGCGAT >name6_2 GAGCGTGTCGCGCGCGCGATGCCAAGAGTGCCCGCGAT Here is how to import these labels and modify the underlying instance: >>> align = egglib.Align("data4.fas") >>> for i in align: ... root, label = i.name.split('_') ... label = int(label) ... i.name = root ... i.group = label >>> align.write("data5.fas", True) This generates the following file: :: >name1@1 GAGCGTGCCGCGAGAGCGTTGCCAAGAGTGCCCGTGAT >name2@1 GAGCGTGCCGCGAGAGCGTTGCCAAGATTGCGCGTGAT >name3@1 GAGCGTGTCGCGAGAGCGTTGCCAAGAGTGCGCGTGAT >name4@2 GAGCGTGTCGCGCGCGCGTTGCCAACAGTGCCCGCGAT >name5@2 GAGCGTGTCGCGCGCGCGTTGCCAAGAGTGCCCGCGAT >name6@2 The syntax ``for name, sequence, group in align`` is supported but has two flaws: 1) it will cause a performance overhead in case sequences are long and/or numerous and that they don't need to be accessed, and 2) this doesn't allow to modify the underlying instances, since ``name``, ``sequence`` and ``group`` variables are strings and integer, and therefore not modifiable. .. note:: When setting the :attr:`~egglib.SequenceItem.sequence` attribute of a :class:`~egglib.SequenceItem`, the new sequence must match the length of the alignment. --------------------- Sequence data edition --------------------- It is possible to access (read or modify) full data at a given position using subscript indexing (``[]``). The operator takes or returns a three-item tuple: >>> print align[0] >>> align[0] = 'first name', 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA', 8 >>> print align[0] >>> align[0] = 'first name', 'TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT' >>> print align[0] results in: :: ('name1', 'GAGCGTGCCGCGAGAGCGTTGCCAAGAGTGCCCGTGAT', 1) ('first name', 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA', 8) ('first name', 'TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT', 0) Note that, if the group label is omitted, the value of ``0`` is assumed and replaces the previous value. It is possible to use negative indices, as in ``align[-1]``. The ``del`` statement is allowed for removing a full entry (name, sequence and group) as in ``del align[2]``. To remove a sequence based on its name, use :meth:`~egglib.Align.remove`. It is never allowed to access indices equal to or larger than the current length of the sequence set or alignment, including to add a new sequence (see below). The methods :meth:`~egglib.Align.set` and :meth:`~egglib.Align.get` allow to access a single character at a specific position. Note also that the overloaded methods :meth:`~egglib.Align.name`, :meth:`~egglib.Align.sequence` and :meth:`~egglib.Align.group` allow to access/modify only one of these attribute directly at a given position. :meth:`~egglib.Align.sequenceByName` allows to get a sequence string using the name and not the index. It is possible to add a stretch a sequence to a sequence at a given position, on :class:`~egglib.Container` only, using the method :meth:`~egglib.Container.appendSequence`. The method :meth:`~egglib.Align.append` allows to add a new sequence at the end of the sequence set or alignment, and the method :meth:`~egglib.Align.addSequences` allows to do it repeatively. On a :class:`~egglib.Align` instance, the methods :meth:`~egglib.Align.column` and :meth:`~egglib.Align.removePosition` allow to (respectively) access and remove a full column. Finally, the code below allow to extract the first three sequences of an alignment, and only the first 10 and the last 10 positions: >>> part = align.slice(0, 3) >>> part = part.extract(range(10)+range(-10,-1)) >>> print part results in: :: >name1 GAGCGTGCCGTGCCCGTGA >name2 GAGCGTGCCGTGCGCGTGA >name3 GAGCGTGTCGTGCGCGTGA Manipulation of other data types (trees, annotated sequences, SSR) ================================================================== Phylogenetic trees are implemented as the class :class:`~egglib.Tree` and GenBank flatfile data as the class :class:`~egglib.GenBank`. Both can also be instanciated from files and strings. Similarly to sequence sets and alignments, iteration over instances of these two classes yield ad hoc class instances allowing to manage, respectively, a node of the tree and a feature. Available methods on trees allow to edit the them (such as rooting) and explore them (such as identifying monophyletic groups), while methods on GenBank flatfile data allow to access and edit annotation information and extract data (such as coding sequence strings). The :class:`~egglib.SSR` class is designed to import, export and compute statistics on microsatellite data. Analysis of polymorphism ======================== Function to compute polymorphism are available as methods of the :class:`~egglib.Align` and :class:`~egglib.SSR` classes: * :meth:`~egglib.Align.polymorphism`: compute standard statistics, neutrality indices, and differentiation statistics (if applicable). * :meth:`~egglib.Align.polymorphismBPP`: compute statistics available through the Bio++ library (including statistics based on nonsynonymous/synonymous variation, if applicable). Bio++ must have been available when installing EggLib to allow using this function. * :meth:`~egglib.SSR.stats`: compute microsatellite statistics. * :meth:`~egglib.SSR.Fstats`: compute F-statistics (based on Weir and Cockerham formulas) for microsatellite. Note that the first three methods return dictionaries and that, in the case of the first two methods, the content of these dictionaries will depend on both the value of options (some statistics can be skipped) and on the content of the instance (differentiation statistics are not computed if only one population is present). Please refer to the documentation of these methods for more details. Coalescent simulations ====================== The structure of the :mod:`~egglib.simul` module (see its description) is designed as a compromise between flexibility (make available all features of the underlying C++ library), usability (not being too clumsy with a large number of options in a single function) and simplicity (reduce the number of required steps). The below example shows how to perform a given number of simulations of a moderately complex model (two populations that of which one is the result of a founding event from the first one, and a standard model of mutation with an infinite number of sites). Note that many more elaborate scenarios and mutation models are available. >>> # sets parameters >>> sample1 = 40 # number of sample in pop 1 >>> sample2 = 40 # number of sample in pop 2 >>> date1 = 0.20 # end of bottleneck >>> date2 = 0.25 # split date/start of bottleneck >>> strength = 0.1 # strength of the bottleneck >>> migr = 0.0 # migration rate (default is 0.1) >>> theta = 2. # theta for gene >>> nrepets = 1000 # number of simulations >>> >>> # sets the two parameter classes >>> paramSet = egglib.simul.CoalesceParamSet([sample1, sample2], M=migr) >>> paramSet.changeSinglePopulationSize(date1, 1, strength) >>> paramSet.populationFusion(date2, 0, 1) >>> mutator = egglib.simul.CoalesceFiniteAlleleMutator(theta=2.) >>> >>> # performs simulations >>> aligns = egglib.simul.coalesce(paramSet, mutator, nrepets) >>> >>> # computes statistics >>> pol_data = map(egglib.Align.polymorphism, aligns) >>> D = [i['D'] for i in pol_data] >>> Fst = [i['Fst'] for i in pol_data] >>> print 'average D:', sum(D)/nrepets >>> print 'average Fst:', sum(Fst)/nrepets Note that is is not necessary to formally end the bottleneck by ``paramSet.changeSinglePopulationSize(date0, 1, 1.)`` since this population is removed at the same date. Here's the result of a particular run: :: average D: -0.557003767734 average Fst: 0.373726230187 Interactive usage and Approximate Bayesian Computation ====================================================== In this section we describe the command line tools available through EggLib. A number of short programs have been written under the form of subclasses of :class:`~egglib.utils.BaseCommand` in the module :mod:`~egglib.utils`. The ``egglib`` script (which is installed at the same tyme as the EggLib Python module), lets you run automatically these programs. In this section we present some generalities on the general usage of the ``egglib`` script, and then provide an example of usage with a tutorial for ABC analyses using such commands. Note that all commands presented below are system commands (entered through a command terminal) and not Python code. ----------------- The egglib script ----------------- The general usage of the ``egglib`` script is: :: $ egglib where *command* is the name of any of the available programs. An alternative syntax is available as ``python -m egglib.utils `` By typing :: $ egglib only, a list of all commands will be displayed. By typing the command name only (without options), you will be provided with the manual page of this command. For example in the case of one of the simplest commands, *infos*:: $ egglib infos Concerning *options*, they are variables among commands (the command's manual page always provide the list and documentation). It is important to note that, even if the syntax is standardized over different commands, they really are different programs with different syntaxes (see however the *quiet* and *debug* boolean flags below. Options can be of two types: * **Keyword arguments**. They are keyword-identified values, entered in the syntax ``keyword=value`` where *keyword* is one of the available options and *value* the value you wish to enter. They are usually some restrictions on the format of the value and some options are flexible. In some cases (especially for values that contain characters that can be mistaken with command interpreter modifiers such a spaces and semicolons) it can be necessary to wrap the option in quotes. Some keyword arguments have a default value, and some have not. The latter must be entered to run the command. The *input* option is found in most commands, and frequently expects a file name as in ``input=file`` (allowing sometimes to enter multiple file names as in ``input=file1,file2,file3``), but in a few cases it expects the name of a directory containing files to process. * **Boolean flags**. They are entered as a single token and they activate a specified feature. They are never required and are always off by default (but note that some flags have a negative effect, such a ``no_check`` in command ``cprimers``). Two flags are available automatically for all commands: * ``quiet``: If this flag is activated, the command will run without console output. Note that the flag has not effect on commands that don't generate console output anyway, as well as commands whose only task is to generate console output (this is the case for the ``infos`` command, for example). * ``debug``: By default, all errors are intercepted by the *egglib* script and summarized in a one-line error statement. The flag ``debug`` allows to output the full error traceback (this is useful for debugging and reporting errors). Options (keyword arguments as well as boolean flags) can be entered in any order, but cannot be repeated. It is always required to enter at least one keyword argument (sometimes, several must be entered). Here is the end of the example with the ``infos`` command. Here is how to get information on the ``data4.fas``:: $ egglib infos input=data4.fas data4.fas ... 6 sequences ... alignment ... length: 38 And how to get information on several files (``data2.txt`` contains names but no sequences):: $ egglib infos input=data1.fas,data2.fas,data3.fas,data4.fas data1.fas ... 10 sequences ... alignment ... length: 160 data2.fas ... 10 sequences ... alignment ... length: 0 data3.fas ... 6 sequences ... alignment ... length: 38 data4.fas ... 6 sequences ... alignment ... length: 38 The default error statement (the file ``data6.fas`` doesn't exist):: $ egglib infos input=data6.fas An error occurred: [ValueError] cannot open data6.fas And how to get more information:: $ egglib infos input=data6.fas debug An error occurred: [ValueError] cannot open data6.fas Traceback (most recent call last): File "/usr/bin/egglib", line 67, in obj.run(*args, **kwargs) File "/usr/lib64/python2.7/site-packages/egglib/utils.py", line 209, in run self._run(*fargs, **kwargs) File "/usr/lib64/python2.7/site-packages/egglib/utils.py", line 3386, in _run container = data.Container(fname) File "/usr/lib64/python2.7/site-packages/egglib/data.py", line 767, in __init__ if not os.path.isfile(fname): raise ValueError, 'cannot open %s' %fname ValueError: cannot open data6.fas (Of course in that case the source of the error can be identified with the message alone. In that case, the exception has been thrown by the :class:`~egglib.Align` constructor when the file was not found.) ------------ ABC sampling ------------ In this example we will consider a simulated dataset of 10 independent alignments of 1000 nucleotides, simulated under a model containing 2 isolated populations of equal size and connected with a bidirectional migration rate (4Nm) of 0.5, with a mutation rate (4Nu) of 3.5 for gene (that is 0.0035 per site). 20 sequences were sampled in each population. We will see if it is possible to reject the standard neutral model (SNM) in favour of the (real) island model with, in that case, two demes (IM) and if it is possible to correctly estimate the parameters. We will first sample random parameters in uniform priors under these two alternative models (SNM and IM) and perform one simulation for each sample (one simulation contains the 10 loci). The command to do so is:: egglib abc_sample dir=fas params=SNM.txt data=SNM.out model=SNM prior="%U(0;0.01)" stats=SFS:4 The following arguments were entered: * *dir*: The name of a directory containing fasta files to analyze. * *params*: The name of the run parameter output file. * *data*: The name of data output file. * *model*: The label of the demographic model, here the Standard Neutral Model. * *prior*: The definition of the prior. Here, the prior is defined as a string (starting with a ``%`` character to indicate that the argument is not a file name), and specifies a uniform distribution bound between 0 and 0.01 (the SNM model has only one parameter). * *stats*: The set of summary statistics, here the Site Frequency Spectrum. This particular set of summary statistics expects an argument (the number of classes in the spectrum), which is passed after the ``:`` character. The following parameters were left to the default values: * *ext* (``fas``): This argument allows to set the extension of fasta files to process. * *post* (10000): The number of point to sample for posterior estimation. * *step* (100): The rate of refreshing of the progress information. * *seeds* (automatic): Random number generator seeds. * *restart*: This option should be used to resume an interrupted. It takes a run parameter file (the file named by the *params* option) and will attempt to finish the run. No other options should be entered along with *restart*. ---------------------------------- ABC rejection-regression procedure ---------------------------------- The ``abc_fit`` command performs the steps of the standard rejection-regression procedure described in Beaumont et al. 2002 Genetics using the underlying C++ class :class:`~egglib.egglib_binding.ABC`:: egglib abc_fit input=SNM.txt tolerance=0.1 transform=tan output=SNM.fit Here we filled all arguments. The input file is ``SNM.txt`` which actually contains only the values of the run parameters. The actual data file is given as one of the parameters and must be found in the same directory. The tolerance gives the proportion of points to select in the region closest to the observed statistics. This defines the local region and, in practice, should be more stringent than in this example. The fitting procedure can be applied to data from a unique model only. The transformation of parameter values (here, tangential transformation) is useful in particular to prevent parameters from exceeding prior bounds. The output file will contain only parameter values, selected and corrected, and will be considered as a posterior distribution. -------------------- Comparing ABC models -------------------- The rejection-based method of Fagundes et al. 2007 PNAS allows to compare alternative model without relying on Bayes Factors. It is based on the rejection step of the rejection-regression procedure, which doesn't use the parameter values and therefore allows to merge several models. The model probabilities are estimated as the proportion of points originating from each model in a pre-defined number of accepted points:: egglib abc_compare input=SNM.txt,PEM.txt tolerance=0.1 output=compare.txt The different sample files are passed in the *input* argument. The *tolerance* argument is the same as for the fitting procedure, but note that here it applies to the sum of points over all the passed models. The output file will contain the accepted points, along with their weigths. The model probabilities will be displayed in the standard output (the terminal). -------------------------------- Analyzing posterior distribution -------------------------------- A set of commands are designed to help analyzing posteriors, after they have been fitted. * ``abc_bin`` discretizes posteriors into a regular grid over all parameter dimensions. * ``abc_statsmarg`` provide statistics for individual parameters independently, from a fitted posterior. * ``abc_statsdisc`` (for models with more than one parameter), analyzes a discretized posterior (and, in particular, identifies the maximum density point of the distribution). * ``abc_plot1D`` and ``abc_plot2D`` display graphical representations parameters (respectively, a single parameter and the covariation of two parameters). They require that :mod:`matplotlib` is available and they use discretized posteriors. * ``abc_psimuls``, finally, allows to use posteriors to generate new simulations for testing neutrality of loci or assess goodness-of-fit of models.