Integration of external software¶
Principle¶
EggLib provides wrappers of external programs in order to extend its functionalities besides its own population genetics and sequence management tools (although providing a wide array of external tools is not one of the primary aims of EggLib).
The available tools are presented through an integrated interface: once configured, they can be used as any regular EggLib function, using objects from EggLib class as argument and/or return values.
Currently, the wrappers module provides:
Multiple sequence alignment using Clustal Omega and MUSCLE.
Maximum-likelihood phylogeny using PhyML.
Maximum-likelihood fitting and testing of coding sequence evolution using
CodeML
from the PAML package.Distance-based phylogeny using
dnadist
,protdist
andneighbor
from the PHYLIP package.Basic Local Alignment using
blastn
,blastp
,blastx
,tblastn
,tblastx
andmakeblastdb
from the BLAST+ standalone package.
The requested programs must be installed separately.
Usage¶
Configuration of application paths¶
By default, all applications are disabled. Any attempt to use any wrappers
function will cause a RuntimeError
as in the example below:
>>> import egglib
>>> cnt = egglib.io.from_fasta('sequences3.fas', cls=egglib.Container, alphabet=egglib.alphabets.DNA)
>>> aln = egglib.wrappers.clustal(cnt)
[traceback omitted]
RuntimeError: Clustal Omega program not available -- please configure path
There is a paths object in the wrappers module which is responsible for holding the command executing the external applications. The applications are identified by these keys:
Key |
Application |
---|---|
|
Clustal Omega |
|
MUSCLE |
|
PhyML |
|
CodeML (PAML package) |
|
dnadist (PHYLIP) |
|
protdist (PHYLIP) |
|
neighbor (PHYLIP) |
|
blastn (BLAST+) |
|
blastp (BLAST+) |
|
blastx (BLAST+) |
|
tblastn (BLAST+) |
|
tblastx (BLAST+) |
|
makeblastdb (BLAST+) |
We can see that the default value is None
:
>>> print(egglib.wrappers.paths['clustal'])
None
This object can be configured using the command-line tools egglib-config
,
as described in Configuring external applications. The approach for editing paths from Python
scripts is described below.
Auto-detection of applications¶
The method autodetect()
tries a set of
pre-defined commands and, if any works, set it as the command to run the
corresponding application:
>>> egglib.wrappers.paths.autodetect(verbose=True)
> codeml: codeml --> ok
> phyml: phyml --> fail ("No such file or directory")
> clustal: clustalo --> fail ("No such file or directory")
> muscle: muscle --> ok
This method has a verbose option which provide you with information (success or failure, and message error in case of failure). In case of a failure, the corresponding application is disabled but this is not treated as an error.
In this example, the default commands codeml
and muscle
succeeded, so the
two applications are now available. But the command for Clustal Omega was
not successful, either because the program is not installed, or not installed under
this name. As a result, it is still not available:
>>> print(egglib.wrappers.paths['clustal'])
None
Of course, this result is dependent on system configuration and properties.
Setting application paths manually¶
It is possible to specify a command string (including the full path of the program executable) for any application. Assume that we have compiled Clustal Omega, but did not install it in the binary search path (the executable remains in the user’s home directory). We can set the path as follows:
>>> egglib.wrappers.paths['clustal'] = '/home/user/data/software/clustal-omega-1.2.1/src/clustalo'
>>> print(egglib.wrappers.paths['clustal'])
/home/user/data/software/clustal-omega-1.2.1/src/clustalo
If the path is incorrect, or if the specified command does not run as expected for this software, an error will be caused.
Saving and loading paths¶
Once paths have been set satisfactorily, it is possible to save them in a system file in order to have them available for the next session. When EggLib is imported again, the saved paths will be loaded and the properly configured application will be immediately usable. In case a program has been removed, it will cause an error when attempting to run it. The command to save paths is:
>>> egglib.wrappers.paths.save()
If the paths has been modified during a session, it is still possible to reset the paths from the saved configuration, discarding any changes that have been applied since EggLib has been imported, by running:
>>> egglib.wrappers.paths.load()
Using applications in scripts¶
This manual and the reference manual explain how to use the functions calling the external applications and they are aimed at users who are already experienced with these programs. It can be necessary to refer their own documentation to understand the meaning of available options and returned data.
Note that, for all wrappers, there is a verbose
option to control
whether or not the program output should be displayed on screen or deleted.
PhyML¶
PhyML is a software for maximum-likelihood phylogenetic reconstruction
using amino acid or nucleotide sequences. It is available through the
function wrappers.phyml()
(reference included in the documentation).
The usage is fairly straightforward: it takes an Align
object
as argument and returns a tuple
with two items (the phylogenetic
tree as a Tree
object and a dictionary of statistics, respectively):
>>> aln = egglib.io.from_fasta('align8.fas',alphabet=egglib.alphabets.DNA)
>>> tree, stats = egglib.wrappers.phyml(aln, model='HKY85')
>>> print(stats)
{'freqs': [0.2844, 0.19657, 0.22759, 0.29143], 'ti/tv': 4.0, 'pars': 6983, 'lk': -33416.88715, 'size': 5.44449}
>>> print(tree.newick())
(CasLYK3:0.01237505,CasLYK2:0.0,((FvLYK2:0.09853004,(MdLYK3:0.10732566,PpLYK3:0.05300631):0.02447224):0.05989021,(PtLYK3:0.15268966,((MtLYK8:0.14003722,(LjLYS7:0.08155174,(GmLYK3:0.04846372,CacLYK3:0.05809886):0.04392275):0.02845355):0.11447351,((VvLYK3:0.11542039,VvLYK2:0.08334382):0.06445392,(VvLYK1:0.12763572,((PtLYK2:0.05871896,PtLYK1:0.03968281):0.11719099,((((CacLYK1:0.0760468,(GmNFR1a:0.03511832,GmNFR1b:0.0287519):0.02035824):0.03344143,(LjNFR1a:0.07231443,(MtLYK2:0.04784824,(MtLYK3:0.07366331,PsSYM37:0.06907117):0.01683888):0.05128711):0.02178421):0.14129609,((CecLYK1:0.1342407,((MtLYK7:0.11288812,(CacLYK4:0.05284059,GmLYK2:0.048187):0.05687575):0.04162164,((MtLYK1:0.13291815,LjNFR1b:0.11749646):0.06448685,(MtLYK6:0.13391465,LjNFR1c:0.11699053):0.03439881):0.02680216):0.04837524):0.03794071,(CecLYK2:0.07421816,((GmLYK2b:0.05222029,CacLYK2:0.07144586):0.05212459,(LjLYS6:0.05742719,MtLYK9:0.10847179):0.02814181):0.06743708):0.01846505):0.02443726):0.05441246,(CasLYK1:0.10100211,(AtCERK1:0.35336624,(FvLYK1:0.12483414,((MdLYK2:0.05670659,MdLYK1:0.03964285):0.04192224,(PpLYK1:0.06030315,PpLYK2:0.05194511):0.02388192):0.03157062):0.04509856):0.01641018):0.02078968):0.02797866):0.04205346):0.06152622):0.05814891):0.02685265):0.02416823):0.14995695);
Besides the input sequence alignment, this function takes an array of option.
Only one is required (model
, the model name). See the reference manual
for the function for details. Most of the options are passed to the software.
The phylogenetic tree is returned as the first item of the return value.
It is an object of the EggLib class Tree
, which supports a wide
a range of editing operation (management of internal nodes, subtree extraction,
search for phylogenetic clades). The second item is a dictionary of statistics
including maximum-likelihood estimates of model parameters. See the function
manual for details.
CodeML¶
CodeML is one of the program included in the PAML software package, a widely use
collection of tools for the analysis of the evolution of biological sequences.
The wrappers.codeml()
function of EggLib focuses on the tools describing
the evolution of coding sequences.
To be used, this tools requires an alignment of coding sequences (in particular with an alignment length which is a multiple of 3), and a phylogenetic tree whose leaves names match the sequence names. Let us assume that this is what we have with the alignment imported in the previous example. We also have built the phylogenetic with PhyML already. Let us check that the number of samples matches and that the alignment length is a multiple of 3:
>>> print(aln.ns, tree.num_leaves)
17 17
>>> print(aln.ls)
270
Note that it is also necessary to ensure that there are no stop codons in the
alignment and that the list of names match exactly. In addition, the tree must
not be rooted (the base must be a trifurcation, which is the standard for
trees reconstructed by maximum likelihood). This being said, to
run PhyML, we only need an Align
object, a Tree
object,
and the name of one of the models:
>>> aln = egglib.tools.to_codons(aln)
>>> results = egglib.wrappers.codeml(aln, tree, 'M1a')
>>> print(results['lnL'])
-2081.576434
>>> print(results['omega'])
[0.0905, 1.0]
>>> print(results['freq'])
[0.84535, 0.15465]
>>> print(results['site_w']['postw'])
[0.091, 0.091, 0.091, 1.0, 0.092, 0.094, 0.091, 0.091, 0.092, 0.107, 0.158, 0.092, 0.112, 0.091, 0.311, 0.091, 0.091, 0.091, 0.091, 0.091, 0.093, 0.136, 0.753, 0.092, 0.603, 0.939, 0.091, 0.091, 0.38, 0.093, 0.091, 0.1, 0.093, 0.097, 0.091, 0.092, 1.0, 0.091, 0.091, 0.091, 0.997, 0.091, 0.093, 0.091, 0.322, 0.092, 0.091, 0.091, 0.091, 0.092, 0.092, 0.091, 0.092, 0.093, 0.091, 0.091, 0.092, 0.091, 0.091, 0.091, 0.129, 0.092, 0.127, 0.091, 0.091, 0.091, 0.092, 0.987, 0.091, 0.168, 0.092, 0.213, 0.689, 0.989, 0.399, 0.997, 0.992, 0.093, 0.134, 0.985, 0.092, 0.099, 0.925, 0.091, 0.104, 0.102, 0.091, 0.091, 0.092, 0.091]
Please refer to the manual of the wrappers.codeml()
function for
a detailed list of options (including the list of available models), and
the list of returned data. The return value is a dict
containing
a large number of entries exposing most of the contents of CodeML output.
In this example we displayed the log-likelihood of the model (lnL
), the
non-synonymous to synonymous rate ratio, or \(\omega\) for the two rate
categories (omega
), the corresponding category frequencies (freq
),
and the list of posterior \(\omega\) per-site estimates for the
amino acid sites (since the DNA alignment length is 270, there are 90
amino acid sites).
In the specific case of branch- and clade-models (with or without site
variability), CodeML expects that the phylogenetic tree contains internal
node labels to identify branches (#x
, where x
is an integer) and
clades ($x
) that should have a different \(\omega\) rate. If these
labels are present in a Newick-formatted tree file, they will be imported
properly by the constructor of Tree
. Otherwise, it is also
possible to add them dynamically using the methods of Tree
objects
(such as find_clade()
and the property label
of Node
instances).
Clustal Omega and Muscle¶
These are two multiple sequence alignment programs both supporting nucleotide
and amino acid sequences. In both cases, alignment of sequences provided as
a Container
object to a resulting Align
object is fairly
straightforward:
>>> aln1 = egglib.wrappers.clustal(cnt)
>>> aln2 = egglib.wrappers.muscle(cnt)
These examples leave all other options to default values. Actually, both
wrappers provide a lot of options, thereby exposing the flexibility of these
two software tools. The options differ significantly between them, reflecting
their different paradigms. Please refer to the manual of the two functions
wrappers.clustal()
and wrappers.muscle()
(and of course the
user’s guide of the two programs) for a detailed list of the available
options.
Note
There are separate wrappers for MUSCLE
version 3 and version 5,
named muscle3()
and muscle5()
. They
have different sets of options because the versions of the program
are different. MUSCLE
5 is the current, improved version, but the
wrapper for version 3 is maintained in hope it might be useful. Only
one can be configured, based on the version of the MUSCLE
program
provided. The generic-named function muscle()
is an alias for
the available version.