Integration of external software

Principle

EggLib provides wrappers of external programs in order to extend its functionalities besides its own population genetics and sequence management tools (although providing a wide array of external tools is not one of the primary aims of EggLib).

The available tools are presented through an integrated interface: once configured, they can be used as any regular EggLib function, using objects from EggLib class as argument and/or return values.

Currently, the wrappers module provides:

  • Multiple sequence alignment using Clustal Omega and MUSCLE.

  • Maximum-likelihood phylogeny using PhyML.

  • Maximum-likelihood fitting and testing of coding sequence evolution using CodeML from the PAML package.

  • Distance-based phylogeny using dnadist, protdist and neighbor from the PHYLIP package.

  • Basic Local Alignment using blastn, blastp, blastx, tblastn, tblastx and makeblastdb from the BLAST+ standalone package.

The requested programs must be installed separately.

Usage

Configuration of application paths

By default, all applications are disabled. Any attempt to use any wrappers function will cause a RuntimeError as in the example below:

>>> import egglib
>>> cnt = egglib.io.from_fasta('sequences3.fas', cls=egglib.Container, alphabet=egglib.alphabets.DNA)
>>> aln = egglib.wrappers.clustal(cnt)
[traceback omitted]
RuntimeError: Clustal Omega program not available -- please configure path

There is a paths object in the wrappers module which is responsible for holding the command executing the external applications. The applications are identified by these keys:

Key

Application

clustal

Clustal Omega

muscle

MUSCLE

phyml

PhyML

codeml

CodeML (PAML package)

dnasdist

dnadist (PHYLIP)

protdist

protdist (PHYLIP)

neighbor

neighbor (PHYLIP)

blastn

blastn (BLAST+)

blastp

blastp (BLAST+)

blastx

blastx (BLAST+)

tblastn

tblastn (BLAST+)

tblastx

tblastx (BLAST+)

makeblastdb

makeblastdb (BLAST+)

We can see that the default value is None:

>>> print(egglib.wrappers.paths['clustal'])
None

This object can be configured using the command-line tools egglib-config, as described in Configuring external applications. The approach for editing paths from Python scripts is described below.

Auto-detection of applications

The method autodetect() tries a set of pre-defined commands and, if any works, set it as the command to run the corresponding application:

>>> egglib.wrappers.paths.autodetect(verbose=True)
> codeml: codeml --> ok
> phyml: phyml --> fail ("No such file or directory")
> clustal: clustalo --> fail ("No such file or directory")
> muscle: muscle --> ok

This method has a verbose option which provide you with information (success or failure, and message error in case of failure). In case of a failure, the corresponding application is disabled but this is not treated as an error.

In this example, the default commands codeml and muscle succeeded, so the two applications are now available. But the command for Clustal Omega was not successful, either because the program is not installed, or not installed under this name. As a result, it is still not available:

>>> print(egglib.wrappers.paths['clustal'])
None

Of course, this result is dependent on system configuration and properties.

Setting application paths manually

It is possible to specify a command string (including the full path of the program executable) for any application. Assume that we have compiled Clustal Omega, but did not install it in the binary search path (the executable remains in the user’s home directory). We can set the path as follows:

>>> egglib.wrappers.paths['clustal'] = '/home/user/data/software/clustal-omega-1.2.1/src/clustalo'
>>> print(egglib.wrappers.paths['clustal'])
/home/user/data/software/clustal-omega-1.2.1/src/clustalo

If the path is incorrect, or if the specified command does not run as expected for this software, an error will be caused.

Saving and loading paths

Once paths have been set satisfactorily, it is possible to save them in a system file in order to have them available for the next session. When EggLib is imported again, the saved paths will be loaded and the properly configured application will be immediately usable. In case a program has been removed, it will cause an error when attempting to run it. The command to save paths is:

>>> egglib.wrappers.paths.save()

If the paths has been modified during a session, it is still possible to reset the paths from the saved configuration, discarding any changes that have been applied since EggLib has been imported, by running:

>>> egglib.wrappers.paths.load()

Using applications in scripts

This manual and the reference manual explain how to use the functions calling the external applications and they are aimed at users who are already experienced with these programs. It can be necessary to refer their own documentation to understand the meaning of available options and returned data.

Note that, for all wrappers, there is a verbose option to control whether or not the program output should be displayed on screen or deleted.

PhyML

PhyML is a software for maximum-likelihood phylogenetic reconstruction using amino acid or nucleotide sequences. It is available through the function wrappers.phyml() (reference included in the documentation). The usage is fairly straightforward: it takes an Align object as argument and returns a tuple with two items (the phylogenetic tree as a Tree object and a dictionary of statistics, respectively):

>>> aln = egglib.io.from_fasta('align8.fas',alphabet=egglib.alphabets.DNA)
>>> tree, stats = egglib.wrappers.phyml(aln, model='HKY85')
>>> print(stats)
{'freqs': [0.2844, 0.19657, 0.22759, 0.29143], 'ti/tv': 4.0, 'pars': 6983, 'lk': -33416.88715, 'size': 5.44449}
>>> print(tree.newick())
(CasLYK3:0.01237505,CasLYK2:0.0,((FvLYK2:0.09853004,(MdLYK3:0.10732566,PpLYK3:0.05300631):0.02447224):0.05989021,(PtLYK3:0.15268966,((MtLYK8:0.14003722,(LjLYS7:0.08155174,(GmLYK3:0.04846372,CacLYK3:0.05809886):0.04392275):0.02845355):0.11447351,((VvLYK3:0.11542039,VvLYK2:0.08334382):0.06445392,(VvLYK1:0.12763572,((PtLYK2:0.05871896,PtLYK1:0.03968281):0.11719099,((((CacLYK1:0.0760468,(GmNFR1a:0.03511832,GmNFR1b:0.0287519):0.02035824):0.03344143,(LjNFR1a:0.07231443,(MtLYK2:0.04784824,(MtLYK3:0.07366331,PsSYM37:0.06907117):0.01683888):0.05128711):0.02178421):0.14129609,((CecLYK1:0.1342407,((MtLYK7:0.11288812,(CacLYK4:0.05284059,GmLYK2:0.048187):0.05687575):0.04162164,((MtLYK1:0.13291815,LjNFR1b:0.11749646):0.06448685,(MtLYK6:0.13391465,LjNFR1c:0.11699053):0.03439881):0.02680216):0.04837524):0.03794071,(CecLYK2:0.07421816,((GmLYK2b:0.05222029,CacLYK2:0.07144586):0.05212459,(LjLYS6:0.05742719,MtLYK9:0.10847179):0.02814181):0.06743708):0.01846505):0.02443726):0.05441246,(CasLYK1:0.10100211,(AtCERK1:0.35336624,(FvLYK1:0.12483414,((MdLYK2:0.05670659,MdLYK1:0.03964285):0.04192224,(PpLYK1:0.06030315,PpLYK2:0.05194511):0.02388192):0.03157062):0.04509856):0.01641018):0.02078968):0.02797866):0.04205346):0.06152622):0.05814891):0.02685265):0.02416823):0.14995695);

Besides the input sequence alignment, this function takes an array of option. Only one is required (model, the model name). See the reference manual for the function for details. Most of the options are passed to the software.

The phylogenetic tree is returned as the first item of the return value. It is an object of the EggLib class Tree, which supports a wide a range of editing operation (management of internal nodes, subtree extraction, search for phylogenetic clades). The second item is a dictionary of statistics including maximum-likelihood estimates of model parameters. See the function manual for details.

CodeML

CodeML is one of the program included in the PAML software package, a widely use collection of tools for the analysis of the evolution of biological sequences. The wrappers.codeml() function of EggLib focuses on the tools describing the evolution of coding sequences.

To be used, this tools requires an alignment of coding sequences (in particular with an alignment length which is a multiple of 3), and a phylogenetic tree whose leaves names match the sequence names. Let us assume that this is what we have with the alignment imported in the previous example. We also have built the phylogenetic with PhyML already. Let us check that the number of samples matches and that the alignment length is a multiple of 3:

>>> print(aln.ns, tree.num_leaves)
17 17
>>> print(aln.ls)
270

Note that it is also necessary to ensure that there are no stop codons in the alignment and that the list of names match exactly. In addition, the tree must not be rooted (the base must be a trifurcation, which is the standard for trees reconstructed by maximum likelihood). This being said, to run PhyML, we only need an Align object, a Tree object, and the name of one of the models:

>>> aln = egglib.tools.to_codons(aln)
>>> results = egglib.wrappers.codeml(aln, tree, 'M1a')
>>> print(results['lnL'])
-2081.576434
>>> print(results['omega'])
[0.0905, 1.0]
>>> print(results['freq'])
[0.84535, 0.15465]
>>> print(results['site_w']['postw'])
[0.091, 0.091, 0.091, 1.0, 0.092, 0.094, 0.091, 0.091, 0.092, 0.107, 0.158, 0.092, 0.112, 0.091, 0.311, 0.091, 0.091, 0.091, 0.091, 0.091, 0.093, 0.136, 0.753, 0.092, 0.603, 0.939, 0.091, 0.091, 0.38, 0.093, 0.091, 0.1, 0.093, 0.097, 0.091, 0.092, 1.0, 0.091, 0.091, 0.091, 0.997, 0.091, 0.093, 0.091, 0.322, 0.092, 0.091, 0.091, 0.091, 0.092, 0.092, 0.091, 0.092, 0.093, 0.091, 0.091, 0.092, 0.091, 0.091, 0.091, 0.129, 0.092, 0.127, 0.091, 0.091, 0.091, 0.092, 0.987, 0.091, 0.168, 0.092, 0.213, 0.689, 0.989, 0.399, 0.997, 0.992, 0.093, 0.134, 0.985, 0.092, 0.099, 0.925, 0.091, 0.104, 0.102, 0.091, 0.091, 0.092, 0.091]

Please refer to the manual of the wrappers.codeml() function for a detailed list of options (including the list of available models), and the list of returned data. The return value is a dict containing a large number of entries exposing most of the contents of CodeML output.

In this example we displayed the log-likelihood of the model (lnL), the non-synonymous to synonymous rate ratio, or \(\omega\) for the two rate categories (omega), the corresponding category frequencies (freq), and the list of posterior \(\omega\) per-site estimates for the amino acid sites (since the DNA alignment length is 270, there are 90 amino acid sites).

In the specific case of branch- and clade-models (with or without site variability), CodeML expects that the phylogenetic tree contains internal node labels to identify branches (#x, where x is an integer) and clades ($x) that should have a different \(\omega\) rate. If these labels are present in a Newick-formatted tree file, they will be imported properly by the constructor of Tree. Otherwise, it is also possible to add them dynamically using the methods of Tree objects (such as find_clade() and the property label of Node instances).

Clustal Omega and Muscle

These are two multiple sequence alignment programs both supporting nucleotide and amino acid sequences. In both cases, alignment of sequences provided as a Container object to a resulting Align object is fairly straightforward:

>>> aln1 = egglib.wrappers.clustal(cnt)
>>> aln2 = egglib.wrappers.muscle(cnt)

These examples leave all other options to default values. Actually, both wrappers provide a lot of options, thereby exposing the flexibility of these two software tools. The options differ significantly between them, reflecting their different paradigms. Please refer to the manual of the two functions wrappers.clustal() and wrappers.muscle() (and of course the user’s guide of the two programs) for a detailed list of the available options.

Note

There are separate wrappers for MUSCLE version 3 and version 5, named muscle3() and muscle5(). They have different sets of options because the versions of the program are different. MUSCLE 5 is the current, improved version, but the wrapper for version 3 is maintained in hope it might be useful. Only one can be configured, based on the version of the MUSCLE program provided. The generic-named function muscle() is an alias for the available version.