Integration of external software

Principle

EggLib provides wrappers of external programs in order to extend its functionalities besides its own population genetics and sequence management tools (although provided a wide array of external tools is not one of the primary aims of EggLib).

The available tools are presented through an integrated interface: once configured, they can be used as any regular EggLib function, using objects from EggLib class as argument and/or return values.

Currently, the wrappers module provides:

  • Multiple sequence alignment using Clustal Omega and MUSCLE.

  • Maximum-likelihood phylogeny using PhyML.

  • Maximum-likelihood fitting and testing of coding sequence evolution using CodeML from the PAML package.

  • Distance-based phylogeny using dnadist, protdist and neighbor from the PHYLIP package.

  • Basic Local Alignment using blastn, blastp, blastx, tblastn, tblastx and makeblastdb from the BLAST+ standalone package.

Usage

Configuration of application paths

By default, all applications are disabled. Any attempt to use any wrappers function will cause a RuntimeError as in the example below:

>>> import egglib
>>> cnt = egglib.io.from_fasta('sequences3.fas', cls=egglib.Container, alphabet=egglib.alphabets.DNA)
>>> aln = egglib.wrappers.clustal(cnt)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/.local/lib/python3.8/site-packages/egglib/wrappers/_utils.py", line 65, in _f
    return f(*args, **kwargs)
  File "/home/user/.local/lib/python3.8/site-packages/egglib/wrappers/_clustal.py", line 135, in clustal
    raise RuntimeError('Clustal Omega program not available -- please configure path')
RuntimeError: Clustal Omega program not available -- please configure path

There is a paths object in the wrappers module which is responsible for holding the command executing the external applications. The applications are identified by these keys:

Key

Application

clustal

Clustal Omega

muscle

MUSCLE

phyml

PhyML

codeml

CodeML (PAML package)

dnasdist

dnadist (PHYLIP)

protdist

protdist (PHYLIP)

neighbor

neighbor (PHYLIP)

blastn

blastn (BLAST+)

blastp

blastp (BLAST+)

blastx

blastx (BLAST+)

tblastn

tblastn (BLAST+)

tblastx

tblastx (BLAST+)

makeblastdb

makeblastdb (BLAST+)

We can see that the default value is None:

>>> print(egglib.wrappers.paths['clustal'])
None

There are two ways to configure this object, described in the following two paragraphs.

Auto-detection of applications

One is the method autodetect(). This method tries a set of pre-defined commands and, if any works, set it as the command to run the corresponding application:

>>> egglib.wrappers.paths.autodetect(verbose=True)
> codeml: codeml --> ok
> phyml: phyml --> fail ("No such file or directory")
> clustal: clustalo --> fail ("No such file or directory")
> muscle: muscle --> ok

This method has a verbose option which provide you with information (success or failure, and message error in case of failure). In case of a failure, the corresponding application is disabled but this is not treated as an error.

In this example, the default commands codeml and muscle succeeded, so the two applications are now available. But the command for Clustal Omega was not successful, either because the program is not installed, or not installed under this name. As a result, it is still not available:

>>> print(egglib.wrappers.paths['clustal'])
None

Of course, this result is dependent on the current status of the system.

Setting application paths manually

It is possible to specify a command string (including the full path of the program executable) for any application. Assume that we have compiled Clustal Omega, but did not install it in the binary search path (it is just lying in the home folder of one of the authors). We can set the paths as follows:

>>> egglib.wrappers.paths['clustal'] = '/home/user/data/software/clustal-omega-1.2.1/src/clustalo'
>>> print(egglib.wrappers.paths['clustal'])
/home/user/data/software/clustal-omega-1.2.1/src/clustalo

If the path is incorrect, or if the specified command does not run as expected for this software, an error will be caused.

Saving and loading paths

Once paths have been set satisfactorily, it is possible to save them in a system file in order to have them available for the next session. When EggLib is imported again, the saved paths will be loaded and the properly configured application will be immediately usable. In case a program has been removed, it will cause an error when attempting to run it. The command to save paths is:

>>> egglib.wrappers.paths.save()

Since it is written within EggLib installation (in the current version), it might requires administration rights.

If the paths has been modified during a session, it is still possible to reset the paths from the saved configuration, discarding any changes that have been applied since EggLib has been imported, by running: egglib.wrappers.paths.load().

Using applications in scripts

This manual and the reference manual explain how to use the functions calling the external applications and they are aimed at users who are already experienced with these programs. It can be necessary to refer their own documentation to understand the meaning of available options and returned data.

Note that, for all wrappers, there is a verbose option to control whether or not the program output should be displayed on screen or deleted.

PhyML

PhyML is a software for maximum-likelihood phylogenetic reconstruction using amino acid or nucleotide sequences. It is available through the function wrappers.phyml() (reference included in the documentation). The usage is fairly straightforward: it takes an Align object as argument and returns a tuple with two items (the phylogenetic tree as a Tree object and a dictionary of statistics, respectively):

>>> aln = egglib.io.from_fasta('align8.fas',alphabet=egglib.alphabets.DNA)
>>> tree, stats = egglib.wrappers.phyml(aln, model='HKY85')
>>> print(stats)
{'freqs': [0.2844, 0.19657, 0.22759, 0.29143], 'ti/tv': 4.0, 'pars': 6983, 'lk': -33416.88715, 'size': 5.44449}
>>> print(tree.newick())
(CasLYK3:0.01237505,CasLYK2:0.0,((FvLYK2:0.09853004,(MdLYK3:0.10732566,PpLYK3:0.05300631):0.02447224):0.05989021,(PtLYK3:0.15268966,((MtLYK8:0.14003722,(LjLYS7:0.08155174,(GmLYK3:0.04846372,CacLYK3:0.05809886):0.04392275):0.02845355):0.11447351,((VvLYK3:0.11542039,VvLYK2:0.08334382):0.06445392,(VvLYK1:0.12763572,((PtLYK2:0.05871896,PtLYK1:0.03968281):0.11719099,((((CacLYK1:0.0760468,(GmNFR1a:0.03511832,GmNFR1b:0.0287519):0.02035824):0.03344143,(LjNFR1a:0.07231443,(MtLYK2:0.04784824,(MtLYK3:0.07366331,PsSYM37:0.06907117):0.01683888):0.05128711):0.02178421):0.14129609,((CecLYK1:0.1342407,((MtLYK7:0.11288812,(CacLYK4:0.05284059,GmLYK2:0.048187):0.05687575):0.04162164,((MtLYK1:0.13291815,LjNFR1b:0.11749646):0.06448685,(MtLYK6:0.13391465,LjNFR1c:0.11699053):0.03439881):0.02680216):0.04837524):0.03794071,(CecLYK2:0.07421816,((GmLYK2b:0.05222029,CacLYK2:0.07144586):0.05212459,(LjLYS6:0.05742719,MtLYK9:0.10847179):0.02814181):0.06743708):0.01846505):0.02443726):0.05441246,(CasLYK1:0.10100211,(AtCERK1:0.35336624,(FvLYK1:0.12483414,((MdLYK2:0.05670659,MdLYK1:0.03964285):0.04192224,(PpLYK1:0.06030315,PpLYK2:0.05194511):0.02388192):0.03157062):0.04509856):0.01641018):0.02078968):0.02797866):0.04205346):0.06152622):0.05814891):0.02685265):0.02416823):0.14995695);

Besides the input sequence alignment, this function takes an array of option. Only one is required (model, the model name). See the reference manual for the function for details. Most of the options are passed to the software.

The phylogenetic tree is returned as the first item of the return value. It is an object of the EggLib class Tree, which supports a wide a range of editing operation (management of internal nodes, subtree extraction, search for phylogenetic clades). The second item is a dictionary of statistics including maximum-likelihood estimates of model parameters. See the function manual for details.

CodeML

CodeML is one of the program included in the PAML software package, a widely use collection of tools for the analysis of the evolution of biological sequences. The wrappers.codeml() function of EggLib focuses on the tools describing the evolution of coding sequences.

To be used, this tools requires an alignment of coding sequences (in particular with an alignment length which is a multiple of 3), and a phylogenetic tree whose leaves names match the sequence names. Let us assume that this is what we have with the alignment imported in the previous alignment. We also have built the phylogenetic with PhyML already. Let us check that the number of samples matches and that the alignment length is a multiple of 3:

>>> print(aln.ns, tree.num_leaves)
17 17
>>> print(aln.ls)
270

Note that it is also necessary to ensure that there is no stop codons in the alignment and that the list of names match exactly. In addition, the tree must not be rooted (the base must be a trifurcation, which is necessary for all tree reconstructed by maximum likelihood anyway). This being said, to run PhyML, we only need an Align object, a Tree object, and the name of one of the models:

>>> aln = egglib.tools.to_codons(aln)
>>> results = egglib.wrappers.codeml(aln, tree, 'M1a')
>>> print(results['lnL'])
-2081.576434
>>> print(results['omega'])
[0.0905, 1.0]
>>> print(results['freq'])
[0.84535, 0.15465]
>>> print(results['site_w']['postw'])
[0.091, 0.091, 0.091, 1.0, 0.092, 0.094, 0.091, 0.091, 0.092, 0.107, 0.158, 0.092, 0.112, 0.091, 0.311, 0.091, 0.091, 0.091, 0.091, 0.091, 0.093, 0.136, 0.753, 0.092, 0.603, 0.939, 0.091, 0.091, 0.38, 0.093, 0.091, 0.1, 0.093, 0.097, 0.091, 0.092, 1.0, 0.091, 0.091, 0.091, 0.997, 0.091, 0.093, 0.091, 0.322, 0.092, 0.091, 0.091, 0.091, 0.092, 0.092, 0.091, 0.092, 0.093, 0.091, 0.091, 0.092, 0.091, 0.091, 0.091, 0.129, 0.092, 0.127, 0.091, 0.091, 0.091, 0.092, 0.987, 0.091, 0.168, 0.092, 0.213, 0.689, 0.989, 0.399, 0.997, 0.992, 0.093, 0.134, 0.985, 0.092, 0.099, 0.925, 0.091, 0.104, 0.102, 0.091, 0.091, 0.092, 0.091]

Please refer to the manual of the wrappers.codeml() function for a detailed list of options (including the list of available models), and the list of returned data. The return value is a dict containing a large number of entries exposing most of the contents of CodeML output.

In this example we displayed the log-likelihood of the model (lnL), the non-synonymous to synonymous rate ratio, or \(\omega\) for the two rate categories (omega), the corresponding category frequencies (freq), and the list of posterior \(\omega\) per-site estimates for the amino acid sites (since the DNA alignment length is 270, there are 90 amino acid sites).

In the specific case of branch- and clade-models (with or without site variable), CodeML expects that the phylogenetic tree contains internal node labels to identify branches (#x, where x is an integer) and clades ($x) that should have a different \(\omega\) rate. If these labels are present in a Newick-formatted tree file, they will be imported properly by the constructor of Tree. Otherwise, it is also possible to add them dynamically using the methods of Tree objects (such as find_clade() and the property label of Node instances).

Clustal Omega and Muscle

These are two multiple sequence alignment programs both supporting nucleotide and amino acid sequences, and both able to refine (or add sequences to) existing alignments. In both cases, alignment of sequences provided as a Container object to a resulting Align object is fairly straightforward:

>>> aln1 = egglib.wrappers.clustal(cnt)
>>> aln2 = egglib.wrappers.muscle(cnt)

These examples leave all other options to default values. Actually, both wrappers provide a lot of options, thereby exposing the flexibility of these two software tools. The options differ significantly between them, reflecting their different paradigms. Please refer to the manual of the two functions wrappers.clustal() and wrappers.muscle() (and of course the user’s guide of the two programs) for a detailed list of the available options.