Performing simulations¶

Simplest case¶

The most straighforward way to perform coalescent simulations is to run the simul() method of Simulator instances. This method uses the current value of parameters and returns a brand new Align instance containing simulated data. This instance will always have the desired number of samples. If the infinite site model of mutation is used (num_sites=0, by default), then the number of sites can vary, as shown by the two consecutive simulations with the same model below:

>>> c = egglib.coalesce.Simulator(num_pop=1, num_chrom=[5], theta=5.0)
>>> aln = c.simul()
>>> print(aln.ns, aln.ls)
5 27
>>> print(aln.fasta(alphabet=egglib.alphabets.DNA))
>
ACCCCCCACAAAAAAAACCAAACCACC
>
ACCCACCACAAAAAAAACCAAACCCAC
>
CAAAAAACACCCCCCCCAACCCAAAAA
>
ACCCACCACAAAAAAAACCAAACCCAC
>
ACCCACCACAAAAAAAACCAAACCCAC
>>> aln = c.simul()
>>> print(aln.ns, aln.ls)
5 2

Note that, to print sequences, we used the alphabet option of fasta(). This is because simulated alleles are (in this case, with the number of alleles fixed to 2) the integers 0 and 1, which is not supported by default by fasta().

If the number of sites is fixed by num_sites, then the number of sites will be fixed, but it will not be guaranteed that all sites will be variable. Here is an example with a low \(\theta\), 4 alleles, and the number of sites set to 50, to simulate a short fragment of DNA:

c.params['theta'] = 1.0
c.params['num_alleles'] = 4
c.params['num_sites'] = 50
aln = c.simul()
print(aln.ns, aln.ls)
5 50
print(aln.fasta(alphabet=egglib.alphabets.DNA))
>
AACAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAGAAAAAAAAAAAA
>
TAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Model with populations, individuals, and outgroup¶

You may have noticed that the resulting alignment has empty names. However it has two levels of group labels (identifying respectively populations and individuals, in that order). This is demonstrated below:

>>> c = egglib.coalesce.Simulator(num_pop=3, num_indiv=[4, 4, 1], theta=5.0)
>>> c.params.add_event(cat='merge', T=0.5, src=0, dst=1)
>>> c.params.add_event(cat='merge', T=3, src=2, dst=1)
>>> aln = c.simul()
>>> print(aln.ns, aln.ls)
16 48
>>> print(aln.fasta(alphabets=egglib.alphabets.DNA, labels=True))
>@0,0
CCAAACAAACAACAAAAAAAAAAAAAACCAACACCAAAAAAAAACAA
>@0,0
CCCAAAAAACAACAAAAAAAAAAAAACCCAACACCAAAAAAAAACAA
>@0,1
CCAAAACAACAACAAAAAAAAAAAAAACCAACACCAAAAAAAAACAA
>@0,1
CCCAAAAAACAACAAAAAAAAAAAAAACCAACACCAAAAAAAAACAA
>@0,2
CCCAAAAAACAACAAAAAAAAAAAAAACCAACACCAAAAAAAAACAA
>@0,2
CCCAAAAAACAACAAAAAAAAAAAAAACCAACACCAAAAAAAAACAA
>@0,3
CCCAAAAAACAACAAAAAAAAAAAAAACCAACACCAAAAAAAAACAA
>@0,3
CCCAAAAAACAACAAAAAAAAAAAAAACCAACACCAAAAAAAAACAA
>@1,4
CCAAAAAAACAACAAAAAAAAAACAAAACAAAAACCAAAAAACCCAA
>@1,4
CCAAAAAAACACCAAAAAAAAAACAAAACAAAAACCAAAAAACCCAA
>@1,5
CCAAAAAAACCACAAAAAAAAAAAAAAACAAAAACCAAAAACCACAA
>@1,5
CCAACAAAACAACACAAAAAACAAAAAACACAAACCAAAAAACACAA
>@1,6
CCAAAAAAACAACAAAAAAAAAACAAAACAAAAACCAAAAAACCCAA
>@1,6
CCAAAAAAACAACAAAAACAAAACAAAACAAAAACCAAAAAACCCAA
>@1,7
CCAAAAAAACAACAAAAAAAAAACAAAACAAAAACCAAAAAACCCAA
>@1,7
CCAAAAAAACAACCAAAAAAAAACAAAACAAAAACCAAAAAACCCAA
>@2,8
AAACAAACCAAAAAACCCACCACACCAAACAAAAAAACCCCAAAACC
>@2,8
AAACAAACCAAAAAACCAACCACACCAAACAACAAACCCCCAAAACC

All Align instances returned by Simulator have these two levels of structure.

The iterator¶

The simul() method of Simulator returns a brand new Align instance at each call, which represents a significant computational burden when performing a lot of simulations if each alignment only needs to be processed once and does not need to be stored. Simulator instances have another method, iter_simul(), that is much more efficient since it reuses constantly the same Align instance.

Currently, the expression:

>>> for aln in c.iter_simuls(1000):
...

is above 50 times faster than:

>>> for i in range(1000):
...     aln = c.simul()
...

But it should be kept in mind that the aln variable will actually always point to the same instance (which is a static alignment stored in the Simulator instance, and which is automatically reset at the end of the loop). Printing alignments demonstrate it:

>>> aln_list = []
>>> for i in range(5):
...     aln = c.simul()
...     aln_list.append(aln)
...
>>> print(map(hash, aln_list))
[8733823164941, 8733823165037, 8733823165093, 8733823165109, 8733823165125]

>>> del aln_list[:]
>>> for aln in c.iter_simul(5):
...     aln_list.append(aln)
...
>>> print(map(hash, aln_list))
[8733823165089, 8733823165089, 8733823165089, 8733823165089, 8733823165089]

Coupling simulations and statistics¶

iter_simul() can be used to chain diversity analysis with coalescent simulations. You need to create your own ComputeStats instance and specify what statistics you want to be computed, then pass it to iter_simul() as the cs argument. Instead of the alignment, a dictionary of statistics will be computed:

>>> cs = egglib.stats.ComputeStats()
>>> cs.add_stats('D', 'S')
>>> for stats in c.iter_simul(10, cs=cs):
...     print(stats)
...
{'S': 24, 'D': -0.20644191669019338}
{'S': 26, 'D': 0.34339003894229597}
{'S': 34, 'D': 0.3435768693207146}
{'S': 15, 'D': 0.21333685716317954}
{'S': 37, 'D': 1.065702801735725}
{'S': 24, 'D': -0.2017428852888509}
{'S': 23, 'D': 0.4068741818892754}
{'S': 14, 'D': 0.9424874038796626}
{'S': 34, 'D': -1.3365645916387778}
{'S': 27, 'D': -1.3531206659089505}

Within the iteration loop, it is still possible to retrieve the static alignment (even if the loop returns dictionaries of statistics), through the instance property align.

Varying parameters over simulations¶

It is possible to perform coalescent simulations with variable parameters (a different value for each replicate), using the feed_params feature of iter_simul(). To use it, keyword arguments must be supplied with lists of parameter values matching parameter names.

For example, to run the simplest model with one constant population with a variable value for the theta parameter, this is how it would be done (assuming the varying theta values have been drawn previously randomly, or provided externally):

>>> c = egglib.coalesce.Simulator(num_pop=1, num_chrom=[20])
>>> theta_values = [2.5645, 4.4111, 6.5677, 1.8904, 2.1915, 0.9696, 2.8418, 5.221, 4.9423, 9.0793]
>>> for aln in c.iter_simul(10, theta=theta_values):
...     print(aln.ls)
13
13
42
8
15
1
11
12
12
32

Several parameters can be provided at once (just use as many keyword arguments as needed). Unfortunately, it is currently not possible to set this way population-specific parameters or arguments of historical events.

Accessing other data¶

Getting simulated trees¶

The genealogical trees representing the history of sample can be retrieved for all simulations. To do it, one must provide a list to iter_simul() and the list will be filled by the genealogical trees of each replicate. The trees are represented by instances of the class Tree which provides an array of functionality (see the reference manual for details).

In case recombination occurs, there can be multiple trees for a given simulation: one tree for each non-recombination segment. For this reason, each simulation is represented by a list of (tree, start, stop) tuple instances in the resulting list (start and stop positions are bounded by 0 and 1). This will be probably clearer with an example:

>>> c = egglib.coalesce.Simulator(num_pop=1, num_chrom=[10], recomb=1.0, theta=0.0)
>>> trees = []
>>> for aln in c.iter_simul(100, dest_trees=trees):
...     pass

>>> print(len(trees[0]))
3
>>> for tree, start, stop in trees[0]:
...     print('segment')
...     print('  ', start)
...     print('  ', stop)
...     print('  ', tree.newick())
...
segment
   0.0
   0.140169298276
   (((((2:0.0180097202807,8:0.0180097202807):0.0215616741102,5:0.0395713943908):0.0724922358078,1:0.112063630199):0.0558628762791,((3:0.0495320422666,(7:0.00261506458809,6:0.00261506458809):0.0469169776785):0.00203223981633,4:0.0515642820829):0.116362224395):0.188555286584,(9:0.0369467779977,0:0.0369467779977):0.319535015065);
segment
   0.936298474669
   1.0
   (((((2:0.0180097202807,8:0.0180097202807):0.0215616741102,5:0.0395713943908):0.128355112087,(3:0.0515642820829,4:0.0515642820829):0.116362224395):0.0797638476642,(1:0.0470800821279,(7:0.00261506458809,6:0.00261506458809):0.0444650175398):0.200610272014):0.10879143892,(9:0.0369467779977,0:0.0369467779977):0.319535015065);
segment
   0.140169298276
   0.936298474669
   (((((2:0.0180097202807,8:0.0180097202807):0.0215616741102,5:0.0395713943908):0.128355112087,((3:0.0495320422666,(7:0.00261506458809,6:0.00261506458809):0.0469169776785):0.00203223981633,4:0.0515642820829):0.116362224395):0.0797638476642,1:0.247690354142):0.10879143892,(9:0.0369467779977,0:0.0369467779977):0.319535015065);

In this example, the first simulation went through two events of recombination, yielding three segments with breakpoints at approximate positions 0.1402 and 0.9363. The Newick representation of each of the three trees for the first simulation are displayed. Notice that the trees are not sorted within the per-simulation list (the order is defined by the order of the recombination events in the simulation).

If recombination is null, then each simulation is a one-item list with a single tuple with start equal to 0 and stop equal to 1.

Getting the structure of the model¶

It can be useful to have the Structure instance describing the population and individual structure of the alignments that will be generated by simulations (in order to compute diversity statistics). It is possible to do it using any Align instance, but this is not necessarily convenient (in particular if you plan to use the cs option of iter_simul()). The attribute params provides a method to generate the Structure describing the population and individual structure based on model parameters:

>>> c = egglib.coalesce.Simulator(num_pop=3, num_indiv=[4, 4, 1], theta=5.0)
>>> c.params.add_event(cat='merge', T=0.5, src=0, dst=1)
>>> c.params.add_event(cat='merge', T=3, src=2, dst=1)
>>> struct = c.params.mk_structure(outgroup_label='2')
>>> print(struct.as_dict())
({None: {'0': {'0': [0, 1], '1': [2, 3], '2': [4, 5], '3': [6, 7]}, '1': {'4': [8, 9], '5': [10, 11], '6': [12, 13], '7': [14, 15]}}}, {'8': [16, 17]})

>>> cs = egglib.stats.ComputeStats()
>>> cs.add_stats('FstWC', 'Fis')
>>> cs.set_structure(struct)
>>> for stats in c.iter_simul(10, cs=cs):
...     print(stats)
{'FstWC': 0.4208284939992256, 'Fis': 0.20152091254752857}
{'FstWC': 0.5204081632653061, 'Fis': 0.40327868852459026}
{'FstWC': 0.7218045112781954, 'Fis': 0.5672727272727273}
{'FstWC': 0.5411255411255411, 'Fis': 0.4494382022471909}
{'FstWC': 0.6054421768707483, 'Fis': 0.6181818181818182}
{'FstWC': 0.4155844155844157, 'Fis': 0.39607843137254906}
{'FstWC': 0.7689594356261023, 'Fis': 0.762402088772846}
{'FstWC': 0.6802721088435374, 'Fis': 0.5980861244019139}
{'FstWC': 0.34833659491193747, 'Fis': 0.378132118451025}
{'FstWC': 0.2332361516034986, 'Fis': 0.11363636363636342}

The fact that both \(F_{IS}\) and Weir and Cockerham’s \(\hat{\theta}\) (labelled FstWC) can be computed shows that the two individual and population levels have been properly described.

If you want to add a cluster level, or, for example, want to analyse the data using a different structure than the one used for simulations, you need to create your own Structure instance.

Note

If you mix sampled individuals and sampled chromosomes (non-zero values for any num_chrom and any num_indiv items), it will not be possible to process the individual level due to non-constant ploidy. In this case (to process the population level anyway), use the skip_indiv option of mk_structure().