---------------------- Performing simulations ---------------------- Simplest case ============= The most straighforward way to perform coalescent simulations is to run the :meth:`~.Simulator.simul` method of :class:`~.coalesce.Simulator` instances. This method uses the current value of parameters and returns a brand new :class:`.Align` instance containing simulated data. This instance will always have the desired number of samples. If the infinite site model of mutation is used (``num_sites=0``, by default), then the number of sites can vary, as shown by the two consecutive simulations with the same model below:: >>> c = egglib.coalesce.Simulator(num_pop=1, num_chrom=[5], theta=5.0) >>> aln = c.simul() >>> print(aln.ns, aln.ls) 5 27 >>> print(aln.fasta(alphabet=egglib.alphabets.DNA)) > ACCCCCCACAAAAAAAACCAAACCACC > ACCCACCACAAAAAAAACCAAACCCAC > CAAAAAACACCCCCCCCAACCCAAAAA > ACCCACCACAAAAAAAACCAAACCCAC > ACCCACCACAAAAAAAACCAAACCCAC >>> aln = c.simul() >>> print(aln.ns, aln.ls) 5 2 Note that, to print sequences, we used the *alphabet* option of :meth:`~.Align.fasta`. This is because simulated alleles are (in this case, with the number of alleles fixed to 2) the integers 0 and 1, which is not supported by default by :meth:`!fasta`. If the number of sites is fixed by ``num_sites``, then the number of sites will be fixed, but it will not be guaranteed that all sites will be variable. Here is an example with a low :math:`\theta`, 4 alleles, and the number of sites set to 50, to simulate a short fragment of DNA:: c.params['theta'] = 1.0 c.params['num_alleles'] = 4 c.params['num_sites'] = 50 aln = c.simul() print(aln.ns, aln.ls) 5 50 print(aln.fasta(alphabet=egglib.alphabets.DNA)) > AACAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAAGAAAAAAAAAAAA > TAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Model with populations, individuals, and outgroup ************************************************* You may have noticed that the resulting alignment has empty names. However it has two levels of group labels (identifying respectively populations and individuals, in that order). This is demonstrated below: .. image:: /pict/model13.* :height: 200px :width: 200 px :align: center :: >>> c = egglib.coalesce.Simulator(num_pop=3, num_indiv=[4, 4, 1], theta=5.0) >>> c.params.add_event(cat='merge', T=0.5, src=0, dst=1) >>> c.params.add_event(cat='merge', T=3, src=2, dst=1) >>> aln = c.simul() >>> print(aln.ns, aln.ls) 16 48 >>> print(aln.fasta(alphabets=egglib.alphabets.DNA, labels=True)) >@0,0 CCAAACAAACAACAAAAAAAAAAAAAACCAACACCAAAAAAAAACAA >@0,0 CCCAAAAAACAACAAAAAAAAAAAAACCCAACACCAAAAAAAAACAA >@0,1 CCAAAACAACAACAAAAAAAAAAAAAACCAACACCAAAAAAAAACAA >@0,1 CCCAAAAAACAACAAAAAAAAAAAAAACCAACACCAAAAAAAAACAA >@0,2 CCCAAAAAACAACAAAAAAAAAAAAAACCAACACCAAAAAAAAACAA >@0,2 CCCAAAAAACAACAAAAAAAAAAAAAACCAACACCAAAAAAAAACAA >@0,3 CCCAAAAAACAACAAAAAAAAAAAAAACCAACACCAAAAAAAAACAA >@0,3 CCCAAAAAACAACAAAAAAAAAAAAAACCAACACCAAAAAAAAACAA >@1,4 CCAAAAAAACAACAAAAAAAAAACAAAACAAAAACCAAAAAACCCAA >@1,4 CCAAAAAAACACCAAAAAAAAAACAAAACAAAAACCAAAAAACCCAA >@1,5 CCAAAAAAACCACAAAAAAAAAAAAAAACAAAAACCAAAAACCACAA >@1,5 CCAACAAAACAACACAAAAAACAAAAAACACAAACCAAAAAACACAA >@1,6 CCAAAAAAACAACAAAAAAAAAACAAAACAAAAACCAAAAAACCCAA >@1,6 CCAAAAAAACAACAAAAACAAAACAAAACAAAAACCAAAAAACCCAA >@1,7 CCAAAAAAACAACAAAAAAAAAACAAAACAAAAACCAAAAAACCCAA >@1,7 CCAAAAAAACAACCAAAAAAAAACAAAACAAAAACCAAAAAACCCAA >@2,8 AAACAAACCAAAAAACCCACCACACCAAACAAAAAAACCCCAAAACC >@2,8 AAACAAACCAAAAAACCAACCACACCAAACAACAAACCCCCAAAACC All :class:`!Align` instances returned by :class:`!Simulator` have these two levels of structure. The iterator ============ The :meth:`~.coalesce.Simulator.simul` method of :class:`!Simulator` returns a brand new :class:`!Align` instance at each call, which represents a significant computational burden when performing a lot of simulations if each alignment only needs to be processed once and does not need to be stored. :class:`!Simulator` instances have another method, :meth:`~.coalesce.Simulator.iter_simul`, that is much more efficient since it reuses constantly the same :class:`!Align` instance. Currently, the expression:: >>> for aln in c.iter_simuls(1000): ... is above 50 times faster than:: >>> for i in range(1000): ... aln = c.simul() ... But it should be kept in mind that the ``aln`` variable will actually always point to the same instance (which is a static alignment stored in the :class:`!Simulator` instance, and which is automatically reset at the end of the loop). Printing alignments demonstrate it:: >>> aln_list = [] >>> for i in range(5): ... aln = c.simul() ... aln_list.append(aln) ... >>> print(map(hash, aln_list)) [8733823164941, 8733823165037, 8733823165093, 8733823165109, 8733823165125] >>> del aln_list[:] >>> for aln in c.iter_simul(5): ... aln_list.append(aln) ... >>> print(map(hash, aln_list)) [8733823165089, 8733823165089, 8733823165089, 8733823165089, 8733823165089] Coupling simulations and statistics *********************************** :meth:`!iter_simul` can be used to chain diversity analysis with coalescent simulations. You need to create your own :class:`~.stats.ComputeStats` instance and specify what statistics you want to be computed, then pass it to :meth:`!iter_simul` as the *cs* argument. Instead of the alignment, a dictionary of statistics will be computed:: >>> cs = egglib.stats.ComputeStats() >>> cs.add_stats('D', 'S') >>> for stats in c.iter_simul(10, cs=cs): ... print(stats) ... {'S': 24, 'D': -0.20644191669019338} {'S': 26, 'D': 0.34339003894229597} {'S': 34, 'D': 0.3435768693207146} {'S': 15, 'D': 0.21333685716317954} {'S': 37, 'D': 1.065702801735725} {'S': 24, 'D': -0.2017428852888509} {'S': 23, 'D': 0.4068741818892754} {'S': 14, 'D': 0.9424874038796626} {'S': 34, 'D': -1.3365645916387778} {'S': 27, 'D': -1.3531206659089505} Within the iteration loop, it is still possible to retrieve the static alignment (even if the loop returns dictionaries of statistics), through the instance property :py:obj:`~.Simulator.align`. Varying parameters over simulations *********************************** It is possible to perform coalescent simulations with variable parameters (a different value for each replicate), using the ``feed_params`` feature of :meth:`!iter_simul`. To use it, keyword arguments must be supplied with lists of parameter values matching parameter names. For example, to run the simplest model with one constant population with a variable value for the ``theta`` parameter, this is how it would be done (assuming the varying ``theta`` values have been drawn previously randomly, or provided externally):: >>> c = egglib.coalesce.Simulator(num_pop=1, num_chrom=[20]) >>> theta_values = [2.5645, 4.4111, 6.5677, 1.8904, 2.1915, 0.9696, 2.8418, 5.221, 4.9423, 9.0793] >>> for aln in c.iter_simul(10, theta=theta_values): ... print(aln.ls) 13 13 42 8 15 1 11 12 12 32 Several parameters can be provided at once (just use as many keyword arguments as needed). Unfortunately, it is currently not possible to set this way population-specific parameters or arguments of historical events. Accessing other data ==================== Getting simulated trees *********************** The genealogical trees representing the history of sample can be retrieved for all simulations. To do it, one must provide a :class:`list` to :meth:`~.coalesce.Simulator.iter_simul` and the list will be filled by the genealogical trees of each replicate. The trees are represented by instances of the class :class:`.Tree` which provides an array of functionality (see the reference manual for details). In case recombination occurs, there can be multiple trees for a given simulation: one tree for each non-recombination segment. For this reason, each simulation is represented by a :class:`!list` of ``(tree, start, stop)`` :class:`tuple` instances in the resulting list (start and stop positions are bounded by 0 and 1). This will be probably clearer with an example:: >>> c = egglib.coalesce.Simulator(num_pop=1, num_chrom=[10], recomb=1.0, theta=0.0) >>> trees = [] >>> for aln in c.iter_simul(100, dest_trees=trees): ... pass >>> print(len(trees[0])) 3 >>> for tree, start, stop in trees[0]: ... print('segment') ... print(' ', start) ... print(' ', stop) ... print(' ', tree.newick()) ... segment 0.0 0.140169298276 (((((2:0.0180097202807,8:0.0180097202807):0.0215616741102,5:0.0395713943908):0.0724922358078,1:0.112063630199):0.0558628762791,((3:0.0495320422666,(7:0.00261506458809,6:0.00261506458809):0.0469169776785):0.00203223981633,4:0.0515642820829):0.116362224395):0.188555286584,(9:0.0369467779977,0:0.0369467779977):0.319535015065); segment 0.936298474669 1.0 (((((2:0.0180097202807,8:0.0180097202807):0.0215616741102,5:0.0395713943908):0.128355112087,(3:0.0515642820829,4:0.0515642820829):0.116362224395):0.0797638476642,(1:0.0470800821279,(7:0.00261506458809,6:0.00261506458809):0.0444650175398):0.200610272014):0.10879143892,(9:0.0369467779977,0:0.0369467779977):0.319535015065); segment 0.140169298276 0.936298474669 (((((2:0.0180097202807,8:0.0180097202807):0.0215616741102,5:0.0395713943908):0.128355112087,((3:0.0495320422666,(7:0.00261506458809,6:0.00261506458809):0.0469169776785):0.00203223981633,4:0.0515642820829):0.116362224395):0.0797638476642,1:0.247690354142):0.10879143892,(9:0.0369467779977,0:0.0369467779977):0.319535015065); In this example, the first simulation went through two events of recombination, yielding three segments with breakpoints at approximate positions 0.1402 and 0.9363. The Newick representation of each of the three trees for the first simulation are displayed. Notice that the trees are not sorted within the per-simulation list (the order is defined by the order of the recombination events in the simulation). If recombination is null, then each simulation is a one-item :class:`!list` with a single :class:`!tuple` with ``start`` equal to 0 and ``stop`` equal to 1. .. _coalesce-structure: Getting the structure of the model ********************************** It can be useful to have the :class:`.Structure` instance describing the population and individual structure of the alignments that will be generated by simulations (in order to compute diversity statistics). It is possible to do it using any :class:`!Align` instance, but this is not necessarily convenient (in particular if you plan to use the *cs* option of :meth:`!iter_simul`). The attribute :attr:`!params` provides a method to generate the :class:`!Structure` describing the population and individual structure based on model parameters:: >>> c = egglib.coalesce.Simulator(num_pop=3, num_indiv=[4, 4, 1], theta=5.0) >>> c.params.add_event(cat='merge', T=0.5, src=0, dst=1) >>> c.params.add_event(cat='merge', T=3, src=2, dst=1) >>> struct = c.params.mk_structure(outgroup_label='2') >>> print(struct.as_dict()) ({None: {'0': {'0': [0, 1], '1': [2, 3], '2': [4, 5], '3': [6, 7]}, '1': {'4': [8, 9], '5': [10, 11], '6': [12, 13], '7': [14, 15]}}}, {'8': [16, 17]}) >>> cs = egglib.stats.ComputeStats() >>> cs.add_stats('FstWC', 'Fis') >>> cs.set_structure(struct) >>> for stats in c.iter_simul(10, cs=cs): ... print(stats) {'FstWC': 0.4208284939992256, 'Fis': 0.20152091254752857} {'FstWC': 0.5204081632653061, 'Fis': 0.40327868852459026} {'FstWC': 0.7218045112781954, 'Fis': 0.5672727272727273} {'FstWC': 0.5411255411255411, 'Fis': 0.4494382022471909} {'FstWC': 0.6054421768707483, 'Fis': 0.6181818181818182} {'FstWC': 0.4155844155844157, 'Fis': 0.39607843137254906} {'FstWC': 0.7689594356261023, 'Fis': 0.762402088772846} {'FstWC': 0.6802721088435374, 'Fis': 0.5980861244019139} {'FstWC': 0.34833659491193747, 'Fis': 0.378132118451025} {'FstWC': 0.2332361516034986, 'Fis': 0.11363636363636342} The fact that both :math:`F_{IS}` and Weir and Cockerham's :math:`\hat{\theta}` (labelled ``FstWC``) can be computed shows that the two individual and population levels have been properly described. If you want to add a cluster level, or, for example, want to analyse the data using a different structure than the one used for simulations, you need to create your own :class:`!Structure` instance. .. note:: If you mix sampled individuals and sampled chromosomes (non-zero values for any ``num_chrom`` and any ``num_indiv`` items), it will not be possible to process the individual level due to non-constant ploidy. In this case (to process the population level anyway), use the *skip_indiv* option of :meth:`~.coalesce.ParamDict.mk_structure`.