Import/export utilities

egglib.io.to_ms(aln, outfile[, positions, ...])

Export data in the ms program format.

egglib.io.from_fasta(fname, alphabet[, ...])

Import sequences from a fasta file.

egglib.io.from_fasta_string(string, alphabet)

Import sequences from a fasta-formatted string.

egglib.io.fasta_iter(fname, alphabet[, labels])

Iterative sequence-by-sequence fasta parser.

egglib.io.from_genepop(fname)

Import Genepop-formatted genotypic data.

egglib.io.GFF3(fname[, strict])

Import GFF3 genome annotation data.

egglib.io.index_vcf(fname[, outname])

Index a BCF file.

egglib.io.VCF(fname[, index, subset])

VCF/BCF parser using htslib.

egglib.io.BED([fname])

Class holding BED (Browser Extensible Data) data.

egglib.io.from_clustal(string, alphabet)

Import a clustal-formatted alignment.

egglib.io.from_staden(string[, keep_consensus])

Import data from the Staden assembly package.

egglib.io.from_genalys(string)

Import Genalys data.

egglib.io.get_fgenesh(string[, locus])

Import fgenesh data.

egglib.io.GenBank([fname, string])

Process a GenBank-formatted DNA sequence record.

In case the VCF class is not available, the following alternative tools are maintained:

egglib.io.VcfParser(fname[, threshold_PL, ...])

Import Variant Call Format data from a file.

egglib.io.VcfStringParser(header[, ...])

Import Variant Call Format data from a string.

Below are components that do not need to be manipulated directly but are referred to by the documentation of the above items.

egglib.io.Gff3Feature()

Provide access to data of a feature.

egglib.io.GenBankFeature(parent)

Feature of a GenBank record.

egglib.io.GenBankFeatureLocation([string])

Hold the location of a GenBank feature.

Here is the description of the fasta format used in EggLib:

  • Each sequence is preceded by a header limited to a single line and starting by a > character.

  • The header length is not limited and all characters are allowed, but white spaces and special characters are discouraged. The header is terminated by a newline character.

  • Group labels are specified a special markup system placed at the end of the header line. The labels are specified by an at sign (@) followed by any string (@pop1, @pop2, @pop3 and so on). It is allowed to define several group labels for any sequence. In that case, integer values must be enter consecutively after the at sign, separated by commas, as in @cluster1,pop3,indiv2 for a sequence belonging to a given group in three different grouping levels. Multiple grouping levels can be used to specify hierarchical structure, but not only (several independent grouping structures can be specified). The markup @# (at sign and hash sign) specifies an outgroup sequence. The hash sign may be followed by an additional label as in @#, Multiple grouping levels are not allowed for the outgroup.

  • The sequence itself continues on following lines until the next > character or the end of the file. Each allele is represented by a single character.

  • White spaces, tab and carriage returns are allowed at any position. They are ignored unless for terminating the header line. There is no limitation in length and different sequences can have different lengths.

  • Allowed characters and significance of character case is determined but the alphabet specified at time of parsing. For example, the alphabets.DNA supports the characters listed below, and case is ignored (all characters are turned into upper-case).

Ambiguity characters in DNA sequences are processed by several functions. The IUPAC nomenclature is followed, as described below:

Code

Meaning

Complement

A

A

T

C

C

G

G

G

C

T

T

A

M

A or C

K

R

A or G

Y

W

A or T

W

S

C or G

S

Y

C or T

R

K

G or T

M

V

A, C, or G

B

H

A, C, or T

D

D

A, G, or T

H

B

C, G, or T

V

N

G, A, T, or C

N

-

alignment gap

-

?

G, A, T, C, or alignment gap

?

Positions with ? are supposed to be non-characterized, so it is unknown whether they have a valid base or an alignment gap.

egglib.io.to_ms(aln, outfile, positions=None, alphabet=None, as_codes=False, converter=None, spacer='auto', header='//\n')[source]

Export data in the ms program format. The program is described in Hudson (2002 [Bioinformatics 18: 337-338]). The format is as follows:

  • One header line with two slashes (can be extended/replaced).

  • One line with the number of sites

  • One line with the positions, only if the number of sites is larger than zero.

  • The matrix of genotypes (one line per sample), only if the number of sites is larger than zero.

  • One empty line.

ms generates binary data (0/1). This function may export any integer values depending on the data contained in the input object.

Parameters:
  • dataAlign instance.

  • outfile – where to write data. This argument can be (i) a file (or file-like) object open for writing, (ii) the name of a file (which will be created) as a string, or (iii) None (in which case the resulting string will be returned).

  • positions – list of site positions, with length matching the alignment length. Positions are required to be in the [0,1] range but the order is not checked. By default (if the argument value is None), sites are supposed to be evenly spread over the [0,1] interval.

  • alphabet – an Alphabet instance to use instead of the alphabet attached to the input alignment. When using this argument, alleles are transformed using the alphabet provided as argument using a direct mapping of the input alignment alphabet to the provided alphabet.

  • as_codes – use the internal alphabet codes for exporting. Useful if a non-int alphabet is used. Ignored if converter is used. Not allowed if alphabet is used.

  • converter – a user-provided function generating the appropriate allelic value based on an input alphabet code. The simple inline example below shows how to convert any missing data to 0 and shift all valid alleles of one unit: converter=lambda x: 0 if x<0 else x+1.

  • spacer – insert a space between all genotypes (to separate each locus/site). By default, all genotypes are concatenated as in the standard ms format. In this case, it is required that all allelic values (or codes if as_codes is specified) are >=0 and <10. By default ("auto"), the spacer is automatically inserted if the alphabet defines alleles or codes outside the [0, 9] range, or if converter is specified.

  • header – string to print instead of the two slashes line. Must contain the final new line.

Returns:

A ms-formatted string if outfile is None, otherwise None.

egglib.io.from_fasta(fname, alphabet, labels=False, label_marker='@', label_separator=',', cls=None)[source]

Import sequences from a fasta file. Create a new instance of either Align or Container from data read from a fasta-formatted file. To process data from a fasta-formatted string, use io.from_fasta_string().

Parameters:
  • source – name of a fasta-formatted sequence file.

  • alphabet – an Alphabet instance defining the type of data. Only character alphabets are allowed (such as alphabets.DNA, or alphabets.protein).

  • labels – import group labels. If so, they are not actually required to be present for each (or any) sequence. By default tags in sequence names considered to be part of the name and not as labels.

  • label_marker – this option allows to change the character indicating the start of the labels.

  • label_separator – this options allows to change character used to separate labels.

  • cls – type that should be generated. Possible values are: Align (then, data must be aligned), Container, or None. In the latter case, an Align is returned if data are found to be aligned or if the data set is empty, and otherwise a Container is returned.

Returns:

A new Container or Align instance depending on the value of the cls option.

egglib.io.from_fasta_string(string, alphabet, labels=False, label_marker='@', label_separator=',', cls=None)[source]

Import sequences from a fasta-formatted string. Identical to io.from_fasta() but directly takes an fasta-formatted string as first argument.

class egglib.io.fasta_iter(fname, alphabet, labels=False)[source]

Iterative sequence-by-sequence fasta parser.

Parameters:
  • fname – name of a fasta-formatted file.

  • alphabet – an Alphabet instance defining the type of data. Only character alphabets are allowed (such as alphabets.DNA and alphabets.protein).

  • labels – import group labels from sequence names (by default, they are considered as part of the name).

This function can be used in an iteration as shown below:

>>> for item in egglib.io.fasta_iter(fname):
...     ...
The with statement is also supported, which

ensures that the input file is properly closed whenever the with statement completes:

>>> with egglib.io.fasta_iter(fname) as f:
...     for item in f:
...         ...

Each iteration yields a SampleView instance (which is valid only during the iteration round, see the warning below).

Warning

The aim of this iterator is to iterate over large fasta files without actually storing all data in memory at the same time. The SampleView instance provided at each iteration is a proxy to a local Container instance that is recycled at each iteration step. The iteration variable should be used immediately and never stored as this. If one wants to sequence data, they should copy them immediately (typically using the add_sample() method of a separate Container instance).

egglib.io.from_genepop(fname)[source]

Import Genepop-formatted genotypic data. The format is described here.

Parameters:

fname – Genepop-formatted file name.

Returns:

A new Align instance.

The returned object contains data mapped to an ad hoc alphabet, with two samples per individuals. Group labels are used to indicate the structure (first level: populations, second level: individuals). In addition to normal Align instance, the returned object has two attributes: title and loci, which contain the information read from the Genepop file.

class egglib.io.GFF3(fname, strict=True)[source]

Import GFF3 genome annotation data. Read General Feature Format (GFF)-formatted (version 3) genome annotation data from a file specified by name or from a provided string.

See the description of the GFF3 format at http://www.sequenceontology.org/gff3.shtml.

All features are loaded into memory and can be processed interactively.

Parameters:
  • source – name of a GFF3-formatted file. To pass a string, use the factory method from_string().

  • strict – by default, apply strictly GFF3 requirements. See below.

The strict argument can be set to False to support CDS features lacking a phase.

Some observations regarding this implementation:

  • The validity of parent-child relationship with respect to the Sequence Ontology is not enforced, meaning that this parser allows any type of feature to be parent to any feature of any type.

  • The Sequence Ontology accession number of types of features related to genes are automatically mapped to the type name (such as “SO:0000704”, which is translated as “gene”).

  • Discontinuous features are required to have matching attributes (except start, stop, and phase for CDS features) but subsequent segments are allowed to skip attributes if the first segment defines them.

  • fasta-formatted sequences are imported but are currently not checked for consistency with annotations and defined sequence regions.

io.GFF3 instances are iterable. The expression:

>>> for feat in GFF3:
...     ...

is equivalent to:

>>> for feat in GFF3.iter_features():
...     ...

Attributes

attribute_ontology

Content of all attribute-ontology directives.

feature_ontology

Content of all feature-ontology directives.

genome_build

Genome build if specified by a directive.

list_seqid

List of seqid.

num_seqid

Number of seqid of features present in instance.

num_top_features

Number of features without parents.

num_tot_features

Total number of features.

regions

Dictionary with all sequence regions defined by directives.

sequences

If specified, sequences as a Container.

source_ontology

Content of all source-ontology directives.

species

Species (if specified by a directive).

version

GFF version.

Methods

from_string(string[, strict])

Import data from a GFF3-formatted string.

iter_features([seqid, start, end, all_features])

Iterate over features.

property attribute_ontology

Content of all attribute-ontology directives.

property feature_ontology

Content of all feature-ontology directives.

classmethod from_string(string, strict=True)[source]

Import data from a GFF3-formatted string. This is a class method, which can be called as shown below to create a new object:

>>> gff3 = egglib.io.GFF3.from_string(gff3_string)
Parameters:
  • string – GFF3-formatted string.

  • strict – apply strict GFF3 specifications (see here).

Returns:

A new io.GFF3 instance.

property genome_build

Genome build if specified by a directive.

iter_features(seqid=None, start=None, end=None, all_features=False)[source]

Iterate over features.

Parameters:
  • seqid – only iterates overs features of this contig (by default, consider all contigs). If the specified seqid is not present in the GFF3 file, iteration is empty (without error).

  • start – start iteration at this position. By default, start with the first position.

  • end – stop iteration at this position (does not include features whose end position is larger than this value). By default, process all features.

  • all_features – whether iteration should also include features that are part of another (by default, only include features that don’t have a parent). A given feature may be repeated if it has several unconnected parents.

Note

If start and/or end are specified, seqid is required.

property list_seqid

List of seqid.

property num_seqid

Number of seqid of features present in instance.

property num_top_features

Number of features without parents.

property num_tot_features

Total number of features.

property regions

Dictionary with all sequence regions defined by directives.

property sequences

If specified, sequences as a Container. By default, this value is None.

property source_ontology

Content of all source-ontology directives.

property species

Species (if specified by a directive).

property version

GFF version.

class egglib.io.Gff3Feature[source]

Provide access to data of a feature. This class cannot be instanciated by the user.

Attributes

ID

Feature identifier.

all_parts

All parts of this feature.

attributes

Dictionary of attributes attached to the feature.

derivers

List of features deriving from this one.

descendants

List of features descending from this one.

end

Feature end position.

parents

List of features parent to this one.

score

Feature score.

segments

List of segments of the feature.

seqid

Seqid on which this feature is located.

source

Feature source.

start

Feature start position.

strand

Strand on which the feature is located.

type

Feature type.

property ID

Feature identifier.

property all_parts

All parts of this feature.

property attributes

Dictionary of attributes attached to the feature. The list of attributes is:

  • ID

  • Name

  • Alias

  • Parent

  • Target

  • Gap

  • Derives_from

  • Note

  • Dbxref

  • Ontology_term

  • Is_circular

  • Other attributes as defined in the GFF3 file.

property derivers

List of features deriving from this one.

property descendants

List of features descending from this one.

property end

Feature end position.

property parents

List of features parent to this one.

property score

Feature score.

property segments

List of segments of the feature.

property seqid

Seqid on which this feature is located.

property source

Feature source.

property start

Feature start position.

property strand

Strand on which the feature is located.

property type

Feature type.

egglib.io.index_vcf(fname[, outname])

Index a BCF file. The file is required to be in format BCF. If outname is not specified, use the standard naming scheme for CSI index files.

class egglib.io.VCF(fname, index=None, subset=None)

VCF/BCF parser using htslib.

Parameters:
  • fname – input VCF/BCF file name. Gzip-compressed files are supported.

  • index – index file name. By default, try to import index with canonical file name. If index is specified, it must be present. Otherwise missing index file is not an error. Index is only imported for BCF files.

  • subset – sequence of sample names to import. The order of samples in this sequence is not considered. Duplicated names in this sequence are ignored. Other samples are ignored. This is useful to speed up parsing.

Attributes

has_index

Boolean indicating whether an index is available.

num_samples

Number of samples.

Methods

get_alleles

list of alleles.

get_alternate

list of alternate alleles.

get_chrom

Chromosome or contig name.

get_errors

Errors while reading last variant.

get_filter

list of filters.

get_format(tag, index)

Get a given FORMAT field.

get_formats

FORMAT fields for all samples.

get_genotypes

Get genotypes.

get_id

Get list of identifiers for the current variant.

get_info(tag)

Get a given INFO field.

get_infos

dict of INFO fields.

get_phased

Get booleans indicating if genotypes are phased.

get_pos

Chromosome position.

get_quality

Quality value.

get_reference

Reference allele.

get_sample(index)

Get the name of the sample at index index.

get_samples

list of all samples.

get_types

Get the type(s) of the last variant.

goto(target[, position, [limit]])

Move to a given location in the file.

is_snp

Check if last variant is a SNP.

read

Read one variant of the VCF file.

get_alleles()

list of alleles. Return None by default (no available data).

get_alternate()

list of alternate alleles. Return None by default (no available data).

get_chrom()

Chromosome or contig name. Return None by default (no available data).

get_errors()

Errors while reading last variant. Get the non-fatal errors generated while importing last variant, as a list, or None if nothing has been read.

get_filter()

list of filters. Return None by default (no available data).

get_format(tag, index)

Get a given FORMAT field. Return None by default (no available data for key not available).

get_formats()

FORMAT fields for all samples. Return a :list: of :dict: instances. Return None by default (no available data).

get_genotypes()

Get genotypes. Return a list giving, for each sample, the list of alleles composing its genotype. Return None by default (no data available).

get_id()

Get list of identifiers for the current variant. Empty list if none provided. None if nothing has been read. The uniqueness of ID’s is not tested.

get_info(tag)

Get a given INFO field. Return None by default (no available data for key not available).

get_infos()

dict of INFO fields. Return None by default (no available data).

get_phased()

Get booleans indicating if genotypes are phased. The return value is the tuple: (all_phased, phased_table), with a boolean for all samples and all alleles beyond the first. None whatever bad happens.

get_pos()

Chromosome position. Return None by default (no available data).

get_quality()

Quality value. Return None by default (no available data or missing value).

get_reference()

Reference allele. Return None by default (no available data).

get_sample(index)

Get the name of the sample at index index.

get_samples()

list of all samples.

get_types()

Get the type(s) of the last variant. Return a list. Return None by default (no available data).

goto(target[, position[, limit]])

Move to a given location in the file. Data at the new location are available immediately with no need to call read() (this method should be understood as a call to read() at an arbitrary location. If position is not specified, move to the first available position of contig target. If target does not exist in file or if position is specified, but is not available within the region specified by limit, a ValueError is thrown. ValueError is also thrown in case of unexpected errors during processing. By default, limit is equal to position + 1 (meaning that only the exact position can be retrieved). If this condition is not met (in particular if position is past the end of the contig target, return False.

Note

Only available for indexed BCF.

has_index

Boolean indicating whether an index is available.

is_snp()

Check if last variant is a SNP. True if the last variant is of type SNP, and SNP only.

num_samples

Number of samples.

read()

Read one variant of the VCF file. Return True if read is successful, False if end of file. ValueError in case of critical error.

egglib.io.from_clustal(string, alphabet)[source]

Import a clustal-formatted alignment. The input format is the one generated and used by ClustalW.

Parameters:
  • string – clustal-formatted sequence alignment.

  • alphabet – an Alphabet instance.

Returns:

A new Align instance.

egglib.io.from_staden(string, keep_consensus=False)[source]

Import data from the Staden assembly package. Process the output file of the GAP4 program of the Staden package.

Parameters:
  • string – input string.

  • keep_consensus – don’t delete consensus sequence.

Returns:

An Align instance.

The input string should have been generated from a contig alignment by the GAP4 contig editor, using the command “dump contig to file”. The sequence named CONSENSUS, if present, is automatically removed unless the option keep_consensus is used.

Staden’s default convention is followed:

  • - codes for an unknown base and is replaced by N.

  • * codes for an alignment gap and is replaced by -.

  • . represents the same sequence than the consensus at that position.

  • White space represents missing data and is replaced by ?.

egglib.io.from_genalys(string)[source]

Import Genalys data. Genalys was a proprietary program to process sequencing reads, perform assembly and detect polymorphism. Convert Genalys-formatted sequence alignment files to fasta. This function imports files generated through the option “Save SNPs” of Genalys 2.8.

Parameters:

string – input data as a Genalys-formatted string.

Returns:

An Align instance.

egglib.io.get_fgenesh(string, locus='locus')[source]

Import fgenesh data.

Parameters:

fname – a string containing fgenesh ouput.

Parma locus:

locus name.

Returns:

A list of gene and CDS features represented by dictionaries. Note that 5’ partial features might not be in the appropriate frame and that it can be necessary to add a codon_start qualifier.

class egglib.io.GenBank(fname=None, string=None)[source]

Process a GenBank-formatted DNA sequence record.

Parameters:
  • fname – input file name.

  • string – GenBank-formatted string.

Only one of the two arguments fname and string can be non-None. If both are None, the constructor generates an empty instance with a sequence of length 0. If fname is non-None, a GenBank record is read from the file with this name. If string is non-None, a GenBank record is read from this string. The following variables are read from the parsed input if present: accession, definition, title, version, GI, keywords, source, references (which is a list), locus and others. Their default value is None except for references and others for which default is an empty list. source is a (description, species, taxonomy) tuple. Each of references is a (header, raw reference) tuple and each of others is a (key, raw) tuple.

In addition to methods documented below, the following operations are supported for gb if it is a GenBank instance:

Operation

Result

len(gb)

length of the sequence attached to this record

for feat in gb:

Iterate over GenBankFeature instances of this record

str(gb)

GenBank representation of the record

Attributes

sequence

Sequence string.

Methods

add_feature(feature)

Add a feature to the instance.

extract(from_pos, to_pos)

Extract a subset of the instance.

format()

String representation of the instance.

number_of_features()

Number of features contained in the instance.

rc()

Reverse-complement the instance (in place).

write(fname)

Write the formatted record to a file.

write_stream(stream)

From the formatted record to a stream.

add_feature(feature)[source]

Add a feature to the instance.

Parameters:

feature – an io.GenBankFeature instance.

extract(from_pos, to_pos)[source]

Extract a subset of the instance.

Parameters:
  • from_pos – start position.

  • to_pos – stop position (not included).

Returns:

A new io.GenBank instance representing a subset of the current instance. All features that are completely included in the specified range are exported.

format()[source]

String representation of the instance.

number_of_features()[source]

Number of features contained in the instance.

rc()[source]

Reverse-complement the instance (in place). All features positions and the sequence will be reverted and applied to the complementary strand. The features will be sorted in increasing start position (after reverting). This method should be applied only on genuine nucleotide sequences.

property sequence

Sequence string. Note that changing the record’s string might invalidate the features (meaning that the setting an invalid sequence might cause the features to point to incorrect or out-of-bounds regions of the sequence).

write(fname)[source]

Write the formatted record to a file.

write_stream(stream)[source]

From the formatted record to a stream.

Parameters:

stream – a file-compatible object.

class egglib.io.GenBankFeature(parent)[source]

Feature of a GenBank record. io.GenBankFeature instances must be only used along an io.GenBank instance. The constructor creates an empty instance (although a parent io.GenBank instance is required) and either update() or parse() must be used subsequently.

Parameters:

parent – an io.GenBank instance to which the feature should be attached.

Note that str(feature) is equivalent to feature.format().

Methods

add_qualifier(key, value)

Add a qualifier.

copy(genbank)

Return a copy of the current instance.

format()

GenBank-formatted string representing the feature.

get_sequence()

Return the string corresponding to this feature.

get_start()

First position of the first segment.

get_stop()

Stop position of the last segment.

get_type()

Type of the instance.

parse(string)

Update feature information from a string.

qualifiers()

Dictionary with all qualifier values.

rc([length])

Reverse-complement the feature.

shift(offset)

Shift all positions.

update(feat_type, location, **qualifiers)

Update feature information.

add_qualifier(key, value)[source]

Add a qualifier.

copy(genbank)[source]

Return a copy of the current instance.

Parameters:

genbankGenBank instance to which the returned instance should be attached.

format()[source]

GenBank-formatted string representing the feature.

get_sequence()[source]

Return the string corresponding to this feature. If the positions pass beyond the end of the parent’s sequence, a RuntimeError (and not an IndexError) is raised.

get_start()[source]

First position of the first segment.

get_stop()[source]

Stop position of the last segment.

get_type()[source]

Type of the instance.

parse(string)[source]

Update feature information from a string.

Parameters:

string – a GenBank-formatted string.

qualifiers()[source]

Dictionary with all qualifier values. This method cannot be used to change data within the instance.

rc(length=None)[source]

Reverse-complement the feature.

Parameters:

length – length of the complete sequence (by default, take the information directly from the parent).

shift(offset)[source]

Shift all positions. The argument can be positive or negative.

update(feat_type, location, **qualifiers)[source]

Update feature information.

Parameters:
  • feat_type – a string identifying the feature type (such as "gene", "CDS", "misc_feature", etc.). All strings are acceppted.

  • location – an io.GenBankFeatureLocation instance giving the feature’s location.

  • qualifiers – other qualifiers must be passed as keyword arguments. It is not allowed to use "type" as a qualifier keyword.

class egglib.io.GenBankFeatureLocation(string=None)[source]

Hold the location of a GenBank feature. Supports various forms of location specifications as allowed in the GenBank format. The constructor allows to parse a GenBank-formatted string. By default, features are on the forward strand and segmented features are ranges (not orders).

In addition to methods documented below, the following operations are supported for loc if it is a GenBankFeatureLocation instance:

Operation

Result

len(loc)

Number of segments

loc[index]

Return the (first, last) tuple

for (first, last) in loc:

Iterator over segments

params.format()

Generate a GenBank representation

Methods

add_base_choice(first, last[, left_partial, ...])

Segment corresponding to a single base in a given range.

add_base_range(first, last[, left_partial, ...])

Add a base range to the feature.

add_between_base(position)

Add a segment lying between two consecutive bases.

add_single_base(position)

Add a single-base segment to the feature.

as_order()

Define the feature as an order instead of a range.

as_range()

Define the feature as a range.

copy()

Deep copy of the current instance.

format()

String representation of the instance.

is_complement()

True if the feature is on the complement strand.

is_range()

True if the feature is a range.

rc(length)

Reverse the feature positions.

set_complement()

Place the feature on the complement strand.

set_forward()

Place the feature on the forward strand.

shift(offset)

Shift all positions.

add_base_choice(first, last, left_partial=False, right_partial=False)[source]

Segment corresponding to a single base in a given range. Arguments are identical to add_base_range().

add_base_range(first, last, left_partial=False, right_partial=False)[source]

Add a base range to the feature.

Parameters:
  • first – first position of the range.

  • last – last position of the range (included).

  • first_partial – specify that the real start of the segment is somewhere 5’ of first.

  • first_partial – specify that the real end of the segment is somewhere 3’ of last.

If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1482, one must use add_base_range(1127,1482) in combination with set_complement().

add_between_base(position)[source]

Add a segment lying between two consecutive bases. The feature will be set between position and position + 1. If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1128, one must use add_between_base(1127) in combination with set_complement().

add_single_base(position)[source]

Add a single-base segment to the feature.

as_order()[source]

Define the feature as an order instead of a range.

as_range()[source]

Define the feature as a range. This is the default.

copy()[source]

Deep copy of the current instance.

format()[source]

String representation of the instance.

is_complement()[source]

True if the feature is on the complement strand.

is_range()[source]

True if the feature is a range. False if it is an order.

rc(length)[source]

Reverse the feature positions. Positions are modified to be counted from the end.

Parameters:

length – length of the complete sequence (required).

set_complement()[source]

Place the feature on the complement strand.

set_forward()[source]

Place the feature on the forward strand. This is the default.

shift(offset)[source]

Shift all positions. The argument can be positive or negative.

Alternative VCF parser

class egglib.io.VcfParser(fname, threshold_PL=None, threshold_GL=None)[source]

Import Variant Call Format data from a file. The VCF format is designed to store data describing genomic variation in an extremely flexible way. See the description of the VCF format. To parse VCF data stored in string, use io.VcfStringParser.

Parameters:
  • fname – name of a properly formatted VCF file. The header section will be processed upon instance creation, and lines will be read later, when the user iterates over the instance (or call readline()).

  • threshold_PL – call genotypes parameter. This parameter controls how genotype calling (GT field) is performed from the PL (phred-scaled genotype likelihood) field. By default (None), this conversion is never performed. If threshold_PL is specified, genotype calling is only performed if GT is not available and PL is available. The genotype with the lowest PL value is selected. The parameter gives the minimum acceptable gap between the best genotype and the second one. If the second genotype has a too good score, based on this parameter, the genotype is called unknown. The parameter must be at least 1.

io.VcfParser instances are iterable. Everly loop yields a (chromosome, position, num_all) tuple that allows the user to determines if the variant is of interest. Note that the position is considered as an index and therefore has been decremented compared with the value found in the file. If the variant is of interest, io.VcfParser instances provide methods to extract all data for this variant. Iterating over VCF lines can be performed manually using the method readline().

Attributes

currline

Index of the current line of the VCF file.

file_format

File format specified in the header.

good

Tell if the file is good for reading.

has_index

True if an index file has been loaded.

num_alt

Number of ALT definitions in the header.

num_filter

Number of FILTER definitions in the header.

num_format

Number of FORMAT definitions in the header.

num_index

Number of indexed lines if the file index.

num_info

Number of INFO definitions in the header.

num_meta

Number of META definitions in the header.

num_samples

Number of samples defined in the header.

Methods

bed_slider(bed[, max_missing, allow_indel, ...])

Perform a sliding window based on BED coordinates.

close()

Close file (if it is open)

get_alt(idx)

Get an ALT definition from the header.

get_filter(idx)

Get a FILTER definition from the header.

get_format(idx)

Get a FORMAT definition from the header.

get_genotypes([start, stop, dest])

Extract genotype data for the current site.

get_info(idx)

Get an INFO definition from the header.

get_meta(idx)

Get data for a given META field defined in the VCF header.

get_sample(idx)

Get the name of a sample read from the header.

get_variant()

Get all data for the current variant.

goto(self, contig[, position])

Move to an arbitrary position of the VCF file.

load_index([fname])

Load an index file allowing fast navigation.

readline()

Read one variant.

rewind()

Reset the parser.

slider(size, step[, as_variants, start, ...])

Start a sliding window from the current position.

unread()

Go back to the previous variant.

bed_slider(bed, max_missing=0, allow_indel=False, allow_custom=False)[source]

Perform a sliding window based on BED coordinates.

Parameters:
  • bed – an io.BED instance.

  • max_missing

    maximum number of missing alleles.

    Warning

    Here, max_missing must be an absolute number of samples.

  • allow_indel – include variants with alleles of varying size.

  • allow_custom – include variants with custom alleles.

Returns:

An io.VcfWindow instance.

Changed in version 3.2.0: max_missing is required to be an integer.

close()[source]

Close file (if it is open)

property currline

Index of the current line of the VCF file.

property file_format

File format specified in the header.

get_alt(idx)

Get an ALT definition from the header. The index must be smaller than num_alt. Return a dictionary containing the following data:

  • "id": identifier string.

  • "description": description string.

  • "extra"s: all extra qualifiers, presented as list of (key, value) tuple instances.

get_filter(idx)

Get a FILTER definition from the header. The index must be smaller than num_filter. Return a dictionary containing the following data:

  • "id": identifier string.

  • "description": description string.

  • "extra": all extra qualifiers, presented as a list of (key, value) tuple instances.

get_format(idx)

Get a FORMAT definition from the header. The index must be smaller than num_format. Return a dictionary containing the following data:

  • "id": identifier string.

  • "type": one of "Integer", "Float", "Character", and "String".

  • "description": description string.

  • "number": expected number of items. Special values are None (if undefined), "NUM_GENOTYPES" (number matching the number of genotypes for any particular variant), "NUM_ALTERNATE" (number matching the number of alternate alleles for any particular variant), and "NUM_ALLELES" (number matching the number of alleles–including the reference–for any particular variant).

  • "extra": all extra qualifiers, presented as a list of (key, value) tuple instances.

get_genotypes(start=0, stop=None, dest=None)

Extract genotype data for the current site.

It is required that a variant has been effectively processed.

Parameters:
  • start – index of the first sample to process (at least 0 and smaller than the number of samples).

  • stop – sample index at which processing must be stopped (this sample is not processed; at least equal to start and smaller than the number of samples).

  • dest – a Site instance that will be recycled and used to place results.

Returns:

A Site instance by default, or None if dest was specified.

get_info(idx)

Get an INFO definition from the header. The index must be smaller than num_info. Return a dictionary containing the following data:

  • "id": identifier string.

  • "type": one of "Integer", "Float", "Flag", "Character", and "String".

  • "description": description string.

  • "number": expected number of items. Special values are None (if undefined), "NUM_GENOTYPES" (number matching the number of genotypes for any particular variant), "NUM_ALTERNATE" (number matching the number of alternate alleles for any particular variant), and "NUM_ALLELES" (number matching the number of alleles–including the reference–for any particular variant).

  • "extra": all extra qualifiers, presented as a list of (key, value) tuple instances.

get_meta(idx)

Get data for a given META field defined in the VCF header. The index must be smaller than num_meta. Return a tuple containing the key and the value of the META field.

get_sample(idx)

Get the name of a sample read from the header. The index must be smaller than num_samples.

get_variant()

Get all data for the current variant. Return an io.VcfVariant instance containing all data available for the last variant processed by this instance. It is required that a variant has been effectively processed.

property good

Tell if the file is good for reading. False is the end of the file has been reached.

goto(self, contig, position=egglib.io.FIRST)[source]

Move to an arbitrary position of the VCF file. Requires an index (see load_index()). A ValueError is raised if the position cannot be found.

Parameters:
  • contig – contig name.

  • position – contig position. Use io.FIRST for the first available variant of the contig, io.LAST for the last, and -1 for the position immediately before the first position (position 0 in VCF files).

property has_index

True if an index file has been loaded.

load_index(fname=None)[source]

Load an index file allowing fast navigation. Index files allow fast navigation in VCF file regardless of their size, allowing to move instantly to a determined variant using its position or its index in file. Index files can be created by io.make_vcf_index().

Parameters:

fname – name of the index file. By default, use the default index file name, which is the name of the VCF file with the “.vcfi” extension (removing the original extension only if it is “.vcf”).

property num_alt

Number of ALT definitions in the header.

property num_filter

Number of FILTER definitions in the header.

property num_format

Number of FORMAT definitions in the header.

property num_index

Number of indexed lines if the file index.

property num_info

Number of INFO definitions in the header.

property num_meta

Number of META definitions in the header.

property num_samples

Number of samples defined in the header.

readline()[source]

Read one variant. Return a tuple with the chromosome name, the position, and the number of alleles at the variant (as indicated in the VCF file). Raise a ValueError if file is finished.

rewind()[source]

Reset the parser. Move back to the first variant. No index is required.

slider(size, step, as_variants=False, start=0, stop=None, max_missing=0, allow_indel=False, allow_custom=False)[source]

Start a sliding window from the current position.

Parameters:
  • size – size of the sliding window (by default, in base pairs).

  • step – increment of the sliding window (by default, in base pairs).

  • as_variants – express size and step in number of variants instead of base pairs.

  • start – start position of the sliding window.

  • stop – stop position of the sliding window.

  • max_missing

    maximum number of missing alleles (variants exceeding this threshold are ignored).

    Warning

    Here, max_missing must be an absolute number of samples.

  • allow_indel – include variants with alleles of variable size

  • allow_custom – include variants with custom alleles

Returns:

An io.VcfSlidingWindow instance.

Changed in version 3.2.0: max_missing is required to be an integer.

unread()[source]

Go back to the previous variant. No index is required, but only allowed after reading one line (not allowed immediately after creating instance or calling rewind()).

egglib.io.make_vcf_index(fname, outname=None)[source]

Create the index for a VCF file.

Parameters:
  • fname – name of a VCF file.

  • outname – name of the index file. By default, use a default name. The default name is based on the VCF file name, stripping the extension only if it is vcf and appending the vcfi extension. Index files bearing this default name are automatically loaded if the corresponding VCF file is opened.

class egglib.io.VcfStringParser(header, threshold_PL=None, threshold_GL=None)[source]

Import Variant Call Format data from a string. Alias of io.VcfParser processing strings instead of a file. The constructor takes a string containing a VCF header (the first line being the file format specification and the last line being the header line ,starting with #CHROM).

Attributes

file_format

File format specified in the header.

num_alt

Number of ALT definitions in the header.

num_filter

Number of FILTER definitions in the header.

num_format

Number of FORMAT definitions in the header.

num_info

Number of INFO definitions in the header.

num_meta

Number of META definitions in the header.

num_samples

Number of samples defined in the header.

Methods

get_alt(idx)

Get an ALT definition from the header.

get_filter(idx)

Get a FILTER definition from the header.

get_format(idx)

Get a FORMAT definition from the header.

get_genotypes([start, stop, dest])

Extract genotype data for the current site.

get_info(idx)

Get an INFO definition from the header.

get_meta(idx)

Get data for a given META field defined in the VCF header.

get_sample(idx)

Get the name of a sample read from the header.

get_variant()

Get all data for the current variant.

readline(string)

Read one variant from a user-provided single line.

property file_format

File format specified in the header.

get_alt(idx)

Get an ALT definition from the header. The index must be smaller than num_alt. Return a dictionary containing the following data:

  • "id": identifier string.

  • "description": description string.

  • "extra"s: all extra qualifiers, presented as list of (key, value) tuple instances.

get_filter(idx)

Get a FILTER definition from the header. The index must be smaller than num_filter. Return a dictionary containing the following data:

  • "id": identifier string.

  • "description": description string.

  • "extra": all extra qualifiers, presented as a list of (key, value) tuple instances.

get_format(idx)

Get a FORMAT definition from the header. The index must be smaller than num_format. Return a dictionary containing the following data:

  • "id": identifier string.

  • "type": one of "Integer", "Float", "Character", and "String".

  • "description": description string.

  • "number": expected number of items. Special values are None (if undefined), "NUM_GENOTYPES" (number matching the number of genotypes for any particular variant), "NUM_ALTERNATE" (number matching the number of alternate alleles for any particular variant), and "NUM_ALLELES" (number matching the number of alleles–including the reference–for any particular variant).

  • "extra": all extra qualifiers, presented as a list of (key, value) tuple instances.

get_genotypes(start=0, stop=None, dest=None)

Extract genotype data for the current site.

It is required that a variant has been effectively processed.

Parameters:
  • start – index of the first sample to process (at least 0 and smaller than the number of samples).

  • stop – sample index at which processing must be stopped (this sample is not processed; at least equal to start and smaller than the number of samples).

  • dest – a Site instance that will be recycled and used to place results.

Returns:

A Site instance by default, or None if dest was specified.

get_info(idx)

Get an INFO definition from the header. The index must be smaller than num_info. Return a dictionary containing the following data:

  • "id": identifier string.

  • "type": one of "Integer", "Float", "Flag", "Character", and "String".

  • "description": description string.

  • "number": expected number of items. Special values are None (if undefined), "NUM_GENOTYPES" (number matching the number of genotypes for any particular variant), "NUM_ALTERNATE" (number matching the number of alternate alleles for any particular variant), and "NUM_ALLELES" (number matching the number of alleles–including the reference–for any particular variant).

  • "extra": all extra qualifiers, presented as a list of (key, value) tuple instances.

get_meta(idx)

Get data for a given META field defined in the VCF header. The index must be smaller than num_meta. Return a tuple containing the key and the value of the META field.

get_sample(idx)

Get the name of a sample read from the header. The index must be smaller than num_samples.

get_variant()

Get all data for the current variant. Return an io.VcfVariant instance containing all data available for the last variant processed by this instance. It is required that a variant has been effectively processed.

property num_alt

Number of ALT definitions in the header.

property num_filter

Number of FILTER definitions in the header.

property num_format

Number of FORMAT definitions in the header.

property num_info

Number of INFO definitions in the header.

property num_meta

Number of META definitions in the header.

property num_samples

Number of samples defined in the header.

readline(string)[source]

Read one variant from a user-provided single line. The string should contain a single line of VCF-formatted data (no header). All field specifications and sample information should be consistent with the information contained in the header that has been provided when creating this instance.

Returns:

A tuple with the chromosome name, the position, and the number of alleles at the variant (as indicated in the VCF line).

class egglib.io.VcfVariant[source]

Represent a single VCF variant. The user cannot create instances of this class himself (instances are generated by io.VcfParser or io.VcfStringParser).

Note

The AA (ancestral allele), AN (allele number), AC (allele count), and AF (allele frequency) INFO fields as well as the GT (deduced genotype) FORMAT are automatically extracted if they are present in the the file and if their definition matches the format specification (meaning that they were not re-defined with different number/type) in the header. If present, they are available through the dedicated attributes AN, AA, AC, AF, GT, ploidy and GT_phased. However, they are also available in the respective info and samples (sub)-dictionaries.

Attributes

AA

Value of the AA info field.

AC

Value of the AC info field.

AF

Value of the AF info field.

AN

Value of the AN info field.

GL

GL values for all samples

GT

Genotypes from GT fields.

GT_phased

Tell if the genotype for each sample is phased.

GT_vcf

GT field as written in the VCF file

ID

Tuple containing all IDs.

PL

PL values for all samples

alleles

Tuple of variant alleles.

alternate_types

Alternate alleles types, as a tuple. One value is provided for each alternate allele. The provided values are integers whose values should always be compared to class attributes alt_type_default, alt_type_referred and alt_type_breakend, as in (for the type of the first alternate allele)::.

chromosome

Chromosome name.

failed_tests

Filters at which this variant failed.

format_fields

Frozenset of FORMAT fields ID's.

info

Dictionary of INFO fields for this variant.

num_alleles

Number of alleles.

num_alternate

Number of alternate alleles.

num_genotypes

Number of genotypes.

num_samples

Number of samples.

ploidy

Ploidy among genotypes.

position

Position.

quality

Variant quality.

samples

Information for each sample.

alt_type_default

Explicit alternate allele (the string represents the nucleotide sequence of the allele).

alt_type_referred

Alternate allele referring to a pre-defined allele (the string provides the ID of the allele).

alt_type_breakend

Alternate allele symbolizing a breakend (see VCF description for more details).

property AA

Value of the AA info field. None if missing.

property AC

Value of the AC info field. Provided as a tuple. None if missing.

property AF

Value of the AF info field. Provided as a tuple. None if missing.

property AN

Value of the AN info field. None if missing.

property GL

GL values for all samples

property GT

Genotypes from GT fields. Only if this format field is available. Provided as a tuple of sub-tuples. The number of sub-tuples is equal to the number of samples (num_samples). The number of items within each sub-tuples is equal to the ploidy (ploidy). These items are allele expression (as found in alleles), or None (for missing values). This attribute is None if GT is not available.

property GT_phased

Tell if the genotype for each sample is phased. None if GT is not available.

property GT_vcf

GT field as written in the VCF file

property ID

Tuple containing all IDs.

property PL

PL values for all samples

property alleles

Tuple of variant alleles. The first is the reference, which is not guaranteed to be present in samples.

property alternate_types

Alternate alleles types, as a tuple. One value is provided for each alternate allele. The provided values are integers whose values should always be compared to class attributes alt_type_default, alt_type_referred and alt_type_breakend, as in (for the type of the first alternate allele):

>>> type_ = variant.alternate_types[0]
>>> if type_ == variant.alt_type_default:
...     allele = variant.allele(0)
property chromosome

Chromosome name. None if missing.

property failed_tests

Filters at which this variant failed. Provided as a tuple or names, or None if no filters have been applied.

property format_fields

Frozenset of FORMAT fields ID’s. The frozenset is empty if no sample data are available.

property info

Dictionary of INFO fields for this variant. Keys are ID of INFO fields available for this variant, and values are always a tuple of items, even if there is only one item. For flag INFO types, the value is always an empty tuple.

property num_alleles

Number of alleles. The reference is always included.

property num_alternate

Number of alternate alleles. Equal to num_alleles minus 1.

property num_genotypes

Number of genotypes.

property num_samples

Number of samples. Equivalent to len(variant.samples).

property ploidy

Ploidy among genotypes. Always 2 if GT is not available.

property position

Position. Given as an index; first value is 0. None if missing.

property quality

Variant quality. None if missing.

property samples

Information for each sample. Empty list if no samples are defined. Otherwise, the list contains one dictionary for each sample: keys of these dictionaries are FORMAT fields identifiers (the keys are always the same as the content of format_fields), and their values are tuple instances in all cases.

class egglib.io.VcfSlidingWindow[source]

This class manages a sliding window on a VCF file. This class cannot be instanciated directly: instances are returned by the methods slider() and bed_slider() of io.VcfParser instances.

io.VcfSlidingWindow instances are iterable: iteration steps return a common io.VcfWindow instance which is updated at each iteration round.

Attributes

good

True if there is still data to be processed.

property good

True if there is still data to be processed.

class egglib.io.VcfWindow[source]

Provide access to sites of a window from a VCF file. The following operations are supported by io.VcfWindow instances:

Operation

Result

len(win)

number of sites in the window

win[i]

get a site as a Site instance

for site in win:

iterate over sites as Site instances

Attributes

bounds

Bounds of the window.

chromosome

Name of the chromosome or contig.

num_sites

Number of sites in window.

property bounds

Bounds of the window. The second bound is not included in the window.

property chromosome

Name of the chromosome or contig.

property num_sites

Number of sites in window.

egglib.io.FIRST

First available position of a contig.

egglib.io.LAST

Last available position of a contig.

class egglib.io.BED(fname=None)[source]

Class holding BED (Browser Extensible Data) data.

Parameters:

fname – name of the BED-formatted input file. By default, create an empty instance.

BED is a flexible format representing genomic annotation. In the current implementation, only the chromosome/scaffold, start, and end positions (the first three fields of each lines, which are the only ones to be required) are processed. If incorrectly formed fields are provided besides the first three fields, the behaviour is undefined (that is, it is not guaranteed that are exception is raised).

Supported operations:

Operation

Result

Notes

len(bed)

Number of annotation items

bed[i]

Get an annotation

for i in bed:

Iterate over annotations

Notes:

  1. Annotation items are represented by dictionaries containing the following keys: "chrom", "start", and "end".

Methods

append(chrom, start, end)

Add an item at the end of the BED data.

extend(values)

Append items from an iterable.

append(chrom, start, end)[source]

Add an item at the end of the BED data.

extend(values)[source]

Append items from an iterable. All items of values must be a sequence containing a value for chrom, start, and end (in that order).