The easiest way to work with annotation is to use GenomicArray with typecode=='O' or GenomicArrayOfSets. If you have your annotation in a flat file, with each line describing a feature and giving its coordinates, you can read in the file line for line, parse it (see the standard Python module csv), use the information on chromosome, start, end and strand to create a GenomicInterval object and then store the data from the line in the genomic array at the place indicated by the genomic interval.
For example, if you have data in a tab-separated file as follows:
>>> for line in open( "feature_list.txt" ): ... print line, chr2 100 300 + "gene A" chr2 200 400 - "gene B" chr3 150 270 + "gene C"
Then, you could load this information as follows:
>>> import csv >>> genes = HTSeq.GenomicArray( [ "chr1", "chr2", "chr3" ], typecode='O' ) >>> for (chrom, start, end, strand, name) in \ ... csv.reader( open("feature_list.txt"), delimiter="\t" ): ... iv = HTSeq.GenomicInterval( chrom, int(start), int(end), strand ) ... genes[ iv ] = name
Now, to see whether there is a feature at a given GenomicPosition, you just query the genomic array:
>>> print genes[ HTSeq.GenomicPosition( "chr3", 100, "+" ) ] None >>> print genes[ HTSeq.GenomicPosition( "chr3", 200, "+" ) ] gene C
As usual, there is a parser class, called GFF_Reader, that can generate an iterator of objects describing the features. These objects are of type :class`GenomicFeature` and each describes one line of a GFF file. See Section A tour through HTSeq for an example.
As a subclass of FileOrSequence, GFF_Reader can be initialized either with a file name or with an open file or another sequence of lines.
When requesting an iterator, it generates objects of type GenomicFeature.
The GFF specification is unclear on whether the end coordinate marks the last base-pair of the feature (closed intervals, end_included=True) or the one after (half-open intervals, end_included=False). The default, True, is correct for Ensembl GTF files. If in doubt, look for a CDS or stop_codon feature in you GFF file. Its length should be divisible by 3. If “end-start” is divisible by 3, you need end_included=False. If “end-start+1” is divisible by 3, you need end_included=True.
GFF_Reader will convert the coordinates from GFF standard (1-based, end maybe included) to HTSeq standard (0-base, end not included) by subtracting 1 from the start position, and, for end_included=True, also subtract 1 from the end position.
GFF_Reader skips all lines starting with a single ‘#’ as this marks a comment. However, lines starying with ‘##’ contain meta data (at least accoring to the Sanger Institute’s version of the GFF standard.) Such meta data has the format ##key value. When a metadata line is encountered, it is added to the metadata dictionary.
A GenomicFeature object always contains the following attributes:
A name of ID for the feature. As the GFF format does not have a dedicated field for this, the value of the first attribute in the attributes column is assumed to be the name of ID.
The type of the feature, i.e., a string like "exon" or "gene". For GFF files, the 3rd column (feature) is taken as the type.
The interval that the feature covers on the genome. For GFF files, this information is taken from the first (seqname), the forth (start), the fifth (end), and the seventh (strand) column.
When created by a GFF_Reader object, the following attributes are also present, with the information from the remaining GFF columns:
The 2nd column, denoted source in the specification, and intended to specify the data source.
The 8th column (frame), giving the reading frame in case of a coding feature. Its value is an integer (0, 1, or 2), or the string '.' in case that a frame is not specified or would not make sense.
The 6th column (score), giving some numerical score for the feature. Its value is a float, or '.' in case that a score is not specified or would not make sense
The last (9th) column of a GFF file contains attributes, i.e. a list of name/value pairs. These are transformed into a dict, such that, e.g., gf.attr['gene_id'] gives the value of the attribute gene_id in the feature described by GenomicFeature object gf. The parser for the attribute field is reasonably flexible to deal with format variations (it was never clearly established whetehr name and value should be sperarated by a colon or an equal sign, and whether quotes need to be used) and also does a URL style decoding, as is often required.
In order to write a GFF file from a sequence of features, this method is provided:
Returns a line to describe the feature in the GFF format. This works even if the optional attributes given above are missing. Call this for each of your GenomicFeature objects and write the lines into a file to get a GFF file.