Mote

2015-08-25

Summary of the extending hgvs project

Time goes by so fast: about four months have passed since I started working on this project as a GSoC student. Fortunately, all three goals of the project have been accomplished:

Implement the variant normalizer

The normalizer in hgvs is extensively tested on all kinds of variants, including variants in extreme contexts such as those located at exon-intron boundaries. The normalizer is also flexible: it can be configured to shuffle in the 3’ or 5’ direction, and users can choose whether shuffling is allowed to cross exon-intron boundaries.
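To make "shuffling" concrete, here is a minimal toy sketch of shifting a deletion in the 3’ or 5’ direction. It is not the hgvs implementation: the function name shuffle_deletion, the 0-based half-open coordinates and the bounds argument (standing in for an exon that must not be crossed) are all assumptions made for illustration.

```python
def shuffle_deletion(seq, start, end, direction=3, bounds=None):
    """Shift the deleted interval seq[start:end] as far as it can go.

    direction: 3 shifts toward the 3' end, 5 toward the 5' end.
    bounds: optional (lo, hi) half-open window, e.g. an exon, that the
            interval must not leave (hypothetical parameter).
    """
    lo, hi = bounds if bounds is not None else (0, len(seq))
    if direction == 3:
        # Moving right keeps the resulting sequence unchanged as long as
        # the base leaving the interval equals the base entering it.
        while end < hi and seq[start] == seq[end]:
            start, end = start + 1, end + 1
    elif direction == 5:
        while start > lo and seq[start - 1] == seq[end - 1]:
            start, end = start - 1, end - 1
    else:
        raise ValueError("direction must be 3 or 5")
    return start, end


# Deleting "CA" inside the repeat of GCACACAT can be written at several positions.
right = shuffle_deletion("GCACACAT", 1, 3, direction=3)
left = shuffle_deletion("GCACACAT", 5, 7, direction=5)
bounded = shuffle_deletion("GCACACAT", 1, 3, direction=3, bounds=(0, 5))
```

Every position the interval passes through yields the same sequence after deletion, which is why a normalizer has to pick one end of the range as the canonical form.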

Support the parsing and manipulating of complex variants

Substitutions, indels, insertions, deletions and duplications were already supported in hgvs. Now hgvs also supports parsing and manipulating complex variants, including compound variants, mosaic variants and chimeric variants, which are composed of multiple simple sequence variants.

Add REST interface to the UTA database

The Universal Transcript Archive (UTA) database stores rich transcript-related information, including sequences, exon structures and reference-transcript alignments. It is not only used by the hgvs package when mapping, validating and normalizing variants, but could also benefit research on human transcripts. The REST interface makes it much easier for users to access data from this database. It also simplifies the installation of the hgvs package.

Here I list the main features of the extended hgvs package:

It supports parsing and manipulating all kinds of HGVS variants specified by the nomenclature, including sub, delins, del, ins, dup, inv, con, compound variants, mosaic variants and chimeric variants; only translocation variants are not supported.
It provides a full set of variant manipulation operations, including mapping among genome, transcript and protein sequences, internal and external variant validation, and variant normalization for all kinds of genomic and transcript variants.
It facilitates batch processing of large numbers of variants.

I am very proud to be a contributor to this useful project. I am pleased to see the hgvs package becoming a more and more powerful and comprehensive package for parsing and manipulating all kinds of HGVS variants. I hope the hgvs package will benefit the genomic variant research community.

Here I’d like to express my great thanks to Dr. Reece Hart, who gave me a lot of guidance and suggestions during this project. We have met online every Tuesday and Friday since the beginning of the project to discuss my questions and what to do next. Without his kind help, I could not have completed this project successfully. I would also like to thank Dr. Kevin Jacobs for his helpful discussions and suggestions on implementing the normalizer.

It has been a nice experience to participate in GSoC and make my contribution to an open source project. Besides the technical things, I also learned a lot during the development, including how to work on a collaborative project over the Internet and how to work with branches. This is a wonderful and memorable journey in my life.

2015-08-21

REST interface to the UTA database

Mapping, validating, and normalizing sequence variants requires access to diverse biological sequence data. For example, mapping variants between genomic and transcript coordinates requires coordinates for exon boundaries, and validating and normalizing a variant requires access to reference sequences.

The hgvs Python package includes a pluggable interface to pull these data from remote sources. The default implementation of this interface uses a sister project, the Universal Transcript Archive (UTA). UTA stores human transcripts, sequences, exon structures and reference-transcript alignment details. Importantly, the coordinates are exactly as provided by NCBI (using splign), UCSC (using blat), and Ensembl (from the genebuild pipeline).

Currently, the UTA database is mainly used by the hgvs package to map variants between genomic, transcript, and protein coordinates. Using UTA requires users to install libpq, the PostgreSQL network protocol library, on their system, and to install the psycopg2 package if they want to access it through Python. While this is not a burden for many, a REST interface would obviate these steps and simplify installation.

Here we implemented a REST proxy for the hgvs dataprovider. This makes it much easier for users to install the hgvs package, and also enables others to take advantage of UTA. The REST interface also eliminates the dependency on libpq and the psycopg2 package for most hgvs users.

The REST interface is based on the Flask (http://flask.pocoo.org/) and Flask-RESTful (https://flask-restful.readthedocs.org) frameworks. We chose Flask because it is a widely used Python web framework. The Flask-RESTful extension makes it easier to design the REST interface: it allows URLs to be mapped to classes in one place, and it provides functionality for handling query arguments and response fields.
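The idea of keeping the URL mapping in one place and checking query arguments before calling the data layer can be sketched with the standard library alone. This is only an analogy, not Flask-RESTful code and not the actual server: ToyDataProvider, ROUTES and handle are hypothetical names, and the returned records are canned placeholders.

```python
import json
from urllib.parse import urlsplit, parse_qs


class ToyDataProvider:
    """Hypothetical stand-in for a data provider; returns canned data."""

    def data_version(self):
        return "toy_version"

    def tx_for_gene(self, gene):
        return [{"gene": gene, "tx_ac": "TX_PLACEHOLDER.1"}]


# One table maps each URL path to a provider method and its required arguments.
ROUTES = {
    "/data_version": ("data_version", []),
    "/tx_for_gene": ("tx_for_gene", ["gene"]),
}


def handle(url, provider):
    """Dispatch a request URL to the provider; return (status, JSON body)."""
    parts = urlsplit(url)
    if parts.path not in ROUTES:
        return 404, json.dumps({"error": "unknown endpoint"})
    method_name, required = ROUTES[parts.path]
    # Keep the first value of each query argument.
    query = {key: values[0] for key, values in parse_qs(parts.query).items()}
    missing = [name for name in required if name not in query]
    if missing:
        return 400, json.dumps({"error": "missing arguments", "missing": missing})
    kwargs = {name: query[name] for name in required}
    result = getattr(provider, method_name)(**kwargs)
    return 200, json.dumps(result)


status, body = handle("http://localhost/tx_for_gene?gene=VHL", ToyDataProvider())
```

Because each route only forwards validated arguments to one provider method, the web layer stays thin and the data logic lives in a single place.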

In fact, the REST interface to the UTA database is a thin wrapper around the current dataprovider in hgvs. The overall architecture of the REST interface server is as follows:

I also made a Docker image (https://hub.docker.com/r/icebert/uta_rest/) that integrates the UTA REST server and a uWSGI server. This makes the REST server easier to deploy.

UTA is a valuable transcript database that could benefit research on human transcripts and variants.

For example, if you want to find all the transcripts in a given genomic region (taking positions 100000 to 200000 on chr20 for instance), you only need to query UTA using: http://api.biocommons.org/tx_for_region?alt_ac=NC_000020.10&alt_aln_method=splign&start=100000&end=200000

And if you want to find similar transcripts for a given transcript (NM_199425.2), you simply query UTA with: http://api.biocommons.org/similar_transcripts?tx_ac=NM_199425.2 This returns a list of transcripts that are similar to the given transcript, with relevant similarity criteria, in JSON format.
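Query URLs like the two above can also be assembled programmatically. The following is a small sketch using only the standard library; the helper name build_query_url is hypothetical, and the base URL is simply the one used in the examples above, with no claim that the service is reachable.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.biocommons.org"  # base URL taken from the examples above


def build_query_url(endpoint, **params):
    """Return the request URL for an endpoint and its query arguments."""
    query = urlencode(params)
    if query:
        return f"{BASE_URL}/{endpoint}?{query}"
    return f"{BASE_URL}/{endpoint}"


region_url = build_query_url(
    "tx_for_region",
    alt_ac="NC_000020.10",
    alt_aln_method="splign",
    start=100000,
    end=200000,
)
similar_url = build_query_url("similar_transcripts", tx_ac="NM_199425.2")
print(region_url)
print(similar_url)
```

The resulting strings can be passed to any HTTP client, and the JSON response decoded with the json module.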

Here is a summary of the APIs provided in the UTA REST interface:

Endpoint	Request arguments	Description
data_version	(None)	Return the UTA data version.
schema_version	(None)	Return the database schema version.
tx_exons	tx_ac alt_ac alt_aln_method	Return transcript exon info for the supplied accession.
tx_info	tx_ac alt_ac alt_aln_method	Return a single transcript info record for the supplied accession.
sequence	ac	Fetch a sequence by accession, optionally bounded by [start, end).
tx_for_gene	gene	Return transcript info records for the supplied gene, in order of decreasing length.
tx_for_region	alt_ac alt_aln_method start end	Return transcripts that overlap the given region.
acs_for_protein_seq	seq	Return a list of protein accessions for a given sequence.
gene_info	gene	Return basic information about the gene.
tx_mapping_options	tx_ac	Return all transcript alignment sets for a given transcript accession.
tx_identity_info	tx_ac	Return features associated with a single transcript.
similar_transcripts	tx_ac	Return a list of transcripts that are similar to the given transcript, with relevant similarity criteria.
pro_ac_for_tx_ac	tx_ac	Return the (single) associated protein accession for a given transcript accession.

Support of Complex Variants in hgvs

The hgvs package already supported parsing and manipulating (mapping, validating and normalizing) basic types of variants, including substitutions, indels, insertions, deletions and duplications.

As part of my Google Summer of Code project, I implemented support for complex variants in hgvs, including compound variants, mosaic variants and chimeric variants. This feature will be incorporated in the hgvs 0.5.0 release.

Compound variants describe multiple variants in one individual. These variants may be on the same chromosome or on different chromosomes. For example, c.[76A>C;83G>C] describes two changes found on the same chromosome in one individual, while c.[76A>C];[83G>C] describes two changes on different chromosomes, one maternal and one paternal. A mosaic variant describes two or more different nucleotides at one position caused by somatic mutation, and is represented as c.[83G=/83G>C]. A chimeric variant describes multiple different nucleotides at one position but in different cells, for example c.[83G=//83G>C].

In summary, according to the recommendations of HGVS (http://www.hgvs.org/mutnomen/recs-DNA.html), such complex variants are described as:

Compound variant

AC:type.[first edit maternal;second edit maternal];[first edit paternal;second edit paternal]

AC:type.[first edit;second edit]

AC:type.[first edit(;)second edit]

Mosaic variant

AC:type.[edit 1/edit 2/edit 3]

Chimeric variant

AC:type.[edit 1//edit 2//edit 3]
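The bracket syntax above can be taken apart with a few lines of string handling. This is a toy sketch, not the hgvs grammar or parser: split_complex and PATTERN are hypothetical names, the accession is treated as optional so the short examples from the text work, and the "(;)" unknown-phase marker is split like ";" without recording the phase.

```python
import re

# Optional accession, one-letter type, then one or more bracketed groups.
PATTERN = re.compile(r"^(?:(?P<ac>[^:]+):)?(?P<type>[a-z])\.(?P<body>\[.*\])$")


def split_complex(variant):
    """Split a bracketed variant string into its kind and groups of edits."""
    match = PATTERN.match(variant.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a bracketed variant: {variant!r}")
    alleles = re.findall(r"\[([^\]]*)\]", match.group("body"))
    if len(alleles) == 1 and "//" in alleles[0]:
        kind, groups = "chimeric", [alleles[0].split("//")]
    elif len(alleles) == 1 and "/" in alleles[0]:
        kind, groups = "mosaic", [alleles[0].split("/")]
    else:
        kind = "compound"
        # "(;)" marks unknown phase; this toy only uses it as a separator.
        groups = [allele.replace("(;)", ";").split(";") for allele in alleles]
    return {
        "ac": match.group("ac"),
        "type": match.group("type"),
        "kind": kind,
        "groups": groups,
    }


same_chromosome = split_complex("c.[76A>C;83G>C]")
both_chromosomes = split_complex("c.[76A>C];[83G>C]")
mosaic = split_complex("c.[83G=/83G>C]")
chimeric = split_complex("c.[83G=//83G>C]")
```

For compound variants each inner list is one allele, so the number of groups distinguishes "same chromosome" from "one change per chromosome".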

These complex variants contain multiple posedits (position + edit). That is the biggest difference from simple variants, which contain only one posedit each. For compound variants, we also need to store the phase information for each variant.

We modeled complex variants as a list of simple variants, instead of only a list of posedits. Here is the model for representing complex variants in hgvs internally:

So each item in a complex variant is a simple variant, which has an accession number, a type and one posedit. Here are the reasons we do this:

Although in most cases the accession number is the same for all posedits in a complex variant, when we map a genomic variant (g.) to the transcript level (c.), different posedits may map to different transcripts. Thus, each item in a complex variant should store its own accession. We also provide a function that checks whether all the posedits in a complex variant have the same accession.
Modeling complex variants as a list of simple variants provides straightforward access to the subordinate variants by index.
This also makes it much easier to map, validate and normalize complex variants: manipulating a complex variant becomes manipulating each of its simple variants one by one.
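The three reasons above can be illustrated with a short sketch. It is not the hgvs class hierarchy: SimpleVariant, ComplexVariant, has_single_accession, apply and toy_map are hypothetical names, posedits are kept as plain strings, and the accessions are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SimpleVariant:
    ac: str       # accession
    type: str     # coordinate type, e.g. "g" or "c"
    posedit: str  # position plus edit, kept as plain text in this sketch


@dataclass
class ComplexVariant:
    variants: List[SimpleVariant]

    def __getitem__(self, index):
        # Subordinate variants are reachable by index.
        return self.variants[index]

    def __len__(self):
        return len(self.variants)

    def has_single_accession(self):
        return len({v.ac for v in self.variants}) == 1

    def apply(self, func: Callable[[SimpleVariant], SimpleVariant]):
        # Manipulating a complex variant is manipulating each item in turn.
        return ComplexVariant([func(v) for v in self.variants])


def toy_map(variant):
    # Hypothetical mapper: pretend the two edits land on different transcripts.
    target = "TX_A" if variant.posedit == "76A>C" else "TX_B"
    return SimpleVariant(target, "c", variant.posedit)


var = ComplexVariant([
    SimpleVariant("AC_1", "g", "76A>C"),
    SimpleVariant("AC_1", "g", "83G>C"),
])
mapped = var.apply(toy_map)
```

After the toy mapping the items no longer share an accession, which is the situation that motivates storing the accession on every item.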

We also provide a complete set of functions so that complex variants can be manipulated in the same way as simple variants. Complex variants and simple variants have the same attributes and support the same operations. The only difference is that a complex variant gives a list as the result, while a simple variant gives a single value.

For a complex variant var (compound variant, mosaic variant or chimeric variant):

Attributes	Result class
var[0]	SequenceVariant
var[0].posedit	PosEdit
var[0].posedit.pos	Location
var[0].posedit.edit	Edit

var.posedit	PosEditSet (a list of PosEdit)
var.posedit.pos	a list of Location
var.posedit.edit	a list of Edit

var.ac	the ac (if all acs are the same) or a list of acs
var.type	the type (all type should be the same)
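The attribute behavior in the table can be mimicked with a few properties. This is a sketch under assumptions rather than the hgvs classes: PosEditList is a hypothetical stand-in for the PosEditSet named in the table, positions and edits are plain strings, and the accessions are placeholders.

```python
from collections import namedtuple

PosEdit = namedtuple("PosEdit", ["pos", "edit"])
SimpleVariant = namedtuple("SimpleVariant", ["ac", "type", "posedit"])


class PosEditList(list):
    """A list of PosEdit whose .pos and .edit are lists as well."""

    @property
    def pos(self):
        return [posedit.pos for posedit in self]

    @property
    def edit(self):
        return [posedit.edit for posedit in self]


class ComplexVariant:
    def __init__(self, variants):
        self.variants = list(variants)

    def __getitem__(self, index):
        return self.variants[index]

    @property
    def posedit(self):
        return PosEditList(v.posedit for v in self.variants)

    @property
    def ac(self):
        # A single value when all items agree, otherwise the full list.
        acs = [v.ac for v in self.variants]
        return acs[0] if len(set(acs)) == 1 else acs

    @property
    def type(self):
        types = {v.type for v in self.variants}
        if len(types) != 1:
            raise ValueError("all items must share one type")
        return types.pop()


var = ComplexVariant([
    SimpleVariant("AC_1", "c", PosEdit("76", "A>C")),
    SimpleVariant("AC_1", "c", PosEdit("83", "G>C")),
])
mixed = ComplexVariant([
    SimpleVariant("AC_1", "c", PosEdit("76", "A>C")),
    SimpleVariant("AC_2", "c", PosEdit("83", "G>C")),
])
```

Indexing gives the single-value form (var[0].posedit.pos), while the same attribute on the complex variant (var.posedit.pos) gives the list form.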