2015-06-02

Enhancement of the hgvs package


It’s my honor to take part in Google Summer of Code 2015, where I will work on enhancing the hgvs Python package (https://bitbucket.org/biocommons/hgvs/).

Genomic variants in the human genome play essential roles in many human traits and underlie many human diseases. In both biomedical research and clinical genetics, there is therefore strong demand for accurate and efficient detection, comparison, manipulation and interpretation of these variants.

The hgvs Python package facilitates parsing, formatting and manipulating variants according to the Human Genome Variation Society (HGVS) nomenclature guidelines (www.hgvs.org/mutnomen/). The package contributes to the accurate interpretation and communication of multiple forms of genomic variants, including SNPs, insertions, deletions, duplications and repeats.

The hgvs package already implements many useful functions for manipulating genomic variants, such as validating input variants and converting variants among the DNA, CDS and protein levels. This project will extend the package with the frequently requested features listed below:
  • Variant normalization
  • Equivalence checking of variants
  • Support for more types of genomic variants
  • Liftover of variants between different references or transcripts
  • Easier deployment through a REST interface and Docker images


Variant normalization

One variant may have multiple representations. For instance, although an indel should be written as g.1001_1003delACTinsG according to the HGVS guidelines, semantically equivalent representations such as g.1001_1003ACT>G are also seen. A variant can also have more than one description depending on its sequence context, as happens in homopolymer regions and tandem repeats. For example, the change from AGTTTC to AGTTC could be described as c.3delT, c.4delT, c.5delT, c.3_5TTT>TT or c.3_4TT>T. Variant normalization is therefore essential for comparing, deduplicating, merging and counting variants.

To give each variant a unique representation, normalization shifts the variant to its 3’-most (or 5’-most) position and describes it with as few nucleotides as possible. Once normalization is implemented, comparing variants and checking their equivalence becomes straightforward.
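The shifting and trimming described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the hgvs package's API: the function names are hypothetical, a variant is given as a 1-based position plus reference and alternate alleles on a plain reference string, and an insertion is anchored at the position of the base that follows the insertion point.

```python
def trim(pos, ref, alt):
    """Remove bases shared by both alleles, prefix first, then suffix."""
    while ref and alt and ref[0] == alt[0]:
        ref, alt, pos = ref[1:], alt[1:], pos + 1
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    return pos, ref, alt


def normalize(seq, pos, ref, alt):
    """Return a minimal, 3'-shifted (pos, ref, alt) for a variant on seq.

    pos is 1-based. Only pure insertions and pure deletions are shifted;
    substitutions and delins are returned after trimming.
    """
    pos, ref, alt = trim(pos, ref, alt)
    if bool(ref) == bool(alt):
        # Either no change at all, or both alleles remain: nothing to shift.
        return pos, ref, alt
    is_deletion = bool(ref)
    allele = ref or alt
    i = pos - 1        # 0-based start of the event
    n = len(ref)       # number of reference bases consumed (0 for insertions)
    # Move one base to the right while the next reference base matches the
    # first base of the allele, rotating the allele to keep it in sync.
    while i + n < len(seq) and seq[i + n] == allele[0]:
        allele = allele[1:] + allele[0]
        i += 1
    return (i + 1, allele, "") if is_deletion else (i + 1, "", allele)


def equivalent(seq, variant_a, variant_b):
    """Two variants are equivalent if they normalize to the same form."""
    return normalize(seq, *variant_a) == normalize(seq, *variant_b)


reference = "AGTTTC"
print(normalize(reference, 3, "T", ""))      # c.3delT
print(normalize(reference, 3, "TTT", "TT"))  # c.3_5TTT>TT
print(equivalent(reference, (4, "T", ""), (3, "TT", "T")))
```

With the example from the text, all five descriptions of the AGTTTC to AGTTC change reduce to the same deletion of the T at position 5, so equivalence checking becomes a comparison of normalized forms.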


Adding more types of genomic variants to hgvs

In this project, support for large genomic variants, including inversions and conversions, will be added to hgvs. Complex variants and mosaic variants will also be supported. The nomenclature for these variant types is described in the HGVS mutation nomenclature guidelines.

In hgvs, the variant parser is based on a Parsing Expression Grammar. The parsing rules for these new variant types will therefore be designed first, and then a corresponding class will be implemented for each variant type.
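The "rule first, class second" workflow might look roughly like the sketch below for simple genomic inversions such as g.1077_1080inv. It deliberately uses a regular expression as a stand-in for a real grammar rule, and the names Inversion and parse_inversion are hypothetical; none of this is the existing hgvs grammar or class hierarchy.

```python
import re
from collections import namedtuple

# Hypothetical value class for an inversion over a 1-based, inclusive range.
Inversion = namedtuple("Inversion", ["start", "end"])

# Stand-in for a grammar rule: "g." <start> "_" <end> "inv"
_INVERSION_RULE = re.compile(r"^g\.(?P<start>\d+)_(?P<end>\d+)inv$")


def parse_inversion(text):
    """Parse a genomic inversion such as 'g.1077_1080inv'."""
    match = _INVERSION_RULE.match(text)
    if match is None:
        raise ValueError("not a genomic inversion: %r" % text)
    start, end = int(match.group("start")), int(match.group("end"))
    if start >= end:
        # An inversion has to span at least two nucleotides.
        raise ValueError("start must be smaller than end: %r" % text)
    return Inversion(start, end)


def format_inversion(inversion):
    """Format an Inversion back into its g. description."""
    return "g.%d_%dinv" % (inversion.start, inversion.end)


parsed = parse_inversion("g.1077_1080inv")
print(parsed, format_inversion(parsed))
```

Round-tripping a description through parsing and formatting is a convenient first check for every new variant type.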


Liftover of variants between different references or transcripts

Another useful function for manipulating variants is liftover. Because reference genomes change over time and different groups use different reference transcript sets, it should be possible to lift any input variant over from one reference system to another.
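At its core, liftover is a coordinate mapping between two reference systems. The sketch below assumes the mapping is available as a list of ungapped, same-strand alignment blocks of the form (source_start, target_start, length) with 1-based coordinates. The block values and function names are made up for illustration, and strand changes and gaps inside a variant are not handled.

```python
def liftover_position(pos, blocks):
    """Map a 1-based source position to the target, or None if unaligned."""
    for source_start, target_start, length in blocks:
        if source_start <= pos < source_start + length:
            return target_start + (pos - source_start)
    return None


def liftover_interval(start, end, blocks):
    """Map an inclusive interval; succeeds only inside a single block."""
    for source_start, target_start, length in blocks:
        source_end = source_start + length - 1
        if source_start <= start and end <= source_end:
            offset = target_start - source_start
            return start + offset, end + offset
    return None


# Hypothetical alignment: two blocks separated by a 10-base target insertion.
blocks = [(1, 101, 50), (51, 161, 50)]
print(liftover_position(10, blocks))
print(liftover_interval(45, 55, blocks))
```

An interval that spans two blocks is reported as not liftable here. A fuller implementation would need to decide how to describe such variants in the target reference.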


Migrating hgvs to a REST interface

Currently, the hgvs package uses the UTA database to obtain transcript structures and sequences by connecting directly to a PostgreSQL server. A REST interface is more flexible, more lightweight and easier to use. This project will therefore implement a REST interface for the UTA database using the Django REST framework, and the hgvs package will then be migrated to use this interface when querying external data from UTA. This will reduce the package dependencies of hgvs.
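From the client side, the migration could look something like the sketch below, which needs only the Python standard library instead of a PostgreSQL driver. Everything specific here is an assumption for illustration: the class name, the base URL and the /transcripts/<accession>/exons route are hypothetical and do not describe an existing UTA service.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def http_get_json(url):
    """Fetch a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


class RestTranscriptProvider:
    """Hypothetical data provider that talks to a REST service."""

    def __init__(self, base_url, fetch=http_get_json):
        self.base_url = base_url.rstrip("/")
        self._fetch = fetch  # injectable, so tests need no network access

    def exons_url(self, transcript_accession):
        # Hypothetical route; the real interface is still to be designed.
        accession = quote(transcript_accession, safe="")
        return "%s/transcripts/%s/exons" % (self.base_url, accession)

    def get_exons(self, transcript_accession):
        return self._fetch(self.exons_url(transcript_accession))


# Usage with a stub in place of a real server:
provider = RestTranscriptProvider(
    "https://uta.example.org/api/",
    fetch=lambda url: [{"start": 0, "end": 120}],
)
print(provider.exons_url("NM_000551.3"))
print(provider.get_exons("NM_000551.3"))
```

Keeping the HTTP call behind an injectable function means the rest of hgvs would depend only on the provider's methods, so the backend could change without touching the callers.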


The extended hgvs package will be a comprehensive and easy-to-use tool for manipulating genomic variants in a standard way under the HGVS nomenclature.