2015-08-21

Implementing Sequence Variant Normalization in the hgvs package


In principle, a sequence variant can have many equivalent representations. For instance, delGinsA could also be written as delTGinsTA. In addition, a variant may have more than one description depending on its sequence context, as happens with variants in repeat regions. For example, the change from AGTTTC to AGTTC could be described as 3delT, 4delT, or 5delT.
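To make the repeat-region example concrete, here is a small standalone sketch (plain Python, not part of hgvs; `delete_base` is a helper made up for this illustration) showing that deleting the T at position 3, 4, or 5 yields the same sequence:

```python
def delete_base(seq, pos):
    """Return seq with the base at 1-based position pos removed."""
    return seq[:pos - 1] + seq[pos:]


ref = "AGTTTC"

# Apply 3delT, 4delT and 5delT to the same reference sequence.
results = {pos: delete_base(ref, pos) for pos in (3, 4, 5)}
print(results)

# All three descriptions produce one and the same alternative sequence.
assert set(results.values()) == {"AGTTC"}
```

Because the three descriptions are indistinguishable at the sequence level, a normalizer has to pick one of them by convention.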


The most basic notion of variant comparison is equivalence of normalized representations, so variant normalization is essential for comparing and counting variants. Several tools have been developed to normalize variants in VCF files, such as vt (https://github.com/atks/vt). Although the HGVS recommendations specify rules for variant normalization, such as using the 3’-most representation of a variant, the hgvs Python package did not support them. A variant in VCF format is represented only by a reference allele and an alternative allele. A variant in HGVS format carries more information, and different types of variants are expressed explicitly in different ways: substitution (>), delins, del, ins, dup, dupN, inv, con, and so on. As a result, normalizing HGVS variants is more complicated than normalizing VCF variants.


As part of my Google Summer of Code project, I implemented normalization of variants in the hgvs package. The normalization procedure has two parts:
  1. trim the common prefix and suffix of the reference allele and the alternative allele;
  2. shuffle the variant as far 3’ (or 5’) as possible.
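The two steps can be sketched in a few lines of plain Python. This is a simplified illustration under my own assumptions, not the code in hgvs or vgraph: the function names are made up, coordinates are 0-based half-open, and the shuffle only handles pure insertions and deletions (for an insertion, `start == end` marks the insertion point).

```python
def trim(ref, alt):
    """Strip the shared suffix, then the shared prefix, of two alleles.

    Returns (offset, ref, alt), where offset is the number of leading
    bases removed.
    """
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    offset = 0
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        offset += 1
    return offset, ref, alt


def shuffle_3prime(seq, start, end, allele):
    """Slide a deleted or inserted allele rightwards along seq."""
    while end < len(seq) and seq[end] == allele[0]:
        allele = allele[1:] + allele[0]  # rotate left by one base
        start, end = start + 1, end + 1
    return start, end, allele


def shuffle_5prime(seq, start, end, allele):
    """Slide a deleted or inserted allele leftwards along seq."""
    while start > 0 and seq[start - 1] == allele[-1]:
        allele = allele[-1] + allele[:-1]  # rotate right by one base
        start, end = start - 1, end - 1
    return start, end, allele


# delTGinsTA trims down to delGinsA, one base further along.
print(trim("TG", "TA"))

# 3delT in AGTTTC (0-based interval [2, 3)) shuffles 3' to 5delT.
print(shuffle_3prime("AGTTTC", 2, 3, "T"))

# 5delT (0-based interval [4, 5)) shuffles 5' back to 3delT.
print(shuffle_5prime("AGTTTC", 4, 5, "T"))
```

Rotating the allele by one base at each step keeps the edited sequence unchanged, which is what makes the shuffled description equivalent to the original one.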


The hgvs normalizer uses a dynamically extended local window to perform normalization. By default, the window size is 3 times the length of the longer of the reference allele and the alternative allele. The reference sequence and alternative sequence are reconstructed from the UTA database and the variant itself. The trimming and shuffling steps are based on the code of vgraph (https://github.com/bioinformed/vgraph). If the shuffling stops before reaching the edge of the window, or reaches the end of the sequence, normalization is finished. If the shuffling reaches the edge of the window, the normalizer extends the window and continues shuffling, until the shuffling stops short of the window boundary.
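The window logic can be sketched roughly as follows. This is an illustration under my own assumptions rather than the hgvs code: `fetch(start, end)` is a hypothetical callable standing in for the sequence lookup, coordinates are 0-based half-open, and only 3’ shuffling of a pure insertion or deletion is shown.

```python
def shuffle_3prime_windowed(fetch, seq_len, start, end, allele, window=None):
    """Shuffle an indel allele 3', fetching sequence one window at a time."""
    if window is None:
        # For a pure indel one allele is empty, so the longer allele
        # is simply the inserted or deleted sequence.
        window = 3 * len(allele)
    while True:
        win_end = min(end + window, seq_len)
        chunk = fetch(end, win_end)
        moved = 0
        while moved < len(chunk) and chunk[moved] == allele[0]:
            allele = allele[1:] + allele[0]  # rotate left by one base
            moved += 1
        start, end = start + moved, end + moved
        # Finished if shuffling stopped inside the window or the window
        # already ends at the end of the sequence; otherwise extend.
        if moved < len(chunk) or win_end == seq_len:
            return start, end, allele


seq = "AG" + "T" * 20 + "C"


def fetch(s, e):
    """Stand-in for a sequence lookup; here it just slices a string."""
    return seq[s:e]


# Delete the first T of a 20-base T run; the window is extended
# repeatedly until the shuffle stops at the last T of the run.
print(shuffle_3prime_windowed(fetch, len(seq), 2, 3, "T"))
```

The point of the window is that only a short stretch of sequence is needed for most variants, while long repeats are still handled by extending the window on demand.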







Here is a real example illustrating how one variant is normalized in hgvs:




Here is a summary of the normalizer in hgvs:

  • The normalizer is configurable in its shuffling direction (3’-most or 5’-most) and in whether shuffling may cross an exon-intron boundary;
The HGVS recommendations are to shuffle variants to the 3’-most position on the reference sequence, whereas variants in VCF files are usually required to be shuffled to the 5’-most position. The HGVS recommendations currently give no guidance on whether shuffling should cross an exon-intron boundary.


  • The priority of variant output types is dupN > dup > ins;
For example, when the reference sequence is AGTTC and the alternative sequence is AGTTTTC, the normalized result is c.4dup2, rather than c.3_4dupT or c.5_6insTT.


  • The normalizer in hgvs supports all kinds of variants except conversions and protein variants (p.).
When crossing the exon-intron boundary is disabled, variants that span an exon-intron boundary cannot be normalized, since the result would be confusing.



Variant normalization will appear in hgvs 0.4.0, due for release in August 2015.











