2015-08-21

Support of Complex Variants in hgvs


The hgvs package already supports parsing and manipulating (mapping, validating and normalizing) the basic variant types, including substitutions, indels, insertions, deletions and duplications.

As part of my Google Summer of Code project, I implemented support for complex variants in hgvs, including compound variants, mosaic variants and chimeric variants. This feature will be incorporated in the hgvs 0.5.0 release.

Compound variants describe multiple variants in one individual. These variants may be on the same chromosome or on different chromosomes. For example, c.[76A>C;83G>C] describes two changes found in one individual on the same chromosome, and c.[76A>C];[83G>C] describes two changes on different chromosomes, one maternal and one paternal. A mosaic variant is two or more different nucleotides at one position caused by somatic mutation, and is represented as c.[83G=/83G>C]. Chimeric variants also describe multiple different nucleotides at one position, but in different cells, for example c.[83G=//83G>C].

In summary, according to the recommendations of HGVS (http://www.hgvs.org/mutnomen/recs-DNA.html), such complex variants are described as:

Compound variant
AC:type.[first edit maternal;second edit maternal];[first edit paternal;second edit paternal]
AC:type.[first edit;second edit]
AC:type.[first edit(;)second edit]

Mosaic variant
AC:type.[edit 1/edit 2/edit 3]

Chimeric variant
AC:type.[edit 1//edit 2//edit 3]
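
To make the bracket syntax above concrete, here is a minimal sketch of how such a description could be split into its parts using only the Python standard library. It is not the hgvs parser: the function name split_complex, the returned tuple layout and the accession NM_000000.0 are made up for illustration. The "(;)" separator is treated as "phase unknown".

```python
import re

# Hypothetical helper for illustration only; not part of the hgvs package.
_COMPLEX = re.compile(r"^(?P<ac>[^:]+):(?P<type>[a-z])\.(?P<body>\[.+\])$")


def split_complex(description):
    """Return (ac, type, kind, groups) for an 'AC:type.[...]' description."""
    match = _COMPLEX.match(description.replace(" ", ""))
    if match is None:
        raise ValueError("not a bracketed complex variant: %r" % description)
    inner = match.group("body")[1:-1]  # drop the outermost brackets
    if "//" in inner:  # "//" must be tested before "/"
        kind, groups = "chimeric", [[edit] for edit in inner.split("//")]
    elif "/" in inner:
        kind, groups = "mosaic", [[edit] for edit in inner.split("/")]
    elif "(;)" in inner:
        kind, groups = "compound, phase unknown", [inner.split("(;)")]
    else:
        # "];[" separates alleles; ";" separates edits on the same allele
        kind = "compound"
        groups = [allele.split(";") for allele in inner.split("];[")]
    return match.group("ac"), match.group("type"), kind, groups


# The accession below is a placeholder, not a real transcript.
print(split_complex("NM_000000.0:c.[76A>C];[83G>C]"))
```

Each inner list holds the edits that belong together: one list per allele for compound variants, and one list per alternative state for mosaic and chimeric variants.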

These complex variants contain multiple posedits (position + edit). That is the biggest difference from simple variants, which contain only one posedit each. For compound variants, we also need to store the phase information for each variant.

We modeled complex variants as a list of simple variants, rather than just a list of posedits. Here is the model for representing complex variants in hgvs internally:

[Figure: internal model of a complex variant as a list of simple variants]

So each item in a complex variant is a simple variant, which has an accession, a type and one posedit. Here are the reasons we do this:
  1. Although in most cases the accession is the same for all posedits in a complex variant, when we map a genomic variant (g.) to the transcript level (c.), the different posedits may map to different transcripts. Thus, each item in a complex variant should store its own accession. We also provide a function that checks whether all the posedits in a complex variant have the same accession.
  2. Modeling complex variants as a list of simple variants provides straightforward access to the subordinate variants by index.
  3. This also makes it much easier to map, validate and normalize complex variants. Each of these operations on a complex variant becomes the same operation applied to its simple variants one by one.
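
The three points above can be sketched with a few lines of Python. This is only an illustration of the idea, not the actual hgvs classes: SimpleVariant, ComplexVariant, has_single_accession, apply and toy_g_to_c are hypothetical names, posedits are kept as plain strings, and the accessions and coordinates are made up.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SimpleVariant:
    ac: str       # accession
    type: str     # "g", "c", ...
    posedit: str  # kept as a plain string in this sketch


@dataclass
class ComplexVariant:
    variants: List[SimpleVariant]

    def __getitem__(self, index: int) -> SimpleVariant:
        # Point 2: subordinate variants are reachable by index
        return self.variants[index]

    def __len__(self) -> int:
        return len(self.variants)

    def has_single_accession(self) -> bool:
        # Point 1: each item carries its own accession, so we can compare them
        return len({v.ac for v in self.variants}) == 1

    def apply(self, func: Callable[[SimpleVariant], SimpleVariant]) -> "ComplexVariant":
        # Point 3: an operation on the whole is the operation on each item
        return ComplexVariant([func(v) for v in self.variants])


def toy_g_to_c(variant: SimpleVariant) -> SimpleVariant:
    # Stand-in for a real g. to c. mapping; the lookup table is invented
    table = {
        "1000A>C": ("NM_000001.0", "76A>C"),
        "2000G>C": ("NM_000002.0", "83G>C"),
    }
    ac, posedit = table[variant.posedit]
    return SimpleVariant(ac, "c", posedit)


genomic = ComplexVariant([
    SimpleVariant("NC_000001.0", "g", "1000A>C"),
    SimpleVariant("NC_000001.0", "g", "2000G>C"),
])
coding = genomic.apply(toy_g_to_c)
print(genomic.has_single_accession(), coding.has_single_accession())
```

In this toy mapping the two genomic posedits land on different transcripts, which is exactly the case that a single shared accession could not represent.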

We also provide a complete set of functions for manipulating complex variants in the same way as simple variants. Complex and simple variants have the same attributes and support the same operations. The only difference is that a complex variant returns a list where a simple variant returns a single value.



For a complex variant var (compound, mosaic or chimeric):

Attribute            Result class
var[0]               SequenceVariant
var[0].posedit       PosEdit
var[0].posedit.pos   Location
var[0].posedit.edit  Edit
var.posedit          PosEditSet (a list of PosEdit)
var.posedit.pos      a list of Location
var.posedit.edit     a list of Edit
var.ac               the ac (if all ac are the same) or a list of ac
var.type             the type (all types should be the same)
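
The attribute behaviour listed above can be mimicked with small stand-in classes. The sketch below is an assumption about how such aggregation could look, not the hgvs implementation: ToyPosEdit, ToySimple, ToyPosEditSet and ToyComplex are hypothetical names, positions and edits are plain strings, and the accession is a placeholder.

```python
from collections import namedtuple

# Stand-ins for PosEdit and SequenceVariant; not the real hgvs classes.
ToyPosEdit = namedtuple("ToyPosEdit", ["pos", "edit"])
ToySimple = namedtuple("ToySimple", ["ac", "type", "posedit"])


class ToyPosEditSet(list):
    """A list of posedits whose .pos and .edit return lists."""

    @property
    def pos(self):
        return [pe.pos for pe in self]

    @property
    def edit(self):
        return [pe.edit for pe in self]


class ToyComplex:
    def __init__(self, variants):
        self.variants = list(variants)

    def __getitem__(self, index):
        return self.variants[index]

    @property
    def posedit(self):
        return ToyPosEditSet(v.posedit for v in self.variants)

    @property
    def ac(self):
        # A single value when all items agree, otherwise the full list
        acs = [v.ac for v in self.variants]
        return acs[0] if len(set(acs)) == 1 else acs

    @property
    def type(self):
        types = {v.type for v in self.variants}
        if len(types) != 1:
            raise ValueError("all items should have the same type")
        return types.pop()


var = ToyComplex([
    ToySimple("NM_000000.0", "c", ToyPosEdit("76", "A>C")),
    ToySimple("NM_000000.0", "c", ToyPosEdit("83", "G>C")),
])
print(var[0].posedit.pos, var.posedit.pos, var.ac, var.type)
```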



