2015-08-21

REST interface to the UTA database


Mapping, validating, and normalizing sequence variants requires access to diverse biological sequence data. For example, mapping variants between genomic and transcript coordinates requires the coordinates of exon boundaries, and validating and normalizing a variant requires access to reference sequences.

The hgvs Python package includes a pluggable interface to pull these data from remote sources. The default implementation of this interface uses a sister project, the Universal Transcript Archive (UTA). UTA stores human transcripts, sequences, exon structures, and reference-transcript alignment details. Importantly, the coordinates are exactly as provided by NCBI (using splign), UCSC (using blat), and Ensembl (from the genebuild pipeline).

Currently, the UTA database is mainly used by the hgvs package to map variants between genomic, transcript, and protein coordinates. Using UTA requires users to install libpq, the PostgreSQL network protocol library, on their system, plus the psycopg2 package if they want to access it from Python. While this is not a burden for many, a REST interface would obviate these steps and simplify installation.

Here we implemented a REST proxy for the hgvs dataprovider. This makes it much easier for users to install the hgvs package and also enables others to take advantage of UTA. The REST interface also eliminates the dependency on libpq and the psycopg2 package for most hgvs users.

The REST interface is based on the Flask (http://flask.pocoo.org/) and Flask-RESTful (https://flask-restful.readthedocs.org) frameworks. We chose Flask because it is a widely used Python web framework. The Flask-RESTful extension makes the REST interface easier to design: it lets us map URLs to classes in one place, and it provides helpers for handling query arguments and response fields.
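To make the idea of mapping URLs to handlers in one place more concrete, here is a framework-free sketch that uses only the Python standard library. It is not Flask-RESTful code and not the actual server implementation; the FakeProvider class, the ROUTES table, the dispatch function, and the canned return values are all hypothetical and exist only for illustration.

```python
from urllib.parse import urlsplit, parse_qs


class FakeProvider:
    """Hypothetical stand-in for a data provider; returns canned data."""

    def data_version(self):
        return "uta_example"  # made-up value, for illustration only

    def get_tx_for_gene(self, gene):
        return [{"hgnc": gene, "tx_ac": "NM_EXAMPLE.1"}]  # made-up record


# Single place that maps URL paths to (provider method name, required query arguments).
ROUTES = {
    "/data_version": ("data_version", ()),
    "/tx_for_gene": ("get_tx_for_gene", ("gene",)),
}


def dispatch(provider, url):
    """Resolve a request URL to a provider call and return (status, payload)."""
    parts = urlsplit(url)
    if parts.path not in ROUTES:
        return 404, {"message": "unknown endpoint"}
    method_name, required = ROUTES[parts.path]
    query = parse_qs(parts.query)
    missing = [name for name in required if name not in query]
    if missing:
        return 400, {"message": "missing arguments: " + ", ".join(missing)}
    # Take the first value of each required argument and forward it to the provider.
    kwargs = {name: query[name][0] for name in required}
    return 200, getattr(provider, method_name)(**kwargs)


provider = FakeProvider()
print(dispatch(provider, "/tx_for_gene?gene=VHL"))
```

A real framework adds routing, argument parsing, and response formatting on top of this pattern, but the core stays the same: look up the handler for a path, check the query arguments, and delegate to the data provider.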

In fact, the REST interface to the UTA database is a thin wrapper around the current dataprovider in hgvs. The overall architecture of the REST interface server is as follows:



I also made a Docker image (https://hub.docker.com/r/icebert/uta_rest/) that bundles the UTA REST server with a uWSGI server, which makes the REST server easier to deploy.

UTA is a valuable transcript database that can benefit research on human transcripts and variants.

For example, if you want to find all the transcripts in a given genomic region (say, positions 100000 to 200000 on chr20), you only need to query UTA with: http://api.biocommons.org/tx_for_region?alt_ac=NC_000020.10&alt_aln_method=splign&start=100000&end=200000
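The same query URL can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the build_url helper and the BASE_URL constant are names made up for this example, and the host is simply the one used in the example URL above.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.biocommons.org"  # host taken from the example URL in this post


def build_url(endpoint, **params):
    """Return the GET URL for an endpoint, URL-encoding any query arguments."""
    url = "{}/{}".format(BASE_URL, endpoint)
    if params:
        url += "?" + urlencode(params)
    return url


url = build_url(
    "tx_for_region",
    alt_ac="NC_000020.10",
    alt_aln_method="splign",
    start=100000,
    end=200000,
)
print(url)
```

Using urlencode rather than string concatenation means that argument values containing special characters are escaped correctly.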

Similarly, if you want to find transcripts that are similar to a given transcript (NM_199425.2), simply query UTA with: http://api.biocommons.org/similar_transcripts?tx_ac=NM_199425.2 This returns a list of transcripts that are similar to the given transcript, with the relevant similarity criteria, in JSON format.
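Since the response is JSON, it can be decoded with a few lines of standard-library Python. This is only a sketch: fetch_json and fake_opener are hypothetical helpers, and the payload and its field names (tx_ac1, tx_ac2) are made up so that the example runs offline; the real response fields may differ.

```python
import io
import json
from urllib.request import urlopen


def fetch_json(url, opener=urlopen):
    """GET a URL and decode its JSON body; `opener` can be swapped for offline use."""
    with opener(url) as response:
        return json.loads(response.read().decode("utf-8"))


def fake_opener(url):
    """Offline stand-in for urlopen that returns a made-up JSON payload."""
    return io.BytesIO(b'[{"tx_ac1": "NM_199425.2", "tx_ac2": "NM_EXAMPLE.1"}]')


records = fetch_json(
    "http://api.biocommons.org/similar_transcripts?tx_ac=NM_199425.2",
    opener=fake_opener,  # drop this argument to query the live server instead
)
for record in records:
    print(record["tx_ac1"], record["tx_ac2"])
```

Note that no PostgreSQL client library is involved: a plain HTTP request and a JSON parser are all the client needs.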



Here is a summary of the APIs provided in the UTA REST interface:
Endpoint             Request arguments                   Description
-------------------  ----------------------------------  -----------
data_version         (None)                              Returns the UTA data version.
schema_version       (None)                              Returns the database schema version.
tx_exons             tx_ac, alt_ac, alt_aln_method       Returns transcript exon info for the supplied accession.
tx_info              tx_ac, alt_ac, alt_aln_method       Returns a single transcript info record for the supplied accession.
sequence             ac                                  Fetches a sequence by accession, optionally bounded by [start, end).
tx_for_gene          gene                                Returns transcript info records for the supplied gene, in order of decreasing length.
tx_for_region        alt_ac, alt_aln_method, start, end  Returns transcripts that overlap the given region.
acs_for_protein_seq  seq                                 Returns a list of protein accessions for a given sequence.
gene_info            gene                                Returns basic information about the gene.
tx_mapping_options   tx_ac                               Returns all transcript alignment sets for a given transcript accession.
tx_identity_info     tx_ac                               Returns features associated with a single transcript.
similar_transcripts  tx_ac                               Returns a list of transcripts that are similar to the given transcript, with relevant similarity criteria.
pro_ac_for_tx_ac     tx_ac                               Returns the (single) associated protein accession for a given transcript accession.
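
For client code, the endpoint summary can also be written down as a small lookup table and used to check a request before sending it. This is a sketch under one assumption: that every argument listed for an endpoint is required (the optional start and end bounds of sequence are left out). ENDPOINT_ARGS and missing_args are names invented for this example.

```python
# Request arguments per endpoint, transcribed from the summary in this post.
# Assumption: every listed argument is treated as required.
ENDPOINT_ARGS = {
    "data_version": (),
    "schema_version": (),
    "tx_exons": ("tx_ac", "alt_ac", "alt_aln_method"),
    "tx_info": ("tx_ac", "alt_ac", "alt_aln_method"),
    "sequence": ("ac",),
    "tx_for_gene": ("gene",),
    "tx_for_region": ("alt_ac", "alt_aln_method", "start", "end"),
    "acs_for_protein_seq": ("seq",),
    "gene_info": ("gene",),
    "tx_mapping_options": ("tx_ac",),
    "tx_identity_info": ("tx_ac",),
    "similar_transcripts": ("tx_ac",),
    "pro_ac_for_tx_ac": ("tx_ac",),
}


def missing_args(endpoint, supplied):
    """Return the listed arguments of `endpoint` that are absent from `supplied`."""
    if endpoint not in ENDPOINT_ARGS:
        raise KeyError("unknown endpoint: " + endpoint)
    return [name for name in ENDPOINT_ARGS[endpoint] if name not in supplied]


print(missing_args("tx_for_region", {"alt_ac": "NC_000020.10", "start": 100000}))
```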




