VarSome API Cost Calculation

How to use the VarSome API efficiently and at a low cost.

Executive Summary

The VarSome API is an incredibly simple & powerful tool allowing a developer to instantly access the 125+ databases integrated into VarSome along with automated ACMG or AMP classifications.

It is possible to annotate whole exomes or even whole genomes extremely efficiently and at a low cost using the VarSome API, with options to add VarSome’s automated ACMG & AMP classifications, or pull in additional data from specific databases of interest.

The most cost-effective results are achieved by annotating only a subset of variants: coding variants and those close to a canonical splice site (within 10 bp of flanking sequence, for example). You can select this subset using tools such as bedtools combined with a GFF3 file for your genome of interest.
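As an illustration of the selection step, here is a minimal pure-Python sketch that keeps only variants falling inside CDS features padded by 10 bp. It is not the pipeline used for the tests in this document: the function names, the tiny inline GFF3 sample and the choice of "CDS" as the feature type are all assumptions, and a real pipeline would more likely use bedtools or an interval index rather than a linear scan.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), 1-based inclusive as in GFF3


def coding_intervals(gff3_lines: Iterable[str], flank: int = 10) -> List[Interval]:
    """Collect CDS features from GFF3 text, padded by `flank` bases on each side."""
    intervals = []
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue  # keep coding features only (assumed feature type)
        chrom, start, end = fields[0], int(fields[3]), int(fields[4])
        intervals.append((chrom, max(1, start - flank), end + flank))
    return intervals


def in_scope(chrom: str, pos: int, intervals: List[Interval]) -> bool:
    """True if a 1-based position falls inside any padded interval (linear scan)."""
    return any(c == chrom and s <= pos <= e for c, s, e in intervals)


# Hypothetical sample data, for illustration only.
gff3 = [
    "##gff-version 3",
    "chr1\texample\tCDS\t100\t200\t.\t+\t0\tID=cds1",
    "chr1\texample\tgene\t1\t5000\t.\t+\t.\tID=gene1",
]
regions = coding_intervals(gff3)
variants = [("chr1", 95), ("chr1", 150), ("chr1", 211), ("chr2", 150)]
selected = [v for v in variants if in_scope(v[0], v[1], regions)]
print(selected)
```

Only the variants in `selected` would then be sent to the API; the rest never incur any data transfer.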

The VarSome API includes a filter to automatically remove high-frequency variants.

In our tests, we annotated a sample exome, then restricted the annotation to only coding/splicing variants, and then finally filtered the remaining variants using the ACMG BA1 benign frequency threshold of 5%.

 

Subset               Variants   ACMG              ACMG + AMP        All Data
Coding or flanking   24 954     13.6 KB/variant   21.1 KB/variant   131 KB/variant
Freq < 5%            4 205      11.5 KB/variant   17.9 KB/variant   161 KB/variant

Non-coding variants usually have substantially less data each, and intergenic variants even less.

Results will vary depending on your pipeline and specific use case: which additional databases you require, whether you are annotating tumor samples, and whether you are annotating multiple family members. You can of course cache previous results to avoid re-annotating variants that recur within a family group or cohort.

As of March 2021, annotating a whole exome using ACMG, restricted to coding & flanking variants with an allele frequency below 5%, would cost approximately $15 (€12, CHF 13). Furthermore, API pricing is degressive: the unit price falls the more you use it.

Pricing

Currently, the API is priced based on the amount of data exchanged. All data is returned in JSON format. This document aims to explain some of the intricacies and how to keep the costs extremely low.

The following sections give some recommendations on how to use the API efficiently; the full reference guide is available at https://api.varsome.com/.

Options

The API gives full control to the user: in order to keep costs reasonable or very low, you need to decide which data you actually need. This is controlled by the following options:

  • add-ACMG-annotation: if True, the responses will include the minimum set of databases required for our Germline classification.
  • add-AMP-annotation: if this option is enabled, it will add all the cancer databases required for AMP, on top of those used by ACMG.
  • add-all-data: use this flag sparingly as it will add all possible annotations from all sources. This can be useful to find out which sources you might like to include (or exclude) but will incur higher costs.
  • expand-pubmed-articles: if set, this will add a dictionary containing all the publications referenced in the annotation, including title, abstract, authors, journal, identifiers etc.
  • allele-frequency-threshold: this is a filter that can dramatically reduce the volume of annotations: any variant whose gnomAD genomes allele frequency is greater than the provided threshold will not be annotated.
  • add-source-databases: this takes a comma-separated list of database names as its argument, and will add these to the annotation. This is best used in conjunction with the preceding flags if you require some additional detail.
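To show how these options combine, the sketch below builds a request URL without sending it. The option names come from the list above; everything else is an assumption made for illustration and should be checked against the reference guide: the "/lookup/<variant>/<genome>" path, the variant notation, the "hg19" genome label and the use of "1" to switch a flag on.

```python
from urllib.parse import quote, urlencode

API_ROOT = "https://api.varsome.com"


def build_lookup_url(variant, genome="hg19", acmg=True, amp=False,
                     frequency_threshold=None, source_databases=()):
    """Assemble a single-variant request URL from the options described above."""
    params = {}
    if acmg:
        params["add-ACMG-annotation"] = "1"  # "1" as the enabled value is an assumption
    if amp:
        params["add-AMP-annotation"] = "1"
    if frequency_threshold is not None:
        params["allele-frequency-threshold"] = str(frequency_threshold)
    if source_databases:
        params["add-source-databases"] = ",".join(source_databases)
    # The endpoint path below is an assumption, not taken from this document.
    path = f"{API_ROOT}/lookup/{quote(variant, safe='')}/{genome}"
    return f"{path}?{urlencode(params)}" if params else path


# ACMG annotation only, skipping variants above the 5% BA1 threshold.
url = build_lookup_url("chr7:140453136:A:T", frequency_threshold=0.05)
print(url)
```

Leaving add-all-data and expand-pubmed-articles out of the builder is deliberate: they are the two options most likely to inflate the response size.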

Data-Sizes

PubMed Articles

Only use “expand-pubmed-articles” on a very small set of variants: the large amount of text in abstracts will rapidly make this option unviable. Instead, we recommend using the “pubmed_info” API call on a case-by-case basis and caching the results in your own database, for example https://api.varsome.com/pubmed_info/12345:45678.
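A minimal sketch of such a cache follows. It batches only the identifiers it has not seen before into one colon-separated “pubmed_info” URL, matching the example above. The `PubmedCache` class, the injected `fetch` callable and the assumed response shape (a dictionary keyed by PubMed ID) are hypothetical; the demo uses a fake fetcher so that nothing is sent over the network.

```python
from typing import Callable, Dict, Iterable

PUBMED_INFO_ROOT = "https://api.varsome.com/pubmed_info/"


class PubmedCache:
    """In-memory cache; a real deployment would persist to a database."""

    def __init__(self, fetch: Callable[[str], Dict[str, dict]]):
        self._fetch = fetch  # takes a URL, returns {pubmed_id: record} (assumed shape)
        self._store: Dict[str, dict] = {}

    def get(self, pubmed_ids: Iterable[str]) -> Dict[str, dict]:
        ids = list(dict.fromkeys(str(i) for i in pubmed_ids))  # de-duplicate, keep order
        missing = [i for i in ids if i not in self._store]
        if missing:
            # One request for all uncached IDs, colon-separated as in the example URL.
            self._store.update(self._fetch(PUBMED_INFO_ROOT + ":".join(missing)))
        return {i: self._store[i] for i in ids if i in self._store}


requested = []


def fake_fetch(url: str) -> Dict[str, dict]:
    """Stand-in for an HTTP call: records the URL and returns placeholder records."""
    requested.append(url)
    ids = url.rsplit("/", 1)[1].split(":")
    return {i: {"title": f"Article {i}"} for i in ids}


cache = PubmedCache(fake_fetch)
first = cache.get(["12345", "45678"])
second = cache.get(["45678", "99999"])  # only 99999 is fetched this time
print(requested)
```

Because publication records do not change, entries in this cache never need to expire.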

Allele Frequency Threshold

This is a very useful tool: set it to 0.05, for example, to remove all BA1 variants. We find that on average this reduces the amount of data by a factor of 4 to 6.

Data Sizes

We annotated a whole-exome to measure the current data sizes returned by the API:

 

Subset               Variants   ACMG              ACMG + AMP        All Data
Coding or flanking   24 954     13.6 KB/variant   21.1 KB/variant   131 KB/variant
Freq < 5%            4 205      11.5 KB/variant   17.9 KB/variant   161 KB/variant

  • Coding variants return on average 2x as much data as non-coding variants.
  • Enabling AMP incurs an overhead of roughly 55 to 60% over ACMG, as many additional cancer databases are included.
  • Enabling “all data” increases the JSON by a factor of 10 or more and should only be used extremely conservatively on the subset of variants of interest.
  • Expanding publications to retrieve the title & abstract should be done externally to annotation, maybe specifically on the subset of variants of interest. Furthermore, the publication data should be cached as it will not change.
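The per-variant figures above translate into total data volumes by simple multiplication. The sketch below does that arithmetic for each row and option, assuming 1 MB = 1000 KB (the document does not say whether KB is decimal or binary); the dictionary keys and the `total_mb` helper are illustrative names only.

```python
# Figures copied from the table above: variant counts and average KB per variant.
ROWS = {
    "coding_or_flanking": {"variants": 24954, "acmg": 13.6, "acmg_amp": 21.1, "all_data": 131.0},
    "freq_below_5pct": {"variants": 4205, "acmg": 11.5, "acmg_amp": 17.9, "all_data": 161.0},
}


def total_mb(row: dict, option: str) -> float:
    """Total response size in MB for one subset and one option (1 MB = 1000 KB assumed)."""
    return row["variants"] * row[option] / 1000


for name, row in ROWS.items():
    sizes = {opt: round(total_mb(row, opt), 1) for opt in ("acmg", "acmg_amp", "all_data")}
    amp_overhead = row["acmg_amp"] / row["acmg"] - 1      # extra data from enabling AMP
    all_data_factor = row["all_data"] / row["acmg"]       # growth from enabling all data
    print(name, sizes, f"AMP +{amp_overhead:.0%}", f"all data x{all_data_factor:.1f}")
```

For the frequency-filtered ACMG case this works out to roughly 48 MB for the whole exome, compared with about 339 MB when the coding or flanking variants are annotated without the frequency filter.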

Caching

We recommend that you cache frequently used data:

  • Variant annotations: caching these may be of limited use, as new information or improvements to our classification algorithms could be missed unless the cache is periodically refreshed.
  • Variants: it may be useful to know that you have seen a variant before in another sample. For this you can use the unique 64-bit “variant_id” that VarSome assigns to each variant (equivalent variants are given the same variant_id).
  • Publications: it is definitely worthwhile building your own cache of publication information. You can either download this directly from NCBI PubMed or use the expanded data from VarSome.
  • Genes: when storing annotations for a given sample, you may like to extract the gene annotations which will be identical for every variant in a given gene. We use this for our own VarSome Clinical platform as it can reduce space by 75%.
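The first two points can be combined in one small structure: an index keyed by variant_id that remembers which samples contained a variant, plus a stored annotation that is treated as stale after a maximum age so that it gets refreshed periodically. This is only a sketch under stated assumptions: the `VariantCache` class, the 30-day maximum age, the example variant_id value and the annotation payload are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Set, Tuple


class VariantCache:
    """Tracks samples per variant_id and serves annotations only while they are fresh."""

    def __init__(self, max_age_seconds: float, clock: Callable[[], float] = time.time):
        self._max_age = max_age_seconds
        self._clock = clock  # injectable so that expiry can be demonstrated without waiting
        self._annotations: Dict[int, Tuple[float, dict]] = {}
        self._samples: Dict[int, Set[str]] = {}

    def record(self, variant_id: int, sample: str, annotation: dict) -> None:
        """Store the latest annotation and note that `sample` carries this variant."""
        self._annotations[variant_id] = (self._clock(), annotation)
        self._samples.setdefault(variant_id, set()).add(sample)

    def seen_in(self, variant_id: int) -> Set[str]:
        """Samples in which this variant has been observed (never expires)."""
        return set(self._samples.get(variant_id, ()))

    def fresh_annotation(self, variant_id: int) -> Optional[dict]:
        """Return the cached annotation, or None if absent or older than the maximum age."""
        entry = self._annotations.get(variant_id)
        if entry is None:
            return None
        stored_at, annotation = entry
        if self._clock() - stored_at > self._max_age:
            return None  # stale: the caller should re-annotate via the API
        return annotation


now = [0.0]  # fake clock, in seconds
cache = VariantCache(max_age_seconds=30 * 24 * 3600, clock=lambda: now[0])
variant_id = 1234567890123456789  # placeholder value, not a real variant_id
cache.record(variant_id, "sample_A", {"classification": "placeholder"})
cached_now = cache.fresh_annotation(variant_id)
now[0] = 31 * 24 * 3600  # 31 days later
cached_later = cache.fresh_annotation(variant_id)
print(cached_now, cached_later, cache.seen_in(variant_id))
```

Keeping the "seen before" index separate from the annotation store means the sample history survives even when an annotation has expired and needs to be fetched again.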

Questions

We hope these guidelines are helpful to extract maximum value from the hugely powerful VarSome API. Do let us know if there’s any further information we can provide.