Requirements for submitted VCF files

VarSome Clinical (platform and installations) accepts VCF files that:

  1. Are compliant with the VCF standard.
  2. Include only specific SNVs and INDELs. In order to annotate a variant, we need to know exactly what that variant is, so we cannot handle cases where the variant's sequence isn't specified. For example, we cannot annotate "NON_REF" variants:
    #CHROM  POS       ID       REF  ALT
    chr1    10052     .        C    <NON_REF>

    Or variants with an "N" in the ALT field:

    #CHROM   POS       ID      REF  ALT
    chr22  30998425  .       C    CTTTTTNT
  3. Include a valid genotype (GT) field for each variant entry.
  4. Do not include Copy Number Variants (CNVs) or Structural Variants (SVs). A file containing SVs / CNVs would usually include a header like:
    ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
  5. Contain the variants found in a real human sample. We do not support gVCF files, which carry information on all sites in a sample genome, including those that don't differ from the reference. We expect a maximum of around 4 to 5 million variants per sample.
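
Requirements 2 to 4 above can be screened for before upload. The following is a minimal sketch, not part of VarSome Clinical: the function name `precheck_vcf` and its messages are hypothetical, and it assumes tab-separated columns with the FORMAT column in position 9, as the VCF standard specifies. It only checks that GT is listed in FORMAT, not that each genotype value is well formed.

```python
def precheck_vcf(lines):
    """Return (line_number, message) pairs for likely problems in VCF text lines.

    Hypothetical helper: a rough screen, not a full VCF validator.
    """
    problems = []
    for n, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # An SVTYPE INFO header usually signals SV / CNV content.
            if line.startswith("##INFO=") and "ID=SVTYPE" in line.replace(" ", ""):
                problems.append((n, "SVTYPE header: file may contain SVs or CNVs"))
            continue
        if line.startswith("#"):
            continue  # the #CHROM column header line
        fields = line.split("\t")
        if len(fields) < 5:
            problems.append((n, "fewer than 5 tab-separated columns"))
            continue
        for alt in fields[4].split(","):
            if alt.startswith("<"):
                # Symbolic alleles such as <NON_REF> have no explicit sequence.
                problems.append((n, "symbolic ALT allele " + alt))
            elif "N" in alt.upper():
                problems.append((n, "ALT allele contains N: " + alt))
        # Assumption: a usable record lists GT in FORMAT and has a sample column.
        if len(fields) < 10 or "GT" not in fields[8].split(":"):
            problems.append((n, "no GT field"))
    return problems
```

It could be called as `precheck_vcf(open("file.vcf"))`, where file.vcf stands for an uncompressed input file. An empty list means none of these particular problems were found, which does not guarantee that the file is fully compliant.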

❗Tip: checking the format of a VCF file

Checking that your VCF file is correctly structured before uploading it to VarSome Clinical is recommended: it can simplify your analyses and save valuable time.

An easy way to check that your VCF file is valid is to run a bcftools command on it. bcftools, a set of utilities that manipulate VCF files, is very sensitive to malformed VCFs, so it will fail if the file doesn't conform to the standard.

After installing bcftools according to the instructions, run the following command, where file.vcf represents your input VCF file:

bcftools norm -m -any -NO v file.vcf

In general, bcftools norm can check that REF alleles match the reference, split multiallelic sites into multiple rows, or recover multiallelics from multiple rows. With the options shown here, it splits multiallelic sites into multiple rows (-m -any), leaves indel normalization off (-N), and writes uncompressed VCF to standard output (-O v). If the fields in your file are complete, the command will finish without errors. However, if it comes across a non-compliant field like the following,

chr1    16366632        .       CC      GC,GT   193.02  PASS    AB=0.5;

the command will fail. In the row above, the allelic balance (AB) field is incomplete: this is a multiallelic site with two ALT alleles in a single row, so two values are expected but only one is present. bcftools reports this with an error message:

Error: wrong number of fields in INFO/AB at chr1:16366632, expected 2, found 1
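
The rule behind this error can be illustrated in a few lines of Python. This is a sketch under stated assumptions, not what bcftools runs internally: it assumes AB is declared in the header with Number=A (one value per ALT allele), which the error message above implies, and the function names and message wording are hypothetical.

```python
import re

def number_a_fields(header_lines):
    """Collect the IDs of INFO fields declared with Number=A in the header."""
    ids = set()
    for line in header_lines:
        if not line.startswith("##INFO="):
            continue
        field_id = re.search(r"ID=([^,>]+)", line)
        number = re.search(r"Number=([^,>]+)", line)
        if field_id and number and number.group(1).strip() == "A":
            ids.add(field_id.group(1).strip())
    return ids

def check_info_counts(record, per_alt_ids):
    """Report Number=A INFO fields whose value count differs from the ALT count."""
    fields = record.rstrip("\n").split("\t")
    n_alt = len(fields[4].split(","))
    errors = []
    for entry in fields[7].split(";"):
        if "=" not in entry:
            continue  # flag-type entries and empty trailing pieces carry no values
        key, value = entry.split("=", 1)
        found = len(value.split(","))
        if key in per_alt_ids and found != n_alt:
            # Message format is this sketch's own, not bcftools output.
            errors.append(
                f"INFO/{key} at {fields[0]}:{fields[1]}: expected {n_alt}, found {found}"
            )
    return errors
```

Applied to the row shown earlier, with ALT set to GC,GT and INFO set to AB=0.5;, the check finds two ALT alleles but a single AB value, which is the mismatch that bcftools complains about.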

Other alternatives to VCF validation:

which can be used to locate other types of errors (e.g. a malformed or missing header).