Requirements for submitted VCF files

VarSome Clinical accepts VCF files for SNP/INDEL and CNV annotation. You can upload VCFs that contain only SNPs/INDELs, only CNVs, or both types of variants.

If you upload mixed VCFs, they will be divided into two files: one file to annotate SNPs/small INDELs (*filtered.vcf.gz) and one file to annotate CNVs (*cnv.vcf.gz).

Required format for SNPs/INDELs annotation

VCFs containing SNPs and small INDELs can be used to launch a somatic or germline analysis (Launch analysis > New analysis > Germline/Somatic analysis from VCF).

VCFs uploaded for SNP/small INDEL analysis must meet the following requirements:

  1. Comply with the VCF standard.
  2. Include only fully specified SNVs and INDELs. To annotate a variant, we need to know exactly what that variant is, so we cannot handle records where the variant's sequence isn't specified. For example, we cannot annotate "NON_REF" variants:
    #CHROM    POS      ID     REF      ALT
    chr1 10052 . C   <NON_REF>     

    Or variants with an "N" in the ALT field:

    #CHROM      POS       ID    REF     ALT
    chr22  30998425  .     C    CTTTTTNT
  3. Include a valid genotype (GT) field for each variant entry.
  4. Contain the variants found in a real human sample. We expect at most around 4 to 5 million variants per sample.
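
Items 2 and 3 above can be approximated before upload with a few lines of Python. This is a minimal sketch, not VarSome Clinical's own validation: the function name `find_unsupported_snv_records` is hypothetical, the input is assumed to be tab-delimited, uncompressed VCF text, and only the ALT and FORMAT columns are inspected. In the usage example, the columns after ALT are placeholders added for illustration.

```python
def find_unsupported_snv_records(lines):
    """Return (line_number, reason) pairs for records this sketch flags.

    Assumption: lines are tab-delimited VCF text; header lines start with '#'.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 5:
            problems.append((number, "fewer than 5 columns"))
            continue
        for alt in fields[4].split(","):
            if alt.startswith("<") and alt.endswith(">"):
                # Symbolic alleles such as <NON_REF> carry no sequence
                problems.append((number, f"symbolic ALT allele {alt}"))
            elif "N" in alt.upper():
                problems.append((number, f"ALT allele {alt} contains N"))
        # FORMAT is the 9th column; it must list a GT key
        format_keys = fields[8].split(":") if len(fields) > 8 else []
        if "GT" not in format_keys:
            problems.append((number, "no GT key in FORMAT"))
    return problems


# Usage: the two unsupported records from the examples above, plus one plain SNV.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1",
    "chr1\t10052\t.\tC\t<NON_REF>\t.\t.\t.\tGT\t0/0",
    "chr22\t30998425\t.\tC\tCTTTTTNT\t.\t.\t.\tGT\t0/1",
    "chr1\t12345\t.\tA\tG\t.\tPASS\t.\tGT\t0/1",  # illustrative record
]
for number, reason in find_unsupported_snv_records(example):
    print(f"line {number}: {reason}")
```

The sketch only checks that a GT key is present; it does not verify that the genotype values themselves are valid.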

Required format for CNVs annotation

VCFs containing CNVs (deletions and duplications) can be used to launch a CNV subanalysis from VCF. 

VCFs uploaded for CNV annotation must meet the following requirements:

  1. Comply with the VCF standard.
  2. Include duplications and/or deletions, with the type of copy number variant given in the ALT field:
    ##ALT=<ID=DEL,Description="Deletion">

    ##ALT=<ID=DUP,Description="Duplication">

    Example of an accepted VCF with CNVs:

    #CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    SAMPLE1
    chr12    133040735    .    C    <DUP>    .    PASS    SVTYPE=DUP;SVLEN=140;END=133040875    GT:CN    0/1:1.50
    chr12    133049934    .    G    <DEL>    .    PASS    SVTYPE=DEL;SVLEN=78;END=133050012    GT:CN    0/1:0.50 
  3. Do not use the generic CNV category. According to the VCF specification, the CNV category should not be used when a more specific category (here, DEL or DUP) can be applied:

    ##ALT=<ID=CNV,Description="Copy Number Variant">

    Therefore, records like the following are not accepted:

    chrX	133559227	.	G	<CNV>	.	.	SVTYPE=CNV;SVLEN=140;END=133559366	GT:FC:CN	0/1:-1.82:1.10
  4. Include a valid genotype (GT) field for each variant entry.
  5. Do not include other types of structural variants (SVs), such as large chromosomal rearrangements (e.g. inversions, translocations) or gene fusions. These SV types are not currently supported.
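
The CNV requirements above can be pre-checked record by record. The following is a rough sketch under stated assumptions, not the validation VarSome Clinical performs: `check_cnv_record` and `ACCEPTED_CNV_ALTS` are hypothetical names, the record is assumed to be a tab-delimited line from a CNV-only VCF (in a mixed file, SNV/INDEL records would also be flagged), and only the ALT and FORMAT columns are examined.

```python
# Assumption: only these two symbolic alleles are treated as acceptable here
ACCEPTED_CNV_ALTS = {"<DEL>", "<DUP>"}


def check_cnv_record(line):
    """Return the reasons this sketch would flag a CNV record (empty list if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return ["fewer than 10 columns (FORMAT and a sample column are expected)"]
    reasons = []
    for alt in fields[4].split(","):
        if alt == "<CNV>":
            reasons.append("generic <CNV> allele; use <DEL> or <DUP>")
        elif alt not in ACCEPTED_CNV_ALTS:
            # e.g. <INV> or breakend notation for other SV types
            reasons.append(f"ALT allele {alt} is not <DEL> or <DUP>")
    if "GT" not in fields[8].split(":"):
        reasons.append("no GT key in FORMAT")
    return reasons


# Usage with the accepted <DUP> record and the rejected <CNV> record shown above
accepted = "chr12\t133040735\t.\tC\t<DUP>\t.\tPASS\tSVTYPE=DUP;SVLEN=140;END=133040875\tGT:CN\t0/1:1.50"
rejected = "chrX\t133559227\t.\tG\t<CNV>\t.\t.\tSVTYPE=CNV;SVLEN=140;END=133559366\tGT:FC:CN\t0/1:-1.82:1.10"
print(check_cnv_record(accepted))
print(check_cnv_record(rejected))
```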


❗ Tip: checking the format of a VCF file

Checking that your VCF file is correctly structured before uploading it to VarSome Clinical is recommended: it can facilitate your analyses and save valuable time.

An easy way to check that your VCF file is valid is to run a bcftools command on it. Bcftools, a set of utilities for manipulating VCF files, is very sensitive to malformed VCFs, so it will typically fail if the file doesn't conform to the standard.

After installing Bcftools according to the instructions, the following command can be executed, where file.vcf represents your input VCF file:

bcftools norm -m -any -NO v file.vcf

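In general, bcftools norm can check that REF alleles match the reference, split multiallelic sites into multiple rows, or recover multiallelics from multiple rows. With the options shown here, it splits multiallelic sites into separate rows (-m -any), skips indel normalization (-N), and writes uncompressed VCF output (-O v); checking REF alleles against the reference additionally requires a reference FASTA passed with -f. If the fields in your file are complete, the command will run without errors. However, if it comes across a non-compliant field like the following,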

chr1    16366632        .       CC      GC,GT   193.02  PASS    AB=0.5;

the command will fail. In the row above, the allelic balance (AB) field is incomplete: this is a multiallelic site with two ALT alleles in a single row, so two values are expected but only one is present. Bcftools reports this in an error message:

Error: wrong number of fields in INFO/AB at chr1:16366632, expected 2, found 1
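
The rule behind this error can be illustrated with a short Python sketch. It assumes that the file's header declares AB with Number=A (one value per ALT allele), which the error message suggests but the row itself does not show. The function names are hypothetical, the header line in the usage example is illustrative, and this is not how bcftools implements the check.

```python
import re

# Matches header lines such as ##INFO=<ID=AB,Number=A,...>
NUMBER_A_PATTERN = re.compile(r"^##INFO=<ID=([^,]+),Number=A[,>]")


def number_a_fields(header_lines):
    """Collect the IDs of INFO fields declared with Number=A."""
    ids = set()
    for line in header_lines:
        match = NUMBER_A_PATTERN.match(line)
        if match:
            ids.add(match.group(1))
    return ids


def check_per_allele_info(record, per_allele_ids):
    """Report Number=A INFO fields whose value count differs from the ALT count."""
    fields = record.rstrip("\n").split("\t")
    n_alt = len(fields[4].split(","))
    messages = []
    for entry in fields[7].split(";"):
        if "=" not in entry:
            continue  # flags and empty entries carry no values
        key, value = entry.split("=", 1)
        if key in per_allele_ids:
            found = len(value.split(","))
            if found != n_alt:
                messages.append(
                    f"INFO/{key} at {fields[0]}:{fields[1]}: expected {n_alt}, found {found}"
                )
    return messages


# Usage: an assumed header declaration plus the row from the example above
header = ['##INFO=<ID=AB,Number=A,Type=Float,Description="Allele balance">']
row = "chr1\t16366632\t.\tCC\tGC,GT\t193.02\tPASS\tAB=0.5;"
print(check_per_allele_info(row, number_a_fields(header)))
```

Supplying one AB value per ALT allele (two comma-separated values in this row) would satisfy the check in this sketch.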

Other VCF validation tools can also be used to locate other types of errors (e.g. a malformed or missing header).

Another quick test is to see whether a standard program like bcftools reads the file without complaining.
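
This quick test can be scripted. The sketch below is one possible approach, assuming bcftools is installed and on the PATH; the helper name `bcftools_accepts` is hypothetical. It runs `bcftools view` on the file, discards the records, and reports whether the command exited successfully. Some problems only produce warnings rather than a failing exit status, so a True result is not a guarantee of validity.

```python
import shutil
import subprocess


def bcftools_accepts(vcf_path):
    """Return True/False from bcftools' exit status, or None if bcftools is not installed."""
    if shutil.which("bcftools") is None:
        return None  # cannot run the test without bcftools on PATH
    result = subprocess.run(
        ["bcftools", "view", vcf_path],
        stdout=subprocess.DEVNULL,  # the records themselves are not needed
        stderr=subprocess.PIPE,     # keep error messages for the user
        text=True,
    )
    if result.returncode != 0:
        print(result.stderr.strip())
    return result.returncode == 0


# Usage: file.vcf stands for your input VCF file
status = bcftools_accepts("file.vcf")
print("bcftools not found" if status is None else f"accepted: {status}")
```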