VarSome Clinical accepts VCF files containing only CNVs or a mix of CNVs and other SVs. The VCFs may contain the following types of variants:
- CNVs: deletion and duplication
- Insertions
- Inversions
- Breakends
- Repeat expansions
For more information about SV annotation from VCF files, please refer to the document "SV annotation (from VCF)".
Users may optionally upload an alignment BAM file for the VCF sample, which can be used to visualize the coverage of the variants provided in the VCF file.
The VCF files should conform to the VCF standard, regardless of the sequencing platform.
Required format for SNP/INDEL annotation
VCFs containing SNPs and small INDELs can be used to launch a somatic or germline analysis (Launch analysis > New analysis > Germline/Somatic analysis from VCF).
VCFs uploaded for the analysis of SNPs/small INDELs must meet the following requirements:
- Are compliant with the VCF standard.
- Include only specific SNVs and INDELs. In order to annotate a variant, we need to know exactly what that variant is, so we cannot handle cases where the variant's sequence isn't specified. For example, we cannot annotate "NON_REF" variants:
#CHROM POS ID REF ALT
chr1 10052 . C <NON_REF>
Or variants with an "N" in the ALT field:
#CHROM POS ID REF ALT
chr22 30998425 . C CTTTTTNT
- Include a valid genotype (GT) field for each variant entry.
- The files should contain the variants found in a real human sample. We expect a maximum of around 4 or 5 million variants in a sample.
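The requirements above can be pre-checked before upload. The following is a minimal Python sketch, not part of VarSome Clinical: the function name check_small_variant_line is hypothetical, and it assumes that a "specific" ALT allele consists only of A, C, G and T and that a GT value is acceptable when it is syntactically well formed. It reads one tab-separated data line and reports the problems it finds. The sample rows reuse the two examples above, padded with placeholder columns so that they form complete data lines.

```python
import re

# Assumption: an ALT allele is "specific" when it consists only of A, C, G, T.
SPECIFIC_ALLELE = re.compile(r"[ACGT]+", re.IGNORECASE)
# Assumption: a GT value is well formed when it is a list of allele indices
# (or "." for a missing call) separated by "/" or "|".
GENOTYPE = re.compile(r"(\.|\d+)([/|](\.|\d+))*")


def check_small_variant_line(line):
    """Return a list of problems found in one tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return ["no FORMAT and sample columns, so no GT field"]
    chrom, pos, alt = fields[0], fields[1], fields[4]
    format_keys = fields[8].split(":")
    sample_values = fields[9].split(":")
    problems = []
    # Every ALT allele must spell out its sequence (no <NON_REF>, no N).
    for allele in alt.split(","):
        if not SPECIFIC_ALLELE.fullmatch(allele):
            problems.append(f"{chrom}:{pos}: ALT allele {allele!r} is not a specific sequence")
    # The sample column must carry a GT value.
    if "GT" not in format_keys:
        problems.append(f"{chrom}:{pos}: no GT key in FORMAT")
    else:
        index = format_keys.index("GT")
        if index >= len(sample_values) or not GENOTYPE.fullmatch(sample_values[index]):
            problems.append(f"{chrom}:{pos}: malformed GT value")
    return problems


if __name__ == "__main__":
    # Placeholder QUAL/FILTER/INFO/FORMAT/sample columns added for illustration.
    rows = [
        "chr1\t10052\t.\tC\t<NON_REF>\t.\t.\t.\tGT\t0/1",
        "chr22\t30998425\t.\tC\tCTTTTTNT\t.\t.\t.\tGT\t0/1",
    ]
    for row in rows:
        print(check_small_variant_line(row))
```

A check like this only covers the ALT and GT requirements; it does not replace validation against the VCF standard.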
Required format for SV annotation
VCFs containing CNVs (deletions and duplications) and other SVs (insertions, inversions and breakends) can be used to launch an SV sub-analysis from VCF.
VCFs uploaded for SV annotation must meet the following requirements:
- Are compliant with the VCF standard.
- Include duplications and/or deletions, with the type of copy number variant shown in the ALT field, or insertions, inversions and/or breakends:
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INS,Description="Insertion">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=BND,Description="Breakend">
Example of an accepted VCF with CNVs and other SVs:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1
chr12 133040735 . C <DUP> . PASS SVTYPE=DUP;SVLEN=140;END=133040875 GT:CN 0/1:1.50
chr12 133049934 . G <DEL> . PASS SVTYPE=DEL;SVLEN=78;END=133050012 GT:CN 0/1:0.50
chrX 127845659 . N <INV> 60 GT IMPRECISE;SVTYPE=INV;SVLEN=59;END=127845718;SUPPORT=5 GT:GQ:DR:DV 0/0:36:31:5
chr1 4939486 . T <INS> 406 PASS END=4939486;SVTYPE=INS;CIPOS=0,11;CIEND=0,11;HOMLEN=11;HOMSEQ=GATATCAATAT;LEFT_SVINSSEQ=GATATCAATATTCTCCATATGACTTCAGTGTCCTCCATATGACATCAATATCCTCCATATGATGTCAATATCTTC;RIGHT_SVINSSEQ=GTATGATGTCAATATCCTCCATATGATGTCAACATCATCCATATGATTTCAGTGTCCTCCGTATGATGTCAATGTCCTCCATAA GT:FT:GQ:PL:PR:SR 1/1:PASS:39:459,42,0:0,2:0,22
chr6 9500791 . N N[chr14:82674769[ 60.0 PASS PRECISE;SVTYPE=BND;SUPPORT=10;RNAMES=575d2f99-ea85-4dd3-a8f6-905e82d20947,ec3d723e-a783-4ee6-8fd2-ce44b4dddbf7,9c0f1346-ce9d-4d1a-b35b-9a09fda784d1,5ed8794a-9c95-4dc8-90e5-ca76641d6fd6,784251f9-d4f1-4782-ae3b-cca7c87abe95,62559b4a-d107-4f9e-8cff-e2762b068338,6b2fa5e8-6baa-4be6-a1e6-e8a7d0373cde,c452c1f0-9dd9-496b-bd92-812ce84919e1,b7a4d531-de79-45ec-84b6-8967b927c7ec,a0984676-e8c5-4314-b9a2-b7919ded8ccd;COVERAGE=0,0,40,40,40;STRAND=-;AF=0.25;CHR2=chr14;STDEV_POS=0;ANN=N[CHR14:82674769[|transcript_ablation|HIGH|LINC02301|LINC02301|transcript|NR_146650.1|pseudogene||t(6%3B14)(%3B)(n.*12363966)|t(6%3B14)(%3BNR_146650.1:null)||||| GT:GQ:DR:DV 0/1:16:30:10
- According to the VCF Specification, the CNV category should not be used when a more specific category can be applied.
##ALT=<ID=CNV,Description="Copy Number Variant">
Therefore, the following VCF format is not accepted:
chrX 133559227 . G <CNV> . . SVTYPE=CNV;SVLEN=140;END=133559366 GT:FC:CN 0/1:-1.82:1.10
- Include a valid genotype (GT) field for each variant entry.
- Do not include other types of SV variants such as large chromosomal rearrangements (e.g. inversions, translocations) or gene fusions. We currently do not support these types of SV variants.
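The ALT rules above can be illustrated with a minimal Python sketch. It is not VarSome Clinical's own check: the function name classify_sv_alt and the three labels it returns are hypothetical. It accepts the four symbolic alleles listed above and breakend notation as defined in the VCF specification, rejects the generic <CNV> allele, and labels everything else "unknown" rather than guessing whether it is supported.

```python
import re

# Symbolic ALT alleles listed in the requirements above.
ACCEPTED_SYMBOLIC = {"<DEL>", "<DUP>", "<INS>", "<INV>"}

# Breakend notation from the VCF specification:
#   t[p[ or t]p]  (bases first, then the mate position in brackets)
#   ]p]t or [p[t  (mate position in brackets first, then bases)
BND_BASES_FIRST = re.compile(r"[ACGTN.]+([\[\]])[^\[\]]+:\d+\1", re.IGNORECASE)
BND_MATE_FIRST = re.compile(r"([\[\]])[^\[\]]+:\d+\1[ACGTN.]+", re.IGNORECASE)


def classify_sv_alt(alt):
    """Label one ALT value as 'accepted', 'rejected' or 'unknown'."""
    if alt == "<CNV>":
        # A more specific category (<DEL> or <DUP>) should be used instead.
        return "rejected"
    if alt in ACCEPTED_SYMBOLIC:
        return "accepted"
    if BND_BASES_FIRST.fullmatch(alt) or BND_MATE_FIRST.fullmatch(alt):
        return "accepted"
    # Anything else is outside the scope of this sketch.
    return "unknown"


if __name__ == "__main__":
    for alt in ["<DUP>", "<DEL>", "N[chr14:82674769[", "<CNV>"]:
        print(alt, classify_sv_alt(alt))
```

The sketch looks only at the ALT column; the GT requirement still has to be checked separately.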
❗ Tip: checking the format of a VCF file
Ensuring that your VCF file is structured correctly and ready to be uploaded to VarSome Clinical is a recommended practice that could facilitate your analyses and save valuable time.
An easy way to check that your VCF file is valid is to try to run a bcftools command on it. Bcftools, a set of utilities that manipulate VCF files, is very sensitive to malformed VCFs, so it will fail if the file doesn't conform to the standard.
After installing Bcftools according to the instructions, the following command can be executed, where file.vcf represents your input VCF file:
bcftools norm -m -any -NO v file.vcf
bcftools norm can check that REF alleles match the reference, split multiallelic sites into multiple rows, or recover multiallelics from multiple rows. With the options above, it splits multiallelic sites into separate rows (-m -any), skips normalization (-N), and writes uncompressed VCF (-O v). If the fields in your file are complete, the command will run without errors. However, if it comes across a non-compliant field like the following,
chr1 16366632 . CC GC,GT 193.02 PASS AB=0.5;
the command will fail. In the row above, the allelic balance (AB) field is incomplete: this is a multiallelic site with two ALT alleles in a single row, so two values are expected but only one is given. The error message reports this:
Error: wrong number of fields in INFO/AB at chr1:16366632, expected 2, found 1
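The reasoning behind this error can be reproduced with a minimal Python sketch. It assumes that AB is declared in the file header as a field with one value per ALT allele (Number=A), which the example above implies but does not show. The function name check_per_alt_info is hypothetical, and the messages it builds are this sketch's own, not bcftools output.

```python
def check_per_alt_info(line, per_alt_keys):
    """Compare the value count of per-ALT INFO fields with the ALT allele count.

    line: one tab-separated VCF data line.
    per_alt_keys: INFO keys assumed to be declared with Number=A in the header.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, alt, info = fields[0], fields[1], fields[4], fields[7]
    expected = len(alt.split(","))  # one value expected per ALT allele
    errors = []
    for entry in info.split(";"):
        if "=" not in entry:
            continue  # flags and empty entries carry no values
        key, value = entry.split("=", 1)
        if key in per_alt_keys:
            found = len(value.split(","))
            if found != expected:
                errors.append(
                    f"INFO/{key} at {chrom}:{pos}: expected {expected}, found {found}"
                )
    return errors


if __name__ == "__main__":
    row = "chr1\t16366632\t.\tCC\tGC,GT\t193.02\tPASS\tAB=0.5;"
    print(check_per_alt_info(row, {"AB"}))
```

In a real file the set of per-ALT keys would come from the ##INFO header lines rather than being passed in by hand.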
Other tools for VCF validation, which can be used to locate other types of errors (e.g. a malformed or missing header):
- https://github.com/EBIvariation/vcf-validator
- http://vcftools.sourceforge.net/perl_module.html#vcf-validator
Another quick test is to just see if a standard program like bcftools recognizes the file and doesn't complain.