CNV calling with WES or targeted panel data

VarSome Clinical currently offers Copy Number Variation (CNV) analysis for both Whole Genome Sequencing (WGS) and Whole Exome Sequencing (WES) or targeted panel data. For the non-WGS analyses we use ExomeDepth, a CNV caller based on a read depth approach. To accurately detect CNVs, ExomeDepth requires at least five samples (ideally around ten) that will be run as a cohort with each sample analyzed, using the rest as a pool to select reference samples.

Why are reference samples necessary in CNV calling for non-WGS samples?

The read depth approaches used for CNV calling in WGS usually assume that reads distribute in a more or less uniform way across the genome and, therefore, the differences in read depth are used to identify CNVs. However, this assumption fails in the context of WES and targeted sequencing. One of the main reasons is that the probes used for capturing the different targeted regions have variable specificity and efficiency depending on the region. This fact introduces strong biases in the number of mapped reads per region that hamper the CNV detection. ExomeDepth requires multiple samples because it uses them to control the biases given by the extensive variability in capture efficiency across exons and/or target areas. 

What characteristics must the reference samples have?

For optimal results, the reference set of samples must have the following characteristics in common with the test sample (sample of interest):

  • Samples should be prepared with the same library protocol and sequenced by the same sequencing platform.
  • All samples (the test one and the reference) should all have been generated as part of the same sequencing batch. It is possible to use samples generated in different batches but the resulting CNV calls are likely to be much less accurate.
  • Samples should originate from individuals unrelated to each other. For example, if samples come from the same family, related individuals should be excluded.
  • For CNV calls in sex chromosomes, all samples should be of the same sex (either all male or all female). If  they are not all of the same sex, calls on those chromosomes will not be reliable.

How are reference samples used by ExomeDepth

Each sample given as input for ExomeDepth analysis will be taken to call CNVs on it by using a selection of the remaining as reference samples. This means that, when running a CNV analysis with ExomeDepth, you will get calls for all of the input samples, no matter if you consider the sample as test or reference.

Another important key to bear in mind is that every input sample might not be compared to all other samples. Each input sample is compared against an optimized set of reference samples that are well correlated with it. The first step of the CNV calling process is to construct the reference set of samples. To do this, ExomeDepth takes one of the input samples and ranks the remaining by order of coverage correlation with the first sample. Then, the remaining samples are sequentially added to the reference set. After the addition of one sample to the reference set, a statistical calculation is performed to see how good is the current reference set to predict CNVs on the test sample. The addition of samples to the reference set stops when it is unable to improve the reference set power to predict CNVs. Therefore, using a high number of samples for CNV analysis does not necessarily increase the accuracy of the results because: 

  1. Not all available samples are included in the reference set, which means that not all the samples are used as reference for calling CNVs in the test sample.
  2. Some of the CNVs present in the test sample can be missed if they are shared between the test and the reference samples.

The reference selection process is automated in the analytical pipeline implemented in VarSome Clinical and does not require any additional steps by the user. VarSome Clinical SV pipeline will analyse all samples of the cohort successively and generate CNV calls for all.

ExomeDepth authors estimate that the optimum size of the reference set is ~10. Adding further samples in the reference set actually might decrease the power.

 

References

Plagnol V, Curtis J, Epstein M, Mok KY, Stebbings E, Grigoriadou S, et al. A robust model for read count data in exome sequencing experiments and implications for copy number variant calling. Bioinformatics. 2012;28(21):2747–2754.