Data Summary

Number of COGVIC genes across the 24 chromosomes

Number of COGVIC cases per gene

Data Curation

The COGVIC server was developed with the MediaWiki package, which is widely used to build wiki applications, including Wikipedia. All variant annotation pages were generated with the automated annotation pipeline described below (and shown briefly in the figure on the right), and they can be updated only by a COGVIC administrator.

We developed a variant identification workflow, the "COGVIC (Variants and Genetics in Cancer) analysis pipeline", to obtain our list of East Asian cancer-causing germline mutations. Data quality control and pre-processing. We performed quality control checks on all raw sequence data with FastQC and performed adapter trimming and quality filtering with the Trim Galore software. The filtering and trimming parameters were as follows: -q 25 --phred33 --length 36 --stringency 3 --paired. We then re-ran the quality control checks on the cleaned data.
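The trimming step above can be sketched as a command builder. This is a minimal illustration, not the authors' script: the paired-end FASTQ file names are hypothetical, and only the flag values quoted in the text are assumed.

```python
# Sketch of the Trim Galore invocation described above. The input file
# names (sample_R1/R2.fastq.gz) are hypothetical placeholders; the flag
# values are taken from the text.
import subprocess

def trim_galore_cmd(r1: str, r2: str) -> list[str]:
    """Build the Trim Galore command line with the listed parameters."""
    return [
        "trim_galore",
        "-q", "25",           # trim low-quality ends below Q25
        "--phred33",          # Sanger / Illumina 1.9+ quality encoding
        "--length", "36",     # discard reads shorter than 36 bp after trimming
        "--stringency", "3",  # minimum adapter overlap for trimming
        "--paired",           # paired-end mode: keep read pairs in sync
        r1, r2,
    ]

cmd = trim_galore_cmd("sample_R1.fastq.gz", "sample_R2.fastq.gz")
# subprocess.run(cmd, check=True)  # uncomment where Trim Galore is installed
```

The actual execution line is left commented out so the sketch stays runnable without the tool installed.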

DNA data alignment and variant identification. For the genomic data derived from WGS-based, WES-based, and target-capture libraries, alignment to the hg19/GRCh37 genome was performed with Bowtie2 using the default settings [8]. After stringent quality assessment and data filtering, reads consisting of Q20 bases (base quality greater than 20) were selected as high-quality reads for further analysis. Variants were then detected with VarScan at a p-value threshold of 0.01 [9]. RNA data alignment and variant identification. STAR [10] was used to align the RNA-seq data to the GRCh38/hg38 genome, and variants were then detected with GATK (version 3.6.0).
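The Q20 read selection can be illustrated with a small helper that decodes Phred+33 quality strings. This is a sketch under an assumption: the text does not state whether every base or only a fraction must exceed Q20, so the all-bases rule below is illustrative.

```python
# Minimal sketch of Q20 read selection. Assumption: a read is "high
# quality" only if every base quality exceeds Q20 (the exact rule is
# not specified in the text).
def phred33_quals(qual_string: str) -> list[int]:
    """Decode an ASCII Phred+33 quality string into integer scores."""
    return [ord(c) - 33 for c in qual_string]

def is_high_quality(qual_string: str, min_q: int = 20) -> bool:
    """True if every base quality is strictly greater than min_q."""
    return all(q > min_q for q in phred33_quals(qual_string))

# 'I' encodes Q40; '#' encodes Q2
good = is_high_quality("IIII")   # all bases Q40
bad = is_high_quality("II#I")    # one Q2 base fails the read
```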

Variant quality control. For each alignment result, an indexed BAM file was generated with SAMtools. Each BAM file was provided as input to each variant caller to generate a VCF file of unfiltered variant calls. To remove low-quality variants, genotypes were required to have DP > 6, and all variants with quality scores below 40 were removed. To filter the calls further, GATK was run through its variant quality score recalibration (VQSR) steps, as documented here. Variants that passed both quality filters, i.e., those flagged as PASS in the GATK VQSR filter, were used in the subsequent steps, such as annotation and clinical interpretation.
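The combined quality filters (DP > 6, quality score of at least 40, and a PASS flag from VQSR) can be sketched on minimal dict-shaped records standing in for parsed VCF lines; the field names mirror standard VCF columns but the records themselves are illustrative.

```python
# Sketch of the variant quality filters described above, applied to
# minimal dicts standing in for parsed VCF records (illustrative data).
def passes_quality_filters(rec: dict) -> bool:
    """Keep a variant only if depth, quality, and VQSR status all pass."""
    return rec["DP"] > 6 and rec["QUAL"] >= 40 and rec["FILTER"] == "PASS"

variants = [
    {"DP": 30, "QUAL": 55.0, "FILTER": "PASS"},  # kept
    {"DP": 5,  "QUAL": 90.0, "FILTER": "PASS"},  # removed: low depth
    {"DP": 12, "QUAL": 38.0, "FILTER": "PASS"},  # removed: low quality
    {"DP": 25, "QUAL": 60.0,
     "FILTER": "VQSRTrancheSNP99.90to100.00"},   # removed: failed VQSR
]
kept = [v for v in variants if passes_quality_filters(v)]
```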

Variant filtration & annotation. The annotation step was conducted with a single ANNOVAR [11] command per subject (annovar/humandb -buildver hg19 -remove -protocol refGene,avsnp150,1000g2015aug_all,1000g2015aug_eas,gnomad_exome_20190125,clinvar_20181225 -operation g,f,f,f,f,f). As common variants are unlikely to be disease-causing, we used ANNOVAR to filter the variants by minor allele frequency (MAF). All filtering and annotation criteria are listed below:

  1. Variants with MAF >1% in the East Asian population of the gnomAD project (v2.1, 2018 release) were discarded.
  2. Synonymous, non-splicing, or non-exonic variants were discarded.
  3. Damaging missense mutations were defined as those called deleterious by at least two of the following criteria, based on several function-prediction models: SIFT (Sorting Intolerant From Tolerant) score ≤0.05, PolyPhen-2 (HDIV) score ≥0.95, MutationAssessor score ≥2, Phred-scaled CADD (Combined Annotation-Dependent Depletion) score ≥15, placental-mammal PhyloP score ≥2.4, and vertebrate PhyloP score ≥4.
  4. Because we incorporated variants from clinical databases, such as ClinVar, COSMIC, TAGA, and OMIM, variants associated with a phenotype (such as a disease or a risk factor for a cancer-related disease) were kept in the final mutation list.
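Criteria 1-3 above can be sketched as a filter function. This is an illustration, not the pipeline code: the field names (e.g. "maf_eas", "sift") and consequence labels are hypothetical stand-ins for ANNOVAR's output columns, and criterion 4 (the clinical-database rescue) is not modeled here.

```python
# Sketch of filtering criteria 1-3. Field names and consequence labels
# are illustrative placeholders, not ANNOVAR's exact column names.
def damaging_predictor_count(v: dict) -> int:
    """Count how many of the six predictors call the variant deleterious."""
    checks = [
        v.get("sift", 1.0) <= 0.05,            # SIFT
        v.get("polyphen2_hdiv", 0.0) >= 0.95,  # PolyPhen-2 (HDIV)
        v.get("mutation_assessor", 0.0) >= 2,  # MutationAssessor
        v.get("cadd_phred", 0.0) >= 15,        # Phred-scaled CADD
        v.get("phylop_mammal", 0.0) >= 2.4,    # placental-mammal PhyloP
        v.get("phylop_vertebrate", 0.0) >= 4,  # vertebrate PhyloP
    ]
    return sum(checks)

def keep_variant(v: dict) -> bool:
    """Apply criteria 1-3: rare, exonic/splicing, and (if missense) damaging."""
    if v["maf_eas"] > 0.01:                    # criterion 1: common in EAS
        return False
    if v["consequence"] in {"synonymous", "intronic", "intergenic"}:
        return False                           # criterion 2
    if v["consequence"] == "missense":
        return damaging_predictor_count(v) >= 2  # criterion 3
    return True                                # e.g. stopgain, splicing

rare_damaging = keep_variant({
    "maf_eas": 0.0001, "consequence": "missense",
    "sift": 0.01, "cadd_phred": 28.0,          # 2 of 6 predictors fire
})
```

In the full pipeline, a variant failing these criteria could still be rescued by criterion 4 if a clinical database links it to a relevant phenotype.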