# Variant Annotation and Filtering
Note
Hard-coded filter thresholds are viewable and editable in the configuration files for:
- Exomes
- Genomes
Be aware
- These components of the pipeline are subject to constant change.
- Users should be aware of the pitfalls and challenges of filtering somatic variant calls, which are not further discussed here.
# Somatic SNVs and Indels
Variant-level annotation, filtering, and flagging of variants with further filter flags occur in the `SomaticCombineChannel` and `SomaticAnnotateMaf` processes. Variants that pass the somatic scoring models intrinsic to the callers (`FILTER="PASS"` in the VCF files) are combined into a union, giving precedence to MuTect2 for any site where both callers detected a variant.
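The combination step can be pictured with a minimal Python sketch. The dictionary layout, the `(chrom, pos, ref, alt)` key, and the function name `combine_pass_calls` are illustrative assumptions, not the pipeline's actual implementation.

```python
from typing import Dict, List, Tuple

# A call is keyed by (chrom, pos, ref, alt). This key and the dict layout are
# assumptions made for illustration only.
SiteKey = Tuple[str, int, str, str]


def combine_pass_calls(mutect2: List[dict], strelka2: List[dict]) -> Dict[SiteKey, dict]:
    """Return the union of PASS calls; MuTect2 wins where both callers report a site."""
    combined: Dict[SiteKey, dict] = {}
    # Strelka2 records go in first so that MuTect2 records overwrite shared sites.
    for caller, calls in (("Strelka2", strelka2), ("MuTect2", mutect2)):
        for call in calls:
            if call["filter"] != "PASS":
                continue  # only calls passing the caller's own model are kept
            key = (call["chrom"], call["pos"], call["ref"], call["alt"])
            combined[key] = {**call, "caller": caller}
    return combined
```

Inserting the lower-precedence caller first keeps the precedence rule in one place: the later assignment simply replaces the earlier one.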
The functional effect of variants is predicted with VEP, run through vcf2maf, which also converts the VCF into a tab-delimited MAF file. See the notes on use of preferred transcript isoforms and VEP annotation outputs.
The following columns are added to the final MAF file, in addition to those added during the VEP annotation:
- `Strelka2FILTER`: Indicates that Strelka2 detected the variant but did not classify it as a somatic variant.
- `gnomAD_FILTER`: Indicates that the variant was detected in the gnomAD workflow, but ultimately not classified as a germline variant. Note that this is not used in the current filtering schema.
- `RepeatMasker` and `EncodeDacMapability`: The variant locus is in a repeat, low-mappability, or hard-to-sequence region. More details in the reference file description.
- `PoN`: Panel of normals, the number of normal samples in which the variant was detected. More details on the implementation for exomes. Under development for genomes, currently applies the exome PoN.
- `Ref_Tri`: Trinucleotide context of SNVs, normalized to pyrimidine-to-purine transversions.
- gnomAD allele frequencies:
  - Exomes, columns named `non_cancer_*`: Allele counts (`AC`) and frequencies (`AF`) for the variant in the non-TCGA population and sub-populations of gnomAD.
  - Genomes, columns named (`AC_*`|`AF_*`): Allele counts and frequencies for the variant in populations and sub-populations of gnomAD.
- Raw allele counts: total and strand-specific (`*_fwd` and `*_rev`) allele `count` and `depth` for tumor (`t_*`) and normal (`n_*`). These are unfiltered values, in contrast to those from the variant callers, generated by GetBaseCountsMultiSample.
- `alt_bias`: For variants with a raw depth of at least 5 reads, this is true if all raw variant-supporting reads are on either the forward or reverse strand.
- `ref_bias`: For variants with a raw depth of at least 5 reads, this is true if all raw reads are on either the forward or reverse strand.
- Mutation hotspots:
  - `snv_hotspot`: SNV hotspots.
  - `threeD_hotspot`: 3D hotspots, in contrast to the above-mentioned "linear" hotspots.
  - `indel_hotspot` and `indel_hotspot_type`: In-frame indel hotspots, and an indication of whether the variant overlaps a prior indel hotspot locus (`prior`) or overlaps an SNV hotspot (`novel`).
  - `Hotspot`: `TRUE` if either a linear SNV or indel hotspot.
- OncoKB annotation:
  - `mutation_effect` and `oncogenic`: Indicate the functional effect of the mutation and whether it is deemed to be oncogenic.
  - `LEVEL_*` and `Highest_level`: Indicate whether there is any drug at the given level of actionability and which is the highest level of actionability, if any. Note that this is cancer-type agnostic in the current implementation.
  - `citations`: References for OncoKB annotation.

  Be aware: Because TEMPO is blind to the cancer type (`ONCOTREE_CODE`) when running, Level 1 and Level 2A cannot be annotated, so the highest level will be Level 2B. You will need to re-run `oncokb_annotator/MafAnnotator.py` with this information to get detailed OncoKB levels.
- Variant caller metadata (development feature, subject to change):
  - MuTect2¹:
    - `MBQ`: Median base quality, comma-separated for reference and alternate allele.
    - `MFRL`: Median fragment length, comma-separated for reference and alternate allele.
    - `MMQ`: Median mapping quality, comma-separated for reference and alternate allele.
    - `MPOS`: Median distance of variant from end of read.
    - `OCM`: Number of reads whose original alignment does not match the reference.
    - `RPA`: If tandem repeat, number of times repeated (can be comma-separated for reference and alternate allele).
    - `STR`: Boolean, indicating that variant is a short tandem repeat.
    - `ECNT`: Number of events in haplotype.
  - Strelka2²:
    - `MQ`: Root mean square mapping quality.
    - `SNVSB`: Strand bias for somatic SNVs.
    - `FDP`: Number of basecalls filtered from original read depth for tier 1 read counts, for tumor (`t_FDP`) and normal (`n_FDP`).
    - `SUBDP`: Number of reads below tier 1 mapping-quality threshold aligned across site, for tumor (`t_SUBDP`) and normal (`n_SUBDP`).
    - `RU`: If indel, smallest repeating sequence unit in inserted or deleted sequence.
    - `IC`: If indel, number of times `RU` is repeated in variant.
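The `alt_bias` and `ref_bias` columns in the list above both reduce to the same check on strand-specific raw counts. The sketch below is one possible reading of that rule. The function name, the parameter names, and the choice to return `False` when no reads are present at all are assumptions.

```python
def all_on_one_strand(fwd: int, rev: int, raw_depth: int, min_depth: int = 5) -> bool:
    """True if every counted read lies on a single strand.

    fwd / rev  : raw read counts on the forward and reverse strand
    raw_depth  : total raw depth at the site (the check needs at least min_depth)
    """
    if raw_depth < min_depth:
        return False  # too shallow to call a bias
    if fwd + rev == 0:
        return False  # assumption: no reads means no bias call
    return fwd == 0 or rev == 0


# alt_bias would use the variant-supporting counts, ref_bias the total counts.
alt_bias = all_on_one_strand(fwd=6, rev=0, raw_depth=40)
ref_bias = all_on_one_strand(fwd=22, rev=18, raw_depth=40)
```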
The `FILTER` column in the unfiltered MAF file contains either `PASS` or any semicolon-separated combination of the following filter flags:
- `part_of_mnv`: The variant is likely part of another called multi-nucleotide variant (MNV).
- `multiallelic2`: Multiallelic loci, likely artifact. For variants called by Strelka2. The `2` is added due to the presence of a `multiallelic` flag in the MuTect2 VCFs.
- `strand_bias`: Variants likely artifactual due to strand bias:
  - For variants called by MuTect2, if all supporting reads come from one strand and there are at least 10 reads on both strands in either the normal or tumor sample.
  - For variants called by Strelka2, if the total alternate read count is above 10 and all of these fall on either strand; or a low mapping-quality variant suffering from bias in both supporting reads and total reads.
- `caller_conflict`: Variant was detected by both callers, but did not pass Strelka2's thresholds for somatic variant calling.
- The following read depth-based flags are parameterized according to the sequencing platform, see the `exome.config` and `genome.config` files:
  - `low_vaf`: Variant falls below lower threshold for tumor variant allele fraction (VAF).
  - `low_t_depth`: Variant falls below lower threshold for total depth in the tumor.
  - `low_t_alt_count`: Variant falls below lower threshold for reads supporting variant allele in tumor.
  - `low_n_depth`: Variant falls below lower threshold for total depth in normal.
  - `high_n_alt_count`: Variant exceeds upper threshold for reads supporting variant allele in normal.
- `mappability`/`repeatmasker`: Variant falls in blacklisted genomic region.
- `high_gnomad_pop_af`: Variant exceeds upper threshold for allele frequency in gnomAD.
- `PoN`: Variant exceeds upper threshold for count in panel of normals.
- `low_mapping_quality`: For indels called by Strelka2, variant falls below lower mapping quality threshold.
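As a rough illustration of how the read depth-based flags and the semicolon-separated `FILTER` value fit together, consider the sketch below. All threshold values and the names `THRESHOLDS`, `depth_flags`, and `filter_column` are hypothetical. The real values are set per platform in `exome.config` and `genome.config`.

```python
from typing import Dict, List

# Hypothetical thresholds, for illustration only.
THRESHOLDS: Dict[str, float] = {
    "min_tumor_vaf": 0.05,
    "min_tumor_depth": 20,
    "min_tumor_alt_count": 3,
    "min_normal_depth": 10,
    "max_normal_alt_count": 3,
}


def depth_flags(t_depth: int, t_alt_count: int, n_depth: int, n_alt_count: int,
                thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the read depth-based flags that apply to one variant."""
    flags: List[str] = []
    t_vaf = t_alt_count / t_depth if t_depth > 0 else 0.0
    if t_vaf < thresholds["min_tumor_vaf"]:
        flags.append("low_vaf")
    if t_depth < thresholds["min_tumor_depth"]:
        flags.append("low_t_depth")
    if t_alt_count < thresholds["min_tumor_alt_count"]:
        flags.append("low_t_alt_count")
    if n_depth < thresholds["min_normal_depth"]:
        flags.append("low_n_depth")
    if n_alt_count > thresholds["max_normal_alt_count"]:
        flags.append("high_n_alt_count")
    return flags


def filter_column(flags: List[str]) -> str:
    """Join flags with semicolons, or report PASS when none apply."""
    return ";".join(flags) if flags else "PASS"
```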
¹ See the MuTect2 documentation for more information: https://software.broadinstitute.org/gatk/documentation/article?id=11005

² See the Strelka2 documentation for more information: https://github.com/Illumina/strelka/blob/v2.9.x/docs/userGuide/README.md
# Whitelisting
Mutational hotspots, where the value in `Hotspot` is `TRUE`, are retained in the filtered MAF file if they:
- Are flagged with `low_vaf` but the tumor VAF is at least 0.02.
- Are flagged with `low_mapping_quality`, `low_t_depth`, or `strand_bias`.

Note: Combinations of the above filter flags result in filtering of the variant.
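The whitelisting rule can be sketched as a single predicate. This reads the note above as "a hotspot is rescued only when exactly one whitelistable flag is present", which is an interpretation rather than a confirmed detail. The function and constant names are hypothetical.

```python
# Flags that a hotspot variant may carry on its own and still be retained.
WHITELISTABLE = {"low_mapping_quality", "low_t_depth", "strand_bias"}
MIN_WHITELIST_VAF = 0.02  # lower VAF bound for rescuing a low_vaf hotspot


def keep_in_filtered_maf(filter_value: str, hotspot: bool, t_vaf: float) -> bool:
    """Decide whether a variant is retained in the filtered MAF."""
    if filter_value == "PASS":
        return True
    if not hotspot:
        return False
    flags = filter_value.split(";")
    if len(flags) != 1:
        return False  # assumption: any combination of flags is filtered
    flag = flags[0]
    if flag == "low_vaf":
        return t_vaf >= MIN_WHITELIST_VAF
    return flag in WHITELISTABLE
```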
# Clonality and Zygosity Analyses
# Clonality
Clonality of SNVs and indels is estimated based on prior literature using facets-suite. The cancer-cell fraction (CCF) annotation (columns `ccf_*`) contains these estimates for three presumed copy-number configurations of the mutation:
- Inferred CCF if mutation exists in number of copies expected from observed VAF and local ploidy.
- Inferred CCF if mutation is on the major allele.
- Inferred CCF if mutation exists in one copy.

For each of these, error intervals and probabilities are provided.
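For orientation, the commonly used point estimate behind a CCF value is shown below as a Python sketch. It assumes a diploid normal genome (two copies in normal cells) and is not facets-suite's code. The error intervals and probabilities mentioned above are not reproduced, and the function names are hypothetical.

```python
def ccf_point_estimate(vaf: float, purity: float, total_cn: int, mutant_copies: int) -> float:
    """CCF = VAF * (purity * total_cn + 2 * (1 - purity)) / (purity * mutant_copies)."""
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if mutant_copies < 1:
        raise ValueError("mutant_copies must be at least 1")
    return vaf * (purity * total_cn + 2 * (1 - purity)) / (purity * mutant_copies)


def expected_mutant_copies(vaf: float, purity: float, total_cn: int) -> int:
    """Copies implied by the observed VAF, rounded, and never below one (assumption)."""
    copies = round(vaf * (purity * total_cn + 2 * (1 - purity)) / purity)
    return max(1, copies)
```

The three configurations in the list correspond to three choices of `mutant_copies`: the expected number of copies, the major-allele copy number, and one.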
# Zygosity
Tumor zygosity of SNVs and indels is estimated using the observed VAF and the expected VAF at the observed tumor purity and local copy number.
# Germline SNVs and Indels
Variant-level annotation, filtering, and flagging of variants with further filter flags occur in the `GermlineCombineChannel` and `GermlineAnnotateMaf` processes. Variants that pass the filters intrinsic to the callers (`FILTER="PASS"` in the VCF files) are combined into a union, giving precedence to HaplotypeCaller for any site where both callers detected a variant. See discussion elsewhere regarding single-sample filtering of HaplotypeCaller variant calls.
Functional effect prediction and MAF file conversion are carried out as described above for somatic calls.
In the final MAF, the columns `Strelka2FILTER`, `gnomAD_FILTER`, `RepeatMasker`, and `EncodeDacMapability`, as well as allele frequencies and counts from gnomAD, are the same as described for somatic variants. Note that `gnomAD_FILTER` is used for filtering of germline variants, unlike for somatic variants.
In addition, the following columns are added to the germline MAF:
- BRCA exchange annotation:
  - `brca_exchange_id`: Variant ID.
  - `brca_exchange_enigma`: Annotation from the ENIGMA consortium.
  - `brca_exchange_clinvar`: Annotation from ClinVar.
- `ch_gene`: Boolean indicating whether the gene is associated with the presence of clonal hematopoiesis (CH) (ASXL1, ATM, BCOR, CALR, CBL, CEBPA, CREBBP, DNMT3A, ETV6, EZH2, FLT3, GNAS, IDH1, IDH2, JAK2, KIT, KRAS, MPL, MYD88, NF1, NPM1, NRAS, PPM1D, RAD21, RUNX1, SETD2, SF3B1, SH2B3, SRSF2, STAG2, STAT3, TET2, TP53, U2AF1, WT1, and ZRSR2).
- The following read depth-based flags are parameterized according to the sequencing platform, see the `exome.config` and `genome.config` files:
  - `low_n_depth`: Variant falls below lower threshold for total depth in normal.
  - `low_n_vaf`: Variant falls below lower threshold for normal VAF.
  - `ch_mutation`: Variant occurs in CH gene and occurs below lower threshold for normal VAF and below tumor VAF 0.25.
  - `t_in_n_contamination`: Tumor VAF is more than three-fold the normal VAF.
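A minimal sketch of the germline flags listed above follows. The default thresholds `min_n_depth` and `min_n_vaf` are hypothetical placeholders for the values in `exome.config` and `genome.config`. The 0.25 tumor VAF bound and the three-fold ratio come from the flag descriptions. The function name is an assumption.

```python
from typing import List

# Genes associated with clonal hematopoiesis, as listed for the ch_gene column.
CH_GENES = {
    "ASXL1", "ATM", "BCOR", "CALR", "CBL", "CEBPA", "CREBBP", "DNMT3A", "ETV6",
    "EZH2", "FLT3", "GNAS", "IDH1", "IDH2", "JAK2", "KIT", "KRAS", "MPL",
    "MYD88", "NF1", "NPM1", "NRAS", "PPM1D", "RAD21", "RUNX1", "SETD2",
    "SF3B1", "SH2B3", "SRSF2", "STAG2", "STAT3", "TET2", "TP53", "U2AF1",
    "WT1", "ZRSR2",
}


def germline_flags(gene: str, n_depth: int, n_vaf: float, t_vaf: float,
                   min_n_depth: int = 20, min_n_vaf: float = 0.35) -> List[str]:
    """Return germline filter flags; the two default thresholds are hypothetical."""
    flags: List[str] = []
    if n_depth < min_n_depth:
        flags.append("low_n_depth")
    if n_vaf < min_n_vaf:
        flags.append("low_n_vaf")
    if gene in CH_GENES and n_vaf < min_n_vaf and t_vaf < 0.25:
        flags.append("ch_mutation")
    if t_vaf > 3 * n_vaf:
        flags.append("t_in_n_contamination")
    return flags
```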
# Zygosity Analysis
Similar to somatic mutations, tumor zygosity of germline SNVs and indels is estimated using the observed VAF and the expected VAF at the observed tumor purity and local copy number. The difference between the two cases is the calculation of the expected tumor VAF of the variant.
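The difference in expected tumor VAF between the two cases can be illustrated with the standard purity and copy-number model. The sketch assumes a diploid normal genome and a heterozygous germline variant (one copy in normal cells). It is not taken from the pipeline's code, and both function names are hypothetical.

```python
def expected_vaf_somatic(purity: float, total_cn: int, mutant_copies: int) -> float:
    """Somatic variant: only tumor cells carry the alternate allele."""
    return (purity * mutant_copies) / (purity * total_cn + 2 * (1 - purity))


def expected_vaf_germline_het(purity: float, total_cn: int, mutant_copies: int) -> float:
    """Heterozygous germline variant: normal cells contribute one alternate copy."""
    return (purity * mutant_copies + (1 - purity)) / (purity * total_cn + 2 * (1 - purity))
```

For example, at 50% purity and a local copy number of 2, a single-copy somatic variant is expected at a VAF of 0.25, while a heterozygous germline variant that keeps one copy in the tumor is expected at 0.5.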
# Somatic and Germline SVs
Tempo uses two to four callers to identify structural variants. By default, the following callers are used regardless of assay type for both somatic and germline analysis:
When the assay type is WGS, the following caller is used in addition for somatic variant calling only:
The SV workflow in Tempo is significantly influenced by the 2020 PCAWG publication on whole genomes. Similar to the workflow described in their paper, calls from each structural variant caller are provided to mergesvvcf, which converts each call to a normalized representation and merges them using a fixed window size of 200 bp. Any two calls for which each breakpoint is less than 200 bp away and matches relative directionality can be merged.
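The merge criterion can be expressed as a pairwise test. The following is a simplified sketch of that criterion only, not mergesvvcf's implementation. The `SvCall` class and `can_merge` function are hypothetical names, and the requirement that both calls name the same chromosomes is an added assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SvCall:
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def can_merge(a: SvCall, b: SvCall, window: int = 200) -> bool:
    """True if both breakends lie within the window and orientations agree."""
    same_chroms = a.chrom1 == b.chrom1 and a.chrom2 == b.chrom2
    same_orientation = a.strand1 == b.strand1 and a.strand2 == b.strand2
    close = abs(a.pos1 - b.pos1) < window and abs(a.pos2 - b.pos2) < window
    return same_chroms and same_orientation and close
```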
# Filtering and annotating structural variant calls
Using the read support information reported by Delly and Manta, the variants from those callers are subject to the following filters:
- `tumor_read_supp`: The variant is supported by fewer than 5 discordant reads or fewer than 2 split reads in the tumor sample.
- `normal_read_supp`: The variant is supported by any number of reads in the normal sample.
Variants in the merged callset are filtered if they have fewer than the minimum number of supporting callers (1 for exome, 2 for genome). If a caller produced a filter flag for the variant, it is not considered to be a supporting caller.
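The read-support flags and the caller-support rule can be sketched together as follows. The function names and the `flags_by_caller` layout are illustrative assumptions. The numeric cutoffs are the ones stated above.

```python
from typing import Dict, List

MIN_SUPPORTING_CALLERS = {"exome": 1, "genome": 2}


def read_support_flags(t_discordant: int, t_split: int,
                       n_discordant: int, n_split: int) -> List[str]:
    """Flags derived from the read support reported by a caller."""
    flags: List[str] = []
    if t_discordant < 5 or t_split < 2:
        flags.append("tumor_read_supp")
    if n_discordant + n_split > 0:
        flags.append("normal_read_supp")
    return flags


def passes_caller_support(flags_by_caller: Dict[str, List[str]], assay: str) -> bool:
    """A caller counts as supporting only if it raised no filter flag."""
    supporting = [caller for caller, flags in flags_by_caller.items() if not flags]
    return len(supporting) >= MIN_SUPPORTING_CALLERS[assay]
```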
The merged callset is converted from VCF to BEDPE using svtools and the following filters are applied:
- `mappability` and `repeat_masker`: One or both breakends is in a repeat, low-mappability, or hard-to-sequence region. More details in the reference file description.
- `pcawg_blacklist_bed`: One or both breakends is in a region that PCAWG has blacklisted.
- `pcawg_blacklist_bedpe`: The breakpoint is blacklisted by PCAWG.
- `pcawg_blacklist_fb_bedpe`: The breakpoint is blacklisted by PCAWG and is likely a foldback artefact.
- `pcawg_blacklist_te_bedpe`: The breakpoint is blacklisted by PCAWG and is likely a transposable element.
The BED and BEDPE files used for the flags `pcawg_blacklist_bed`, `pcawg_blacklist_bedpe`, `pcawg_blacklist_fb_bedpe`, and `pcawg_blacklist_te_bedpe` are sourced from the SV merging tool used in the PCAWG paper.
In addition to filtering, Tempo also annotates the merged callset using the iAnnotateSV package, and identifies possible cDNA contamination among deletion events that span splice sites. Possible cDNA contamination sites are not filtered.
# Structural Variant Classes
Each breakpoint is described on a single record of the bedpe file, with the coordinates and orientation of both breakends described. The four types of breakends produced by Tempo are as follows:
Class | Abbreviation | Description |
---|---|---|
Breakend | BND | Any event that cannot be described with one of the below terms. The majority of BND are usually translocations. |
Deletion | DEL | Loss of a segment that is spanned by two joined breakends either side. |
Tandem Duplication | DUP | Extra copy of a segment immediately downstream of the template in the same orientation. |
Inversion | INV | A segment inserted into its original position, but in the opposite orientation. Simple inversions are balanced, but in complex inversions the second side of a dsDNA break may not be rescued. |
# BEDPE Format
After merging the variants from different callers, the variants are converted from VCF to BEDPE format using svtools vcftobedpe. Many downstream tools require BEDPE or similar table formats. The PCAWG Working Group also makes use of the BEDPE format. More information is available in the BEDPE file format documentation.
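For readers new to the format, the sketch below parses the ten standard BEDPE columns (two breakend intervals, a name, a score, and two strands) from a tab-delimited line. Any further columns, including those added by svtools or by Tempo's annotation steps, are ignored here. The `Bedpe` class and `parse_bedpe_line` function are hypothetical names.

```python
from typing import NamedTuple


class Bedpe(NamedTuple):
    chrom1: str
    start1: int
    end1: int
    chrom2: str
    start2: int
    end2: int
    name: str
    score: str  # kept as text, since it may be "." when missing
    strand1: str
    strand2: str


def parse_bedpe_line(line: str) -> Bedpe:
    """Parse the first ten tab-separated fields of one BEDPE record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("a BEDPE record needs at least 10 columns")
    return Bedpe(fields[0], int(fields[1]), int(fields[2]),
                 fields[3], int(fields[4]), int(fields[5]),
                 fields[6], fields[7], fields[8], fields[9])
```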