Module 1: Somatic Mutation Profiling

Every tumor carries its own unique set of DNA mutations. To design a personalized cancer vaccine, we first need to find those mutations. This module takes raw sequencing data from a patient's tumor and normal tissue, and produces a curated list of tumor-specific (somatic) mutations that could serve as vaccine targets.

Prepare Reference Genome

Think of a reference genome as a "master map" of an organism's DNA. When we sequence a patient's tumor, we get millions of short DNA fragments, but we don't know where each fragment belongs. The reference genome tells us where to place each piece.

CanFam3.1^[1] is the standard reference genome for dogs (Canis lupus familiaris), maintained by Ensembl. It contains the complete DNA sequence of the dog genome, totaling about 2.4 billion base pairs organized into 38 pairs of autosomes plus sex chromosomes. This is the coordinate system we'll use throughout the entire pipeline.

Once downloaded, we build several index files. These are like a book's table of contents: they let our tools jump directly to any genomic region instead of scanning the entire file. Different tools need different index formats, so we create one for BWA (alignment), one for samtools^[6](region lookup), and a sequence dictionary for GATK^[7] (variant analysis).

We also download dbSNP^[2], a public database of known genetic variants. These are common, well-characterized variants seen across many dogs. Later in the pipeline, we'll use this database to help distinguish real biological variation from sequencing errors. If a variant matches a known dbSNP entry, it's more likely to be real.

# Download reference genome
wget $REFERENCE_URL
gunzip Canis_lupus_familiaris.CanFam3.1.dna.toplevel.fa.gz
mv Canis_lupus_familiaris.CanFam3.1.dna.toplevel.fa $REFERENCE_FA

# Index with BWA and samtools
bwa index $REFERENCE_FA
samtools faidx $REFERENCE_FA

# Create sequence dictionary
gatk CreateSequenceDictionary -R $REFERENCE_FA -O ${REFERENCE_FA%.fa}.dict

# Download & Index dbSNP
wget $DBSNP_URL
mv canis_lupus_familiaris.vcf.gz $DBSNP_VCF
tabix -p vcf $DBSNP_VCF

# Download VEP cache for offline annotation
wget -P $VEP_CACHE \
  https://ftp.ensembl.org/pub/release-104/variation/vep/canis_lupus_familiaris_vep_104_CanFam3.1.tar.gz
tar -xzf $VEP_CACHE/canis_lupus_familiaris_vep_104_CanFam3.1.tar.gz -C $VEP_CACHE

wgetbwa indexsamtools faidxgatk CreateSequenceDictionarytabix

Download Paired WES Data

To find tumor-specific mutations, we need two samples from the same patient: the tumor tissue and a matched normal tissue (e.g., blood or healthy tissue). By comparing the two, we can tell which mutations are unique to the tumor (somatic) and which ones the patient was born with (germline). Only the somatic mutations are relevant for vaccine design.

Both samples undergo Whole-Exome Sequencing (WES), which reads the DNA of all protein-coding regions (about 1-2% of the genome). Since most cancer-driving mutations occur in these coding regions, WES is a cost-effective way to find them without sequencing the entire genome.

The raw sequencing data is stored in the NCBI Sequence Read Archive (SRA)^[3], a public repository for sequencing data. We download it using fasterq-dump, which automatically splits each sample into two files: forward reads (R1) and reverse reads (R2). These paired reads give us more confidence in alignment because they come from opposite ends of the same DNA fragment.

# Download and split paired-end fastq from SRA
fasterq-dump --split-files --threads $THREADS $SRR_TUMOR
fasterq-dump --split-files --threads $THREADS $SRR_NORMAL

fasterq-dump (SRA Toolkit)

QC & Trimming

Raw sequencing data isn't ready to use straight out of the machine. Each read may carry leftover adapter sequences, which are short synthetic DNA fragments that were added during library preparation to attach reads to the sequencer. These aren't part of the patient's actual DNA and need to be removed.

Quality also tends to drop toward the tail end of each read. If we leave these low-quality bases in, they'll cause misalignments and false variant calls later. This step acts as a quality gate: we clean the data now so every downstream step can trust it.

We use fastp^[4], an ultra-fast all-in-one tool that handles adapter removal, quality trimming, and filtering in a single pass. By looking at read overlap patterns, it can automatically detect and remove adapters without needing us to specify their exact sequences. It also trims any bases that fall below a Phred quality score of 20 (which corresponds to roughly 99% base-call accuracy), and entirely discards reads that end up shorter than 36 base pairs, as those are too short to be mapped uniquely to the reference genome later on.

After processing, fastp generates per-sample HTML reports with interactive charts showing quality distributions, adapter content, and how many reads survived filtering. These reports are a great sanity check before moving on to the time-intensive alignment step.

# Trim adapters and low-quality bases
fastp \
  -i raw/${SRR}_1.fastq -I raw/${SRR}_2.fastq \
  -o trimmed/${SRR}_1.trimmed.fastq.gz -O trimmed/${SRR}_2.trimmed.fastq.gz \
  --thread $THREADS \
  --qualified_quality_phred 20 \
  --length_required 36 \
  --detect_adapter_for_pe \
  --json qc/${SRR}_fastp.json --html qc/${SRR}_fastp.html

fastp

Alignment & BAM Processing

Now we take each cleaned read and figure out where it belongs on the reference genome. This process is called alignment(or mapping), and it's the most compute-intensive part of the pipeline. We use BWA-MEM^[5], the gold-standard aligner for short Illumina reads, which can handle insertions, deletions, and partial matches.

After alignment, reads are sorted by their genomic position and stored in a BAM file, a compressed binary format that most genomics tools understand. But the BAM isn't ready yet. During library preparation, some DNA fragments get copied multiple times by PCR. These PCR duplicates look like independent evidence but are really clones of the same molecule. We flag them with MarkDuplicatesso they don't artificially inflate our confidence in a variant.

Next comes Base Quality Score Recalibration (BQSR). The sequencer assigns each base a confidence score, but these scores aren't always accurate. The machine tends to be over- or under-confident in systematic ways. BQSR uses the known variants from dbSNP (downloaded in Step 01) to learn these error patterns and correct them. This makes downstream variant calling significantly more reliable.

Finally, we generate summary statistics: mapping rate (what percentage of reads aligned successfully), duplication rate, and overall coverage. These numbers give us a quick health check on data quality before moving to variant calling.

# 1. Align and sort
bwa mem -t $THREADS \
  -R "@RG\tID:$SRR\tSM:$SRR\tPL:ILLUMINA\tLB:$SRR" \
  $REF \
  trimmed/${SRR}_1.trimmed.fastq.gz trimmed/${SRR}_2.trimmed.fastq.gz | \
  samtools sort -@ $THREADS -o alignment/${SRR}.sorted.bam -

# 2. Mark duplicates
gatk MarkDuplicates \
  -I alignment/${SRR}.sorted.bam -O alignment/${SRR}.sorted.markdup.bam \
  -M alignment/${SRR}.markdup_metrics.txt \
  --REMOVE_DUPLICATES false --CREATE_INDEX true

# 3. Base Quality Score Recalibration
gatk BaseRecalibrator \
  -R $REF -I alignment/${SRR}.sorted.markdup.bam \
  --known-sites $DBSNP \
  -O alignment/${SRR}.recal_data.table

gatk ApplyBQSR \
  -R $REF -I alignment/${SRR}.sorted.markdup.bam \
  --bqsr-recal-file alignment/${SRR}.recal_data.table \
  -O alignment/${SRR}.sorted.markdup.bqsr.bam

# 4. Index final BAM and collect stats
samtools index alignment/${SRR}.sorted.markdup.bqsr.bam
samtools flagstat alignment/${SRR}.sorted.markdup.bqsr.bam
samtools stats alignment/${SRR}.sorted.markdup.bqsr.bam

bwa memsamtools sortgatk MarkDuplicatesgatk BaseRecalibratorgatk ApplyBQSRsamtools indexsamtools flagstatsamtools stats

Somatic Variant Calling

This is the heart of Module 1, where we actually discover tumor-specific mutations. GATK Mutect2^[8]is a somatic variant caller specifically designed for cancer genomics. It looks at the tumor and normal BAMs side by side, position by position, asking: "Is this base different in the tumor compared to the normal?" If a variant is present in the tumor but absent in the normal, it's flagged as a somatic mutation.

Not every call from Mutect2 is real. Sequencing and alignment artifacts can mimic somatic mutations. FilterMutectCalls applies a series of statistical filters (checking for strand bias, contamination, and other common artifacts) to separate signal from noise. Only variants that pass all quality filters are retained.

The final output is a VCF (Variant Call Format) filecontaining high-confidence somatic mutations. Each entry describes a specific genomic position where the tumor differs from the normal, along with quality metrics and allele frequencies.

This curated variant list is the foundation for everything that follows. In the next modules, these mutations will be translated into protein sequences, checked for expression, and evaluated as potential neoantigen vaccine targets.

# 1. Somatic variant calling
gatk Mutect2 \
  -R $REF \
  -I $TUMOR_BAM -I $NORMAL_BAM \
  -normal $SRR_NORMAL \
  --germline-resource $DBSNP \
  -O variants/somatic.unfiltered.vcf.gz \
  --native-pair-hmm-threads $THREADS

# 2. Filter artifacts
gatk FilterMutectCalls \
  -R $REF \
  -V variants/somatic.unfiltered.vcf.gz \
  -O variants/somatic.filtered.vcf.gz

# 3. Extract purely PASS variants
bcftools view -f PASS variants/somatic.filtered.vcf.gz \
  -Oz -o variants/somatic.pass.vcf.gz
bcftools index -t variants/somatic.pass.vcf.gz

gatk Mutect2gatk FilterMutectCallsbcftools viewbcftools index

Variant Annotation

Finding somatic mutations is not enough. We need to know what each mutation actually does to a gene. Does it change an amino acid? Does it truncate a protein? Or does it fall in a non-coding region with no functional impact? This step uses Ensembl VEP (Variant Effect Predictor)^[10] to annotate every somatic variant with its predicted biological consequence.

VEP runs in offline mode using a locally cached copy of the Ensembl annotation database (release 104), matching our GTF. For each variant, it reports the affected gene, transcript, protein position, amino acid change, and a consequence type (e.g., missense_variant, frameshift_variant, stop_gained). The --everything flag enables all available annotations including SIFT and PolyPhen predictions.

After annotation, we filter for coding variants that alter protein sequence: missense mutations, frameshifts, in-frame insertions/deletions, stop gains, and start losses. These are the variants capable of producing abnormal proteins (neoantigens) that the immune system could recognize. Non-coding or synonymous variants are excluded since they do not change the protein and cannot serve as vaccine targets.

# Run Ensembl VEP annotation
vep \
  --input_file somatic.pass.vcf.gz \
  --output_file somatic.pass.annotated.vcf \
  --format vcf --vcf \
  --species canis_lupus_familiaris \
  --offline --cache \
  --dir_cache $VEP_CACHE \
  --cache_version 104 \
  --fasta $REF \
  --everything \
  --force_overwrite \
  --fork $THREADS

# Extract coding variants
grep -E "^#|missense_variant|frameshift_variant|inframe_insertion|inframe_deletion|stop_gained|start_lost" \
  somatic.pass.annotated.vcf > somatic.pass.coding.vcf

vep (Ensembl VEP)grep

Expected Metrics & QC Checkpoints

Before moving on to the next modules, it is essential to review the quality metrics from Module 1. Poor DNA sequencing depth or high contamination will cascade into false-positive neoantigen predictions later on. Check the generated HTML reports and GATK logs for these recommended thresholds:

1. Raw Reads (fastp)

Passing Filter> 90%
Q30 Bases> 85%
Adapter Content< 5%

If Q30 is below 80%, consider aggressive trimming or re-sequencing.

2. Alignment (BAM)

Mapping Rate> 95%
PCR Duplicates< 30%
Target Coverage
(Tumor / Normal)≥ 100X
≥ 50X

Low coverage severely limits the sensitivity of detecting low-frequency mutations.

3. Variants (VCF)

Somatic SNVs100 - 3,000+
(Highly variable)
Ti/Tv Ratio~ 2.0 - 2.1
Contamination< 2%

Extreme outliers in mutation count suggest normal tissue contamination or poor tumor purity.

References

Lindblad-Toh, K. et al. (2005). Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438, 803–819. doi:10.1038/nature04338 [CanFam3.1 reference genome]
Sherry, S.T. et al. (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Research, 29(1), 308–311. doi:10.1093/nar/29.1.308 [dbSNP]
Leinonen, R. et al. (2011). The Sequence Read Archive. Nucleic Acids Research, 39, D19–D21. doi:10.1093/nar/gkq1019 [SRA / fasterq-dump]
Chen, S. et al. (2018). fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884–i890. doi:10.1093/bioinformatics/bty560 [fastp]
Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997. arXiv:1303.3997 [BWA-MEM]
Li, H. et al. (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. doi:10.1093/bioinformatics/btp352 [SAMtools]
McKenna, A. et al. (2010). The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297–1303. doi:10.1101/gr.107524.110 [GATK]
Benjamin, D. et al. (2019). Calling somatic SNVs and indels with Mutect2. bioRxiv. doi:10.1101/861054 [Mutect2]
Danecek, P. et al. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2). doi:10.1093/gigascience/giab008 [BCFtools]
McLaren, W. et al. (2016). The Ensembl Variant Effect Predictor. Genome Biology, 17, 122. doi:10.1186/s13059-016-0974-4 [Ensembl VEP]