5. Variant calling

In this section we will cover:

You will learn to:

vcalling

Working directory

Before we start, set up your wd and go into it:

mkdir -p ~/Course_Materials/wd/day2
cd ~/Course_Materials/wd/day2

And make the following directories:

mkdir variant_calling
mkdir variant_calling/snvs
mkdir variant_calling/svs
mkdir annotation
mkdir circos

You can also define the following variables that we will use later for convienience:

LRS_bam=~/Course_Materials/data/variant_calling/bams/LRS_alignment.bam
SRS_bam=~/Course_Materials/data/variant_calling/bams/SRS_alignment.bam
SRS_snvs=~/Course_Materials/data/variant_calling/snvs/SRS_SNVs.vcf.gz
ref=~/Course_Materials/data/variant_calling/reference_genome/Homo_sapiens.GRCh38.dna.fasta

Single Nucleotide Variant Calling

In this section we will identify SNVs from long-read sequencing data using Longshot.

Variants are called and stored in VCF format. This contains a header, and then data lines each containing information about a position in the genome.

img_3

Call SNVs in the nanopore aligment example as below:

longshot --bam $LRS_bam --ref $ref --out variant_calling/snvs/LRS_SNVs.vcf

When the SNV calling is done, compress and index the VCF file:

bgzip variant_calling/snvs/LRS_SNVs.vcf
tabix -p vcf variant_calling/snvs/LRS_SNVs.vcf.gz

Now you can compare the intersection between both LRS- and SRS-VCF files using bcftools:

bcftools isec -p variant_calling/snvs/isec variant_calling/snvs/LRS_SNVs.vcf.gz $SRS_snvs

This will create a folder named isec with the following files:

isec/0000.vcf   for records private to the long_reads_VCF
isec/0001.vcf   for records private to the short_reads_VCF
isec/0002.vcf   for records from long_reads_VCF shared by both
isec/0003.vcf   for records from short_reads_VCF shared by both

Hint: to count the number of variants (= number of rows in file excluding header) you can use the next command:

for f in variant_calling/snvs/isec/000*.vcf
do
    echo $f `bcftools view -H $f | wc -l`
done

Note that most of the variant that are private to short-reads are indels, since longshot does not call them.

To get the number of SNPs called by short-reads, and missed by long-reads, run:

bcftools view -HV indels variant_calling/snvs/isec/0001.vcf | wc -l

And to get how many of those are only PASS variants:

bcftools view -f PASS -HV indels variant_calling/snvs/isec/0001.vcf | wc -l

Let’s explore them in the IGV! Open IGV and load the Homo_sapiens.GRCh38.dna.fasta ($ref) reference genome by selecting Genomes>Load from File or Genomes>Load from URL. The new genome will be added to the drop-down menu, and also loaded and displayed in the IGV window.

Then, load your BAM files $LRS_bam and $SRS_bam.

Feel free to look at any of the variants. We also suggest:

Structural Variant Calling

Algorithms for calling SVs from long-read sequencing data include:

Here, we will use Sniffles for calling structural variants.

sniffles -m $LRS_bam -v variant_calling/svs/LRS_SVs.vcf

If you want to look at high quality SVs, you can change the -s parameter to 20, where s is the minimum number of reads that support a SV (by default is 10).

sniffles -m $LRS_bam -v variant_calling/svs/LRS_SVs_s20.vcf -s 20

The information that is provided in sniffles’s output can be found in: http://github.com/fritzsedlazeck/Sniffles/wiki/Output

Sort the VCF files (for that you’d need to compress them first) and index them:

bgzip variant_calling/svs/LRS_SVs.vcf
vcf-sort variant_calling/svs/LRS_SVs.vcf.gz | bgzip -c > variant_calling/svs/LRS_SVs.sort.vcf.gz
tabix -p vcf variant_calling/svs/LRS_SVs.sort.vcf.gz

Do the same for the s20 VCF

bgzip variant_calling/svs/LRS_SVs_s20.vcf
vcf-sort variant_calling/svs/LRS_SVs_s20.vcf.gz | bgzip -c > variant_calling/svs/LRS_SVs_s20.sort.vcf.gz
tabix -p vcf variant_calling/svs/LRS_SVs_s20.sort.vcf.gz

Now, we will compare how many SVs were called in each VCF:

bcftools view -H variant_calling/svs/LRS_SVs.sort.vcf.gz | wc -l
bcftools view -H variant_calling/svs/LRS_SVs_s20.sort.vcf.gz | wc -l

To investigate the differences, we are going to intersect both VCF files:

bcftools isec -p variant_calling/svs/isec variant_calling/svs/LRS_SVs.sort.vcf.gz variant_calling/svs/LRS_SVs_s20.sort.vcf.gz

Run the next command to know how many variants have been identified in each VCF file:

for f in variant_calling/svs/isec/000*.vcf
do
    echo $f `bcftools view -H $f | wc -l`
done

Make a Report

We will now make a report, like the one you made in the Quality Control section, but for the SVs you just called.

For that, we will use an already existing tutorial.

Open the config.yaml file:

cat ~/Course_Materials/data/variant_calling/ont_tutorial_sv/config.yaml

That should look like this:

---
# this config.yaml is passed to Snakefile in pipeline-structural-variation subfolder.
# Snakemake is run from this pipeline-structural-variation folder; it is necessary to
# pass an appropriate path to the input-files (the ../ prefix is sufficient for this demo)

# FASTA file containing the reference genome
reference_fasta: "~/Course_Materials/data/variant_calling/reference_genome/Homo_sapiens.GRCh38.dna.fasta"
# Sample name
sample_name: "SAMPLE"

##################################
## Tutorial specific parameters ##
##################################
tutorialText: FALSE
biocGenome: "hg38"
GeneIdMappings: "org.Hs.eg.db"
GenomeAnnotation: "TxDb.Hsapiens.UCSC.hg38.knownGene"
RepeatDB: "~/Course_Materials/data/variant_calling/repeatmasker/hg38.fa.out.gz"
bamFile: "~/Course_Materials/data/variant_calling/bams/LRS_alignment.bam"
vcfFile: "~/Course_Materials/wd/day2/variant_calling/svs/LRS_SVs.vcf.gz"

This is the config file that points to the path of the files that it requires.

To make the report, we are interested in the ont_tutorial_sv.Rmd file.

Open Rstudio and open it.

Press the “Knitr” button at the top of the page and wait for magic!

help

Again… if magic does not happen, R should have printed a .html file - open it by pasting the below in the terminal!

open ~/Course_Materials/data/variant_calling/ont_tutorial_sv/ont_tutorial_sv.html

Structural Variant Annotation

In order to perform annotation of the SVs from multiple sources, we will use AnnotSV.

AnnotSV annotates SVs with information about the genes (OMIM, ClinGen), regulatory elements (enhancers, promoters), pathogenicity (known from dbVar), frequencies (gnomAD, internal) and breakpoints (GC content, repeats) they overlap.

annotsv

To run a basic SV annotation, we will execute the following command:

AnnotSV -SVinputFile variant_calling/svs/LRS_SVs.sort.vcf.gz -SVinputInfo 1 \
     -genomeBuild GRCh38 \
     -outputDir annotation \
     -typeOfAnnotation full

Information about the output from AnnotSV can be found here.

Now, inspect the calls in PCDH15 gene:

grep PCDH15 annotation/LRS_SVs.sort.annotated.tsv

And visualise them in IGV.

Next, you can also look at the calls overlapping the ZNF804B gene.

grep ZNF804B annotation/LRS_SVs.sort.annotated.tsv

What do you think is happening there?

Yes! This is a complex SV!

Represent cxSVs using Circos

The previous cxSV involves multiple breakpoints across different chromosomes. To characterise it, you are going to represent it using Circos.

Circos needs the following input files:

To save you some time, we already prepared these files and put them under ~/Course_Materials/data/circos.

Open the files to see the information they contain.

Now, you can make the plot using circos software:

cd circos
circos -conf ~/Course_Materials/data/circos/circos.conf

Yes, it is true - making circos plots is not as difficult as it looks like!! Actually, it is quite cool :-)

You can now visualise the plot you just made:

xdg-open circos.png

circos