6. Phasing
In this section we will cover:
You will learn to:
- Phase a vcf file
- Phase a pedigree
- Visualise pedigree information on IGV
- Annotate bam file with phase and haplotype information
N.B.: Note these scripts use path/to/Course_Materials/
as working directory (wd). To check what directory you are using as wd
do pwd
. If the output is different from path/to/Course_Materials
please go to Course_Materials/
with the cd
command.
We create the output directory and then we set two variables: IN
and OUT.
They store the path for the reference/input files and output files respectively.
mkdir -p wd/day2/phasing/
IN="data/phasing"
OUT="wd/day2/phasing"
VCF files
Let’s have another quick look at the VCF file structure:
bcftools view -H ${IN}/res/sample1_short_reads.vcf.gz | head | less -S
For info on the VCF structure go here
VCF stats
As we always do let’s have a look at the VCF stats report, as a quick quality check.
# Print some stats about the VCF file
bcftools stats ${IN}/res/sample1_short_reads.vcf.gz
# Select the stats information you want to visualise
bcftools stats ${IN}/res/sample1_short_reads.vcf.gz | grep -w "ST"
Annotate VCF files VEP
It is extrimely helpful to know where and what are the variants in my sample. However, as important as the variants is their allele frequency (AF). There is a widely used tool that can annotate VCF files with AF: VEP. VEP has a command line tool but also a web interface. If you go (here)[https://www.ensembl.org/Tools/VEP] you can annotate your VCF via a browser submission.
(bcftools view -h ${IN}/res/sample1_short_reads.vcf.gz | tail -n 1 | cut -f 1-7 ; bcftools view -H ${IN}/res/sample1_short_reads.vcf.gz | head | cut -f 1-7) > ${OUT}/dummy_file.vcf
#view the dummy vcf we made
cat ${OUT}/dummy_file.vcf | less -S
There is a simple VEP
command line to use. Which can be something like vep -i file.vcf --cache
. But you could also use the web interface, specially if you need to annotate a few files.
Phasing - WhatsHap
For this practical we are going to use the WhatsHap tool, which uses python and is based on the homonym algorithm. To download WhatsHap.
WhatsHap can phase SNVs and indels, but it does not work on structural variants. It is very easy to use: it takes as input a BAM file and a VCF file and returns a second VCF file improved with the phasing information. BAM and VCF files don’t have to be derived from the same set of reads. This is convenient bacause one can use high-quality short reads (such us Illumina reads) to call the variants and then use long reads (such us Oxford Nanopore reads) to phase them. Indeed, this workflow that has just been described is the raccomended one. Please note that using 2 different sets of reads may create some problems to the WhatsHap algortihm, which is trying to match the sample name (@RG
header line) of the BAM file to the VCF file. It tries to match them because this tool can process VCF and BAM containing different samples in the same file. Nonetheless, there are a couple of ways to work around this problem.
The main WhatsHap subcommand is phase
, which is the backbone of the entire tool, it uses reads containing 2 or more variants to assign them to one allele or the other. In this example we are going to phase a test VCF.
# The bamtools needs to be indexed bedore, so we do:
samtools index ${IN}/res/sample1_long_reads.bam
whatshap phase --indels --ignore-read-groups --reference ${IN}/res/reference_chr20.fa -o ${OUT}/sample1_phased.vcf ${IN}/res/sample1_short_reads.vcf.gz ${IN}/res/sample1_long_reads.bam
--ignore-read-groups
asks whatshap
to overlook the sample name information. But, because of this we have to be sure that the input files refer to the same sample.
You can use the following commands to check the sample names in the 2 files:
samtools view -H ${IN}/res/sample1_long_reads.bam | grep "@RG"
# sample name isthe one starting with "SM:"
bcftools query -l ${IN}/res/sample1_short_reads.vcf.gz
WhatsHap was able to discriminate the allele for 4601 variants as it printed out in the report (see Found 19244 reads covering 4601 variants
). If we now look at SNVs and how they are distributed on the alleles, we should see that the command has divided them into allele 1 and allele 2 (when this was possible):
# Before the phasing
bcftools view -H ${IN}/res/sample1_short_reads.vcf.gz | grep -c "0/1"
bcftools view -H ${IN}/res/sample1_short_reads.vcf.gz | grep -c "0|1"
bcftools view -H ${IN}/res/sample1_short_reads.vcf.gz | grep -c "1/0"
bcftools view -H ${IN}/res/sample1_short_reads.vcf.gz | grep -c "1|0"
# After the phasing
bcftools view -H ${OUT}/sample1_phased.vcf | grep -c "0/1"
bcftools view -H ${OUT}/sample1_phased.vcf | grep -c "0|1"
bcftools view -H ${OUT}/sample1_phased.vcf | grep -c "1/0"
bcftools view -H ${OUT}/sample1_phased.vcf | grep -c "1|0"
There are 2 different genotype separators (i.e. | or / ). They refer to phased (‘ |
’) and unphased (‘/’) reads. After the whatshap phasing you’ll see that the majority of variants have been assigned to one allele or the other. |
Phasing - Visualization
Visualise Haplotypes
It is always good practice to visually inspect our data, for this purpose we could use IGV. WhatsHap has a script to convert the VCF annotation to a GTF format that can be loaded on IGV.
whatshap stats --gtf=${OUT}/sample1_phased.gtf ${OUT}/sample1_phased.vcf
This command will print out some stats about the haplotype blocks called in the previous command and it will also create a GTF file that we can load on IGV. Open the IGV browser and load the ${OUT}/sample1_phased.gtf
file you just created.
Go on chr20:40000000-45000000 to look at the subset we are using.
Visualise Grouped Reads
We could also visualise haplotype blocks and alleles directly on the reads. For this, we will need the VCF file we created and the BAM file used to create it. The command adds 2 tags to the bam file PS
and HP
which stand for phase
and haplotype
respectively. On IGV we can use this tag to colour and order our reads.
bcftools view -O z -o ${OUT}/sample1_phased.vcf.gz ${OUT}/sample1_phased.vcf
bcftools index ${OUT}/sample1_phased.vcf.gz
mv ${OUT}/sample1_phased.vcf.gz.csi ${OUT}/sample1_phased.vcf.gz.tbi
whatshap haplotag --ignore-read-groups -o ${OUT}/sample1_haplotagged.bam --reference ${IN}/res/reference_chr20.fa ${OUT}/sample1_phased.vcf.gz ${IN}/res/sample1_long_reads.bam
samtools index ${OUT}/sample1_haplotagged.bam
Now, open the ${OUT}/sample1_haplotagged.bam
file on IGV.
Color alignment by
> tag
and insert PS
and Group alignment by
> tag
and insert PS
Phasing - Pedigrees
If we have sequencing data coming from our sample’s parents, we could use them to phase the VCF file according to the paternal and maternal alleles. To do this, we need BAM and VCF files of the trio and an additional file called PED. PED files have this structure:
family_ID, proband_id, paternal_id, maternal_id, sex, phenotype
To use this WhatsHap
function we need to merge the VCF files into one as well as the bam files.
# Merge the VCF files
bcftools index ${IN}/res/sample1_short_reads.vcf.gz
bcftools index ${IN}/res/sample2_short_reads.vcf.gz
bcftools index ${IN}/res/sample3_short_reads.vcf.gz
bcftools merge -O z -o ${OUT}/merged_pedigree_samples.vcf.gz --merge all ${IN}/res/sample1_short_reads.vcf.gz ${IN}/res/sample2_short_reads.vcf.gz ${IN}/res/sample3_short_reads.vcf.gz
#Merge the BAM files
samtools index ${IN}/res/sample2_short_reads.bam
samtools index ${IN}/res/sample3_short_reads.bam
samtools merge -r -O BAM ${OUT}/merged_pedigree_samples.bam ${IN}/res/sample1_long_reads.bam ${IN}/res/sample2_short_reads.bam ${IN}/res/sample3_short_reads.bam
samtools index ${OUT}/merged_pedigree_samples.bam
After we created the new input files, we can run the whatshap phase
command:
whatshap phase --ped ${IN}/res/PED_ONT.txt -o ${OUT}/pedigree_phased.vcf --reference ${IN}/res/reference_chr20.fa ${OUT}/merged_pedigree_samples.vcf.gz ${OUT}/merged_pedigree_samples.bam
Once we have phased the pedigree, the output will look something like this:
# NB this VCF file has been simplified
CHROM POS ID REF ALT FORMAT SAMPLE
chr20 40003760 rs6071894 G T GT 1|0 # Alternative allele comes from the father
chr20 44705582 rs373596784 AAT A GT 0|1 # Alternative allele comes from the mother
chr20 44928864 rs57778806;rs796352932 CAA CA,C GT 1|2 # Both alleles are different from the reference genome, where 1 comes from paternal and 2 from maternal allele
The paternal_allele | maternal_allele
annotation used in WhatsHap is a convention started with the 1000 genomes project and kept afterwards.
Open the phasing/pedigree_phased.vcf
file on IGV and have a look at the VCF file. It reports the variants of the trio.
Phasing - Excercise
If you made it this far and you still have time and stamina, there is a quick exercise you may want to do. For sample2 and sample3 we have VCF and BAM files. However, the BAM files are derived from Illumina sequencing. You can try to phase this samples and compare the haplotype you get to the one we got for sample1.
References
Marcel Martin, Murray Patterson, Shilpa Garg, Sarah O. Fischer, Nadia Pisanti, Gunnar W. Klau, Alexander Schoenhuth, Tobias Marschall. WhatsHap: fast and accurate read-based phasing, bioRxiv 085050
Hahne F, Ivanek R (2016). “Statistical Genomics: Methods and Protocols.” In Mathé E, Davis S (eds.), chapter Visualizing Genomic Data Using Gviz and Bioconductor, 335–351. Springer New York, New York, NY. ISBN 978-1-4939-3578-9