2. WHAT IS WES?
Sequencing of the whole exome (the protein-coding regions of the
genome).
Rabbani et al. report that 85% of Mendelian disorders are linked to
mutations in exonic regions.
WES therefore can have great clinical utility.
Local connection: In 2010, Dr. Elizabeth Worthey of Medical College
of Wisconsin sequenced an exome of a child with severe ulcerative
colitis.
3. WES DATA
All NGS assays use the same data storage formats for output:
FASTQ and BAM.
However, in RNA-Seq, we were interested in gene counts.
In WES data, we are interested in the differences between the human
reference sequence and the sample data.
We will annotate these differences to see if they are deleterious or
not.
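To make "differences between the reference and the sample" concrete, here is a toy sketch in shell. It assumes two already-aligned sequences of equal length, which is a strong simplification: real variant callers work from aligned reads, and the function name list_differences is made up for this illustration.

```shell
# Toy illustration only: compares two pre-aligned strings of equal length.
# list_differences REF SAMPLE
# Prints "position ref_base>sample_base" (1-based) for each mismatch.
list_differences() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { ref = $0 }
        NR == 2 {
            n = length(ref)
            for (i = 1; i <= n; i++) {
                r = substr(ref, i, 1)
                s = substr($0, i, 1)
                if (r != s) print i, r ">" s
            }
        }'
}

# Prints "5 A>T" and "8 T>A", one per line.
list_differences "ACGTACGT" "ACGTTCGA"
```

Each printed line plays the role of one variant record: a position, the reference base, and the base seen in the sample.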
4. WES DATA
• Genomes are getting cheaper
and cheaper.
• The SRA (NCBI Sequence Read
Archive) has trillions of base
pairs worth of data.
5. WHOLE EXOME PIPELINE
• We will be using a program called
SeqMule to automate the analysis of
our whole exome data.
6. PAIRED END SEQUENCING
• NGS data is almost always in a paired-end
format, which means that there are two files
associated with a particular run.
• For more information on the concept, I
refer you to http://goo.gl/7FKH6j.
7. STEP 1: DOWNLOAD DATA FROM
SRA
• The HapMap project sequenced many
populations, including individuals of European
ancestry from Utah.
• One of these individuals, a child known only
by the sample accession number NA12878, is
probably the most sequenced individual on
Earth. You will download and analyze this
individual yourself.
• For demonstration, we will be downloading
another individual from the same cohort,
named NA07000.
8. STEP 1: DOWNLOAD DATA FROM
SRA
• Go to the SRA-DNAnexus website and enter
SRR766039.
• Find the SRA file and download it.
9. STEP 1: DOWNLOAD DATA FROM
SRA
• Create a new folder in Linux and download the
SRA file into the folder.
• Commands:
mkdir NA07000; cd NA07000
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR766/SRR766039/SRR766039.sra
fastq-dump --split-3 SRR766039.sra
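After fastq-dump finishes, it is worth checking that the forward and reverse files contain the same number of reads before going further. The sketch below relies only on the fact that a FASTQ record is four lines long; the helper names count_reads and check_pair are made up for this example.

```shell
# count_reads FILE: number of FASTQ records (4 lines per record).
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# check_pair FORWARD REVERSE: succeed only if both files hold the same
# number of reads, as expected for paired-end data.
check_pair() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $n1 read pairs"
    else
        echo "MISMATCH: $n1 vs $n2 reads" >&2
        return 1
    fi
}
```

For this run, the call would be: check_pair SRR766039_1.fastq SRR766039_2.fastq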
10. STEP 2: RUN FASTQC
• While still in the NA07000 folder, run FastQC
to get quality metrics.
11. STEP 2: RUN FASTQC: FORWARD
READ
• 101 bp sequences, good quality
throughout.
12. STEP 2: RUN FASTQC: REVERSE
READ
• Illumina instruments typically show
quality degradation at the 3’ end of
reverse reads.
• This pair of FASTQs does not need
trimming.
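The read length reported by FastQC can be cross-checked directly from the FASTQ file. This is a minimal sketch, not a replacement for FastQC: it only tabulates sequence lengths, and the function name read_lengths is made up for this example.

```shell
# read_lengths FILE: prints "length count" for each distinct read length.
# The sequence is always the 2nd line of each 4-line FASTQ record, so we
# select by line position rather than by pattern (quality lines may
# themselves start with "@").
read_lengths() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) print len, n[len] }' "$1" | sort -n
}
```

If every read in SRR766039_1.fastq is 101 bp, as FastQC reports, read_lengths SRR766039_1.fastq should print a single line starting with 101.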
13. STEP 3: UNDERSTAND SEQMULE
• Once we have the FASTQ files, we will then use
a program called SeqMule to:
1. Align the reads to the reference genome.
2. De-duplicate the alignment to remove PCR duplicates.
3. Re-align the reads around insertions and deletions.
4. Call variants and create a VCF of consensus calls.
5. Produce plots of coverage.
14. STEP 3: UNDERSTAND SEQMULE
• Type: less ~/NGSTools/SeqMule/advanced_config to
see the config file.
• In this file, the following steps are set to 1
(run = true):
• 2P_bwamem=1 #BWA-MEM alignment
• 3p_samtools_rmdup=1 #use MarkDuplicates from Picard tools to
mark duplicates
• 4p_samtools_filter=1 #use 'samtools view' command to filter reads
under 30 MAPQ
• 6px_gatklite_realign=1 #use GenomeAnalysisTKLite from GATK to
generate GATK intervals and then do realignment
• 8p_gatk_HaplotypeCaller=1
• 8p_samtools_mpileup=1
• 8p_freebayes=1
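To show what the MAPQ filter step does, here is a small sketch on plain SAM text. In SAM, header lines start with "@" and mapping quality is the 5th tab-separated column. On a real BAM file the analogous command would be something like samtools view -b -q 30 in.bam > out.bam; whether SeqMule calls samtools exactly this way is an assumption, and the function name filter_mapq is made up for this example.

```shell
# filter_mapq MIN_MAPQ < in.sam > out.sam
# Keeps header lines and alignments whose MAPQ (column 5) is at least
# MIN_MAPQ; alignments below the threshold are dropped.
filter_mapq() {
    awk -F '\t' -v min="$1" '
        /^@/ { print; next }          # pass SAM header lines through
        $5 + 0 >= min + 0             # keep well-mapped alignments
    '
}
```

For example, filter_mapq 30 < aligned.sam > filtered.sam would keep only reads with MAPQ of 30 or higher (aligned.sam is a placeholder file name).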
15. STEP 4: RUN SEQMULE
• While in the NA07000 folder, run this command:
• seqmule pipeline -a SRR766039_1.fastq -b SRR766039_2.fastq
-e -prefix NA07000 -threads 7 -capture default
• -a: forward read
• -b: reverse read
• -e: exome data
• -prefix: what you want to name the sample
• -threads: how many cores you want for alignment. 7 is good enough.
• -capture: default exome
• SeqMule should start running and should not exit
immediately.
• Wait 4 hours.
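Because the run takes hours, it can save time to confirm that both input files exist and are not empty before starting. A minimal sketch follows; check_inputs is a made-up helper name, not part of SeqMule.

```shell
# check_inputs FILE...: report every file that is missing or empty and
# return non-zero if any such file was found.
check_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Intended use (left commented out so the snippet is safe to paste):
# check_inputs SRR766039_1.fastq SRR766039_2.fastq &&
#     seqmule pipeline -a SRR766039_1.fastq -b SRR766039_2.fastq \
#         -e -prefix NA07000 -threads 7 -capture default
```

With the && in place, the pipeline only starts when both FASTQ files pass the check.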
16. STEP 5: EXAMINE OUTPUT
• Open the NA07000_report folder after completion.
• Open the summary.html file to observe the results of the
SeqMule run.
17. STEP 6: ANNOTATE CONSENSUS
VCF
• Go to http://wannovar.usc.edu/ to use
wANNOVAR, a web tool to annotate genomic
variants. I have used custom filtering to keep only
variants which are found in less than 5% of the
population. Press Submit when ready.
18. STEP 7: DOWNLOAD CSV FILE AND
FILTER
• When wANNOVAR is complete, you have the following
options.
1. Download the full annotation in CSV or TXT format to
open in Excel for manipulation.
2. Download the Step 3 VCF (if you used Custom Filtering)
and re-annotate the VCF a second time to only annotate
your filtered variants.
3. Use IGV to confirm variant depth by opening the realigned
BAM file.
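Before opening IGV, it can help to list which variants have low read depth so you know where to look. The sketch below assumes the VCF records depth as a DP=... entry in the INFO column (column 8), which is common but not guaranteed for every caller. The function name low_depth_variants and the threshold of 10 are choices made for this example.

```shell
# low_depth_variants MIN_DP < in.vcf
# Prints CHROM, POS, REF, ALT and DP for records whose INFO DP value is
# below MIN_DP. Records without a DP entry in INFO are skipped.
low_depth_variants() {
    awk -F '\t' -v min="$1" '
        /^#/ { next }                      # skip VCF header lines
        {
            n = split($8, info, ";")       # INFO is ";"-separated
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^DP=/) {
                    dp = substr(info[i], 4) + 0
                    if (dp < min + 0) print $1, $2, $4, $5, dp
                }
            }
        }'
}
```

For example, low_depth_variants 10 < consensus.vcf lists the positions worth confirming in IGV (consensus.vcf is a placeholder file name).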
19. NOW IT’S YOUR TURN!
• You will run sample NA12878 through our whole-exome pipeline.
1. Create the sample folder.
2. Download a high-quality exome run for NA12878 using these
commands:
1. wget https://s3.amazonaws.com/bcbio_nextgen/NA12878-NGv3-LAB1360-A_1.fastq.gz
2. wget https://s3.amazonaws.com/bcbio_nextgen/NA12878-NGv3-LAB1360-A_2.fastq.gz
3. Run FastQC on the reads.
4. Run SeqMule: seqmule pipeline -a NA12878-NGv3-LAB1360-A_1.fastq.gz
-b NA12878-NGv3-LAB1360-A_2.fastq.gz -e -prefix NA12878 -capture default
5. Upload consensus VCF to wANNOVAR, open realigned BAM in IGV,
and explore the most sequenced genome in the world!