2. GATK Best Practices?
GATK Best Practices is a standardized guideline, provided by GATK, for the entire process of detecting
variants from the reads produced by sequencing sample DNA.
The GATK forum documents the process in detail, step by step.
3. Sequence pre-processing
• First, check the quality of the raw data using the FastQC tool.
• FastQC performs quality assessment on raw sequence data coming
from high-throughput sequencing pipelines.
• Check the raw data for biased sequences or systematic problems.
4. fastQC result of forward read
• Sequence length ranges from 35 to 76 bp, with 48% GC content
• The sequence encoding is Illumina 1.9
• Sequence quality ranges from Q30 to Q40
• The sequence reads have no adapter content
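The "Illumina 1.9" encoding reported by FastQC stores each base quality as one ASCII character with an offset of 33 (Phred+33). As a minimal sketch, the Python below decodes such a quality string into Phred scores and error probabilities. The function names and the example quality string are illustrative assumptions, not part of FastQC.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 by default) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


# Hypothetical quality string: 'I' is Q40, '?' is Q30, '5' is Q20.
scores = phred_scores("II?5")
print(scores)
print([error_probability(q) for q in scores])
```

Read this way, the Q30 to Q40 range corresponds to between one error in 1,000 and one error in 10,000 base calls.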
5. fastQC result of forward read
• The sequence reads have no adapter content
• The reads have no “N” base content
6. Sequence processing - Reference File Preprocessing
• Step 1: Download raw data & make BWA index file
Download raw data:
$ wget https://www.dropbox.com/sh/ql8d17tuk857269/AADj9NTXslE8Ke8He899vPU8a?dl=0 --content-disposition --no-check-certificate
Download reference index:
$ wget https://www.dropbox.com/sh/z6jqq4o29znv1xe/AABwmY0COapYuDZcMUJBn5ZKa?dl=0 --content-disposition --no-check-certificate
7. Sequence processing - Reference File Preprocessing
• Step 2: Make FASTA Index file
FASTA: The most basic format for representing the obtained sequences (reads)
$ samtools faidx ucsc.hg19.fasta
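The index written by samtools faidx is a small tab-separated .fai file with one row per sequence: name, length, byte offset of the first base, bases per line, and bytes per line. The sketch below computes those five values for FASTA content held in memory. It is a simplified illustration that assumes well-formed input, and the function name is hypothetical.

```python
def fasta_index(fasta_bytes):
    """Compute .fai-style rows (name, length, offset, linebases, linewidth)
    for FASTA content given as bytes. Simplified: assumes every sequence line
    of a record, except possibly the last, has the same width."""
    rows = []
    pos = 0          # byte offset of the current line
    current = None   # [name, length, offset, linebases, linewidth]
    for line in fasta_bytes.splitlines(keepends=True):
        if line.startswith(b">"):
            if current:
                rows.append(tuple(current))
            name = line[1:].split()[0].decode()
            # The sequence starts right after the header line.
            current = [name, 0, pos + len(line), 0, 0]
        elif current is not None:
            bases = len(line.rstrip(b"\r\n"))
            if current[3] == 0:  # first sequence line sets the line geometry
                current[3], current[4] = bases, len(line)
            current[1] += bases
        pos += len(line)
    if current:
        rows.append(tuple(current))
    return rows


# Toy two-sequence FASTA (not real genome data).
for row in fasta_index(b">chr1\nACGTAC\nGT\n>chr2\nAAAA\n"):
    print("\t".join(str(x) for x in row))
```

With the offset and line geometry known, a tool can seek directly to any position in a sequence without reading the whole file.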
8. Sequence processing - Reference File Preprocessing
• Step 3: Make sequence dictionary
$ java -jar picard.jar CreateSequenceDictionary REFERENCE=ucsc.hg19.fasta OUTPUT=ucsc.hg19.dict
Dictionary: A data structure that stores data as key-value pairs.
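The .dict file that Picard writes is a SAM-style header: each @SQ line describes one reference sequence as TAG:value fields such as SN (name) and LN (length). As a rough sketch of the key-value idea, the Python below turns one such line into a dictionary. The function name is an assumption made for illustration.

```python
def parse_header_line(line):
    """Split a SAM-style header line such as '@SQ\tSN:chr1\tLN:249250621'
    into its record type and a dict of TAG -> value."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0].lstrip("@")
    # Split on the first ':' only, because values may contain ':' themselves.
    tags = dict(field.split(":", 1) for field in fields[1:])
    return record_type, tags


record_type, tags = parse_header_line("@SQ\tSN:chr1\tLN:249250621\n")
print(record_type, tags["SN"], int(tags["LN"]))
```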
9. Sequence processing – Map to Reference
• Step 4. FASTQ to SAM
FASTQ: FASTA + quality values
SAM: Sequence Alignment/Map
Alignment: Linking a DNA sequence to a chromosome and position (= mapping)
$ bwa mem -R "@RG\tID:test\tSM:NA12878\tPL:ILLUMINA" ucsc.hg19.fasta NA12878-12p-11_S11_L001_R1_001.fastq.gz NA12878-12p-11_S11_L001_R2_001.fastq.gz > NA12878.mapped.sam
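Each alignment line in a SAM file has 11 mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL), followed by optional tags such as the read group. The following sketch parses the mandatory fields and tests one bit of the FLAG. The record shown is a made-up example, not output from this data set.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)


def parse_sam_line(line):
    """Parse the 11 mandatory tab-separated fields of one SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def is_reverse_strand(record):
    """FLAG bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


# Hypothetical alignment line with a read-group tag at the end.
line = "read1\t99\tchr12\t1000\t60\t4M\t=\t1200\t204\tACGT\tIIII\tRG:Z:test"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, is_reverse_strand(rec))
```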
10. Sequence processing – Map to Reference
• Step 5. SAM to BAM
SAM files are very large, so they are converted into BAM files, a compressed binary form.
$ samtools view -Sb NA12878.mapped.sam > NA12878.mapped.bam
11. Sequence processing – Map to Reference
• Step 6. Make Sorted BAM
The reads in the BAM file are not ordered by position, so the file must be sorted. Because the
file is large, an index file is also made.
$ samtools sort -o NA12878.mapped.sorted.bam NA12878.mapped.bam
$ samtools view NA12878.mapped.bam | head
$ samtools view NA12878.mapped.sorted.bam | head
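By default, samtools sort orders reads by coordinate: first by the order of the reference sequences in the header, then by position. A minimal sketch of that ordering on plain tuples is given below. The record layout and function name are assumptions, and unmapped reads are ignored.

```python
def coordinate_sort(records, reference_order):
    """Sort (rname, pos, qname) tuples by reference order, then by position.
    reference_order lists sequence names in the order of the @SQ header lines."""
    rank = {name: i for i, name in enumerate(reference_order)}
    return sorted(records, key=lambda r: (rank[r[0]], r[1]))


# Toy records in the arbitrary order an aligner might emit them.
unsorted_reads = [("chr2", 50, "a"), ("chr1", 300, "b"), ("chr1", 100, "c")]
for read in coordinate_sort(unsorted_reads, ["chr1", "chr2"]):
    print(read)
```

Comparing the two `samtools view ... | head` outputs shows the same effect on the real files: positions increase steadily in the sorted BAM.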
12. Sequence processing – Mark Duplicate
• Step 7. Sorted BAM to Markdup BAM
$ java -jar picard.jar MarkDuplicates I=NA12878.mapped.sorted.bam O=NA12878.mapped.sorted.markdup.bam M=NA12878.markdup.metrics.txt
Duplicates are reads derived from a single original DNA fragment. During PCR, a specific
fragment can be amplified repeatedly, producing reads that carry no additional information.
MarkDuplicates flags these reads to correct for the resulting technical bias. It operates on a
single SAM or BAM file containing alignments.
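The idea behind duplicate marking can be sketched as follows: reads that map to the same reference, position, and strand are treated as copies of one fragment, the copy with the best base qualities is kept, and the rest receive the duplicate FLAG bit (0x400). This is a deliberately simplified illustration, not Picard's actual algorithm. The dictionary keys and the `qual_sum` field are assumptions.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def mark_duplicates(reads):
    """reads: list of dicts with keys rname, pos, reverse, flag, qual_sum.
    Returns new dicts; within each group sharing (rname, pos, reverse),
    every read except the one with the highest qual_sum gets the duplicate bit."""
    best = {}  # group key -> index of the best read seen so far
    for i, r in enumerate(reads):
        key = (r["rname"], r["pos"], r["reverse"])
        if key not in best or r["qual_sum"] > reads[best[key]]["qual_sum"]:
            best[key] = i
    marked = []
    for i, r in enumerate(reads):
        key = (r["rname"], r["pos"], r["reverse"])
        flag = r["flag"] if best[key] == i else r["flag"] | DUPLICATE_FLAG
        marked.append({**r, "flag": flag})
    return marked


# Toy input: the first two reads share a position, so one becomes a duplicate.
reads = [
    {"rname": "chr1", "pos": 100, "reverse": False, "flag": 0, "qual_sum": 300},
    {"rname": "chr1", "pos": 100, "reverse": False, "flag": 0, "qual_sum": 350},
    {"rname": "chr1", "pos": 200, "reverse": False, "flag": 0, "qual_sum": 100},
]
print([r["flag"] for r in mark_duplicates(reads)])
```

Note that the reads are flagged rather than deleted, so later tools can decide whether to skip them.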
13. Sequence processing – Mark Duplicate
• Step 8. Make BAM index
$ samtools index NA12878.mapped.sorted.markdup.bam
14. Sequence processing – GATK
• Step 9. Download known SNP_db
$ wget https://www.dropbox.com/sh/byjfpgs9uh44vtr/AAAtF5HOvbTEiUCSRssCxAYRa?dl=0 --content-disposition --no-check-certificate
SNP (single nucleotide polymorphism): A position in the genome where two or more alleles each occur
at a frequency of 1% or more in a population.
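The 1% criterion in this definition can be checked with simple counting. The sketch below computes allele frequencies at one site and tests whether at least two alleles reach the threshold. The sample alleles are invented for illustration and the function names are assumptions.

```python
from collections import Counter


def allele_frequencies(alleles):
    """Return {allele: frequency} for a list of observed alleles at one site."""
    counts = Counter(alleles)
    total = len(alleles)
    return {a: c / total for a, c in counts.items()}


def is_common_polymorphism(alleles, threshold=0.01):
    """True when at least two alleles each reach the frequency threshold."""
    freqs = allele_frequencies(alleles)
    return sum(f >= threshold for f in freqs.values()) >= 2


# Invented sample: 97 chromosomes carry A and 3 carry G at this site.
site = ["A"] * 97 + ["G"] * 3
print(allele_frequencies(site), is_common_polymorphism(site))
```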
15. Sequence processing – GATK
• Step 10. GATK BaseRecalibrator
$ java -jar gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar BaseRecalibrator -I NA12878.mapped.sorted.markdup.bam -R ucsc.hg19.fasta --known-sites dbsnp_138.hg19.vcf --known-sites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf -O NA12878.recal_data.table
$ java -jar gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar ApplyBQSR -R ucsc.hg19.fasta -I NA12878.mapped.sorted.markdup.bam --bqsr-recal-file NA12878.recal_data.table -O NA12878.mapped.sorted.markdup.recal.bam
Genome data sets are very large, so even at Q20 (a 1% error rate per base) the number of
erroneous base calls can reach hundreds of millions. Each base quality score is therefore
recalibrated to give a more accurate estimate.
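The arithmetic behind this can be made concrete. A Phred score Q corresponds to an error probability of 10^(-Q/10), and recalibration compares the reported score with an empirical one derived from observed mismatches. The sketch below shows both calculations. The smoothing in `empirical_quality` is a simplifying assumption for illustration and is not claimed to match GATK's implementation. The base count is a hypothetical figure.

```python
import math


def expected_errors(n_bases, q):
    """Expected number of wrong base calls among n_bases at Phred score q."""
    return n_bases * 10 ** (-q / 10)


def empirical_quality(mismatches, observations):
    """Phred-scaled empirical error rate; +1/+2 smoothing avoids log(0).
    A simplified stand-in for what a recalibration table tracks per bin."""
    rate = (mismatches + 1) / (observations + 2)
    return -10 * math.log10(rate)


# Hypothetical data set: 90 billion sequenced bases, all reported as Q20.
print(expected_errors(90_000_000_000, 20))
# Bases reported as Q20 that mismatch the reference far less often than 1%:
print(empirical_quality(mismatches=9, observations=9998))
```

When the empirical score for a group of bases differs from the reported one, the recalibrated BAM carries the adjusted value instead.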
16. Sequence processing – GATK
• Step 11. GATK HaplotypeCaller(variant calling; BAM to VCF)
$ java -jar gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar HaplotypeCaller -R ucsc.hg19.fasta -I NA12878.mapped.sorted.markdup.recal.bam -O NA12878.g.vcg -ERC GVCF
$ wget https://www.dropbox.com/s/xc0coovcfftp3jc/NA12878.g.vcg?dl=0 --content-disposition --no-check-certificate
This tool takes the aligned BAM file as input and is typically used for variant calling.
It finds SNPs and InDels (insertions/deletions) as candidate variants using local
de novo assembly of haplotypes. Its accuracy is good, but it is slow.
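In the resulting VCF, the REF and ALT columns show whether a record is a SNP or an InDel. The sketch below reads the first five columns of a VCF data line and classifies each ALT allele by comparing lengths. Symbolic alleles such as <NON_REF>, which appear in GVCF output, are reported separately. The example line is invented and the function names are assumptions.

```python
def classify_variant(ref, alt):
    """Classify one REF/ALT allele pair from a VCF record."""
    if alt.startswith("<"):
        return "symbolic"  # e.g. <NON_REF> in GVCF output
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "MNP"  # same length > 1: multi-nucleotide substitution


def parse_vcf_line(line):
    """Return (chrom, pos, ref, [alts]) from one tab-separated VCF data line."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return chrom, int(pos), ref, alt.split(",")


# Invented GVCF-style record with one real ALT allele and <NON_REF>.
chrom, pos, ref, alts = parse_vcf_line("chr12\t100\t.\tA\tATG,<NON_REF>\t50\t.\t.\n")
print(chrom, pos, [classify_variant(ref, a) for a in alts])
```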
18. Sequence processing – Annotation
• Step 13. Annotate Variants
$ java -Xmx4g -jar snpEff/snpEff.jar -v hg19 NA12878.Filtered.Variants.vcf > NA12878.Filtered.Variants.Annotated.vcf
$ java -Xmx4g -jar snpEff/SnpSift.jar annotate All_20150605.vcf.gz NA12878.Filtered.Variants.Annotated.vcf > NA12878.Filtered.Variants.Annotated.dbsnp.vcf
$ java -Xmx4g -jar snpEff/SnpSift.jar annotate -name CLINVAR_ clinvar_20190520.vcf.gz NA12878.Filtered.Variants.Annotated.dbsnp.vcf > NA12878.Filtered.Variants.Annotated.dbsnp.CLINVAR.vcf
To pick out only the variants of interest among the filtered variants, an annotation
step is needed: each detected variant is identified against the data in existing
databases and assigned an ID.
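The ID assignment described here amounts to a lookup: each detected variant is matched against a table of known variants and picks up the database ID when it is found. A minimal sketch of that lookup is given below, keyed on chromosome, position, REF, and ALT. The coordinates and IDs are placeholders rather than real database entries, and the sketch does not claim to reproduce SnpSift's matching rules.

```python
def annotate_ids(variants, known):
    """variants: list of (chrom, pos, ref, alt) tuples.
    known: dict mapping the same kind of tuple to a database ID.
    Returns (variant, id) pairs, using '.' (the VCF missing value) when absent."""
    return [(v, known.get(v, ".")) for v in variants]


# Placeholder database and calls (not real dbSNP or ClinVar records).
known_db = {("chr1", 1000, "A", "G"): "rs_example_1"}
calls = [("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")]
for variant, variant_id in annotate_ids(calls, known_db):
    print(variant, variant_id)
```

Variants that keep the "." value are absent from the database and may be novel or simply not yet catalogued.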
20. Sequence processing – GATK
• Step 15. Take a look at the snpEff_gene.txt with SNPedia
Lupus
Systemic lupus erythematosus (SLE) is a complex autoimmune
disease (Wikipedia). The most studied genetic contributions to
SLE involve the major histocompatibility complex (MHC) region
on chromosome 6, which contains over 100 genes involved in
immune system function.
In the MHC, one allele in the class II region, and one SNP in the
class III region, have been associated with risk of
developing lupus. [PMID 17997607]
• The HLA-DRB1*0301 allele from the class II region (see rs2187668)
• rs419788 in the intron of the class III SKIV2L gene