Assignment
GATK analysis pipeline
Department: Horticultural Life Science and Landscape Architecture
GATK Best Practice?
GATK Best Practices are standardized guidelines, provided by the GATK team, that cover the entire
process of detecting variants from the reads produced by sequencing sample DNA.
The GATK forum documents the detailed process step by step.
Sequence pre-processing
• First, check the quality of the raw data using the FastQC tool
• FastQC performs quality assessment on raw sequence data coming
from high-throughput sequencing pipelines.
• Check the quality of the raw data to see if there are any biased sequences
or systematic problems.
fastQC result of forward read
• Sequence length ranges from 35 to 76 bp, with 48% GC content
• The quality-score encoding is Illumina 1.9
• Sequence quality ranges from Q30 to Q40
• The reads have no adapter content
fastQC result of forward read
• The reads have no adapter content
• The reads have no “N” base content
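To make the Q30 to Q40 range reported by FastQC concrete, a Phred score Q corresponds to an error probability of 10^(-Q/10). A minimal sketch of that conversion follows; the function name phred_to_error is only illustrative and is not part of FastQC.

```shell
# Convert a Phred quality score Q into the probability that the
# base call is wrong: p = 10^(-Q/10).
# phred_to_error is a hypothetical helper name, not a FastQC command.
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_to_error 30   # 0.001  (1 error in 1,000 bases)
phred_to_error 40   # 0.0001 (1 error in 10,000 bases)
```

Under this definition, reads in the Q30 to Q40 range are expected to contain between one miscalled base in 1,000 and one in 10,000.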
Sequence processing - Reference File Preprocessing
• Step 1: Download raw data & make BWA index file
<- Download raw data
$ wget "https://www.dropbox.com/sh/ql8d17tuk857269/AADj9NTXslE8Ke8He899vPU8a?dl=0" --content-disposition --no-check-certificate
<- Download Reference index
$ wget "https://www.dropbox.com/sh/z6jqq4o29znv1xe/AABwmY0COapYuDZcMUJBn5ZKa?dl=0" --content-disposition --no-check-certificate
Sequence processing - Reference File Preprocessing
• Step 2: Make FASTA Index file
FASTA: The most basic text format for representing
nucleotide sequences; here it holds the reference
genome that samtools faidx indexes
$ samtools faidx ucsc.hg19.fasta
Sequence processing - Reference File Preprocessing
• Step 3: Make sequence dictionary
$ java -jar picard.jar CreateSequenceDictionary REFERENCE=ucsc.hg19.fasta OUTPUT=ucsc.hg19.dict
Dictionary: A data structure that stores data as key-value pairs.
The sequence dictionary (.dict) records the name and length of
each sequence in the reference FASTA.
Sequence processing – Map to Reference
• Step 4. FASTQ to SAM
FASTQ: FASTA + quality values
SAM: Sequence Alignment/Map
Alignment: Assign each read to a
chromosome and position on the reference
(= mapping)
$ bwa mem -R "@RG\tID:test\tSM:NA12878\tPL:ILLUMINA" ucsc.hg19.fasta \
    NA12878-12p-11_S11_L001_R1_001.fastq.gz NA12878-12p-11_S11_L001_R2_001.fastq.gz \
    > NA12878.mapped.sam
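The read group passed to bwa mem with -R is a single @RG header line whose fields are separated by the two-character sequence \t, which bwa itself converts to tabs. A small sketch of building that string safely is shown here; make_read_group is a hypothetical helper, and the values are the ones used in Step 4.

```shell
# Build the @RG line for "bwa mem -R".
# The backslashes must reach bwa unchanged, so the printf format
# uses \\t to emit a literal backslash followed by "t".
# make_read_group is a hypothetical helper name.
make_read_group() {
    # $1 = read group ID, $2 = sample name, $3 = platform
    printf '@RG\\tID:%s\\tSM:%s\\tPL:%s\n' "$1" "$2" "$3"
}

RG=$(make_read_group test NA12878 ILLUMINA)
# printf '%s' is used instead of echo, because some shells' echo
# would expand \t into a real tab.
printf '%s\n' "$RG"
```

The resulting value can then be passed as -R "$RG" in place of the literal string.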
Sequence processing – Map to Reference
• Step 5. SAM to BAM
SAM files are very large text files, so
they are converted to BAM, the compressed
binary form of SAM.
$ samtools view -Sb NA12878.mapped.sam > NA12878.mapped.bam
Sequence processing – Map to Reference
• Step 6. Make Sorted BAM
Reads in the BAM file are in no particular genomic order, so the file is sorted by coordinate.
Because the file is large, an index is built on the sorted file (Step 8) for fast access.
$ samtools sort -o NA12878.mapped.sorted.bam NA12878.mapped.bam
$ samtools view NA12878.mapped.bam | head
$ samtools view NA12878.mapped.sorted.bam | head
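Comparing the two head outputs by eye only covers the first few reads. As a rough sketch, the check can be automated on the text records: within each run of the same reference name (column 3), the position (column 4) should never decrease. The helper name is_coordinate_sorted is hypothetical, and the check is simplified in that it does not verify the order of the reference sequences themselves.

```shell
# Read SAM records on stdin and report whether positions never
# decrease within each run of the same reference name.
# Simplified sketch: the reference order in the header is not checked.
is_coordinate_sorted() {
    awk -F '\t' '
        /^@/ { next }                                  # skip header lines
        $3 == prev_ref && $4 + 0 < prev_pos { bad = 1; exit }
        { prev_ref = $3; prev_pos = $4 + 0 }
        END { print (bad ? "unsorted" : "sorted") }
    '
}

# Tiny made-up records (name, flag, reference, position) for illustration.
printf 'r1\t0\tchr1\t100\nr2\t0\tchr1\t50\n'  | is_coordinate_sorted
printf 'r2\t0\tchr1\t50\nr1\t0\tchr1\t100\n'  | is_coordinate_sorted
```

With samtools installed, the same function can be fed from samtools view NA12878.mapped.sorted.bam through a pipe.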
Sequence processing – Mark Duplicate
• Step 7. Sorted BAM to Markdup BAM
$ java -jar picard.jar MarkDuplicates I=NA12878.mapped.sorted.bam O=NA12878.mapped.sorted.markdup.bam \
    M=NA12878.markdup.metrics.txt
Duplicates are reads derived from the same
original DNA fragment. During PCR, a specific
fragment can be amplified repeatedly, generating
reads that carry no new information.
MarkDuplicates flags such reads so that the
resulting technical bias can be accounted for
in later steps. It operates on a single aligned
SAM or BAM file.
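Marked duplicates are recorded in the FLAG column of each record: the SAM format reserves bit 0x400 (decimal 1024) for "PCR or optical duplicate". A minimal sketch of testing that bit is given below; is_duplicate_flag is a hypothetical helper and the two flag values are made-up examples.

```shell
# SAM FLAG is a bit field; bit 0x400 (decimal 1024) marks a
# PCR or optical duplicate.
# is_duplicate_flag is a hypothetical helper name.
is_duplicate_flag() {
    [ $(( $1 & 1024 )) -ne 0 ]
}

# 99 = properly paired first mate; 1123 = the same flags plus 1024.
for flag in 99 1123; do
    if is_duplicate_flag "$flag"; then
        echo "$flag: duplicate"
    else
        echo "$flag: not duplicate"
    fi
done
```

With samtools installed, samtools view -c -f 1024 NA12878.mapped.sorted.markdup.bam counts the reads that carry this bit.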
Sequence processing – Mark Duplicate
• Step 8. Make BAM index
$ samtools index NA12878.mapped.sorted.markdup.bam
Sequence processing – GATK
• Step 9. Download known SNP_db
$ wget "https://www.dropbox.com/sh/byjfpgs9uh44vtr/AAAtF5HOvbTEiUCSRssCxAYRa?dl=0" --content-disposition --no-check-certificate
SNP: A single-nucleotide polymorphism (SNP) is a position in the genome where two or more alleles
each occur at a frequency of 1% or more in a population.
Sequence processing – GATK
• Step 10. GATK BaseRecalibrator
$ java -jar gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar BaseRecalibrator \
    -I NA12878.mapped.sorted.markdup.bam -R ucsc.hg19.fasta \
    --known-sites dbsnp_138.hg19.vcf \
    --known-sites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
    -O NA12878.recal_data.table
$ java -jar gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar ApplyBQSR \
    -R ucsc.hg19.fasta -I NA12878.mapped.sorted.markdup.bam \
    --bqsr-recal-file NA12878.recal_data.table \
    -O NA12878.mapped.sorted.markdup.recal.bam
Genome-scale data sets contain so many bases
that the number of miscalled bases can reach
hundreds of millions even when FastQC reports
Q20 (an expected error rate of 1 in 100).
Therefore, we recalibrate each base quality
score so that it reflects the error rate
more accurately.
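The scale of the problem can be checked with simple arithmetic: if every base had Phred quality Q, the expected number of miscalled bases among N bases is N × 10^(-Q/10). The sketch below assumes a hypothetical data set of 3 × 10^10 sequenced bases (roughly 10x coverage of a 3 Gb genome); that figure and the function name expected_errors are illustrative only.

```shell
# Expected number of miscalled bases among n bases that all have
# Phred quality q: n * 10^(-q/10).
# expected_errors is a hypothetical helper name.
expected_errors() {
    awk -v n="$1" -v q="$2" 'BEGIN { printf "%.0f\n", n * 10 ^ (-q / 10) }'
}

# Hypothetical data set: 3e10 bases, all at Q20.
expected_errors 30000000000 20   # 300000000
```

Under these assumptions Q20 data would still contain about 300 million miscalled bases, which is why accurate per-base quality scores matter for variant calling.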
Sequence processing – GATK
• Step 11. GATK HaplotypeCaller(variant calling; BAM to VCF)
$ java -jar gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar HaplotypeCaller -R ucsc.hg19.fasta \
    -I NA12878.mapped.sorted.markdup.recal.bam -O NA12878.g.vcg -ERC GVCF
$ wget "https://www.dropbox.com/s/xc0coovcfftp3jc/NA12878.g.vcg?dl=0" --content-disposition --no-check-certificate
HaplotypeCaller takes the aligned BAM
file as input and is the tool typically used
for variant calling. It finds SNPs and
InDels (insertions/deletions) by locally
reassembling reads de novo in regions that
show potential variation. It is accurate
but slow.
Sequence processing – GATK
• Step 12. GATK Variant Filter
# Select SNP
# Filter SNP
# Select INDEL
# Filter INDEL
# Combine SNPs and INDELs
$ java -jar gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar MergeVcfs -I NA12878.rawSNPs.Filtered.vcf \
    -I NA12878.rawINDELs.Filtered.vcf -O NA12878.Filtered.Variants.vcf
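The slide lists the Select and Filter steps only as comments. One way they might look with the GATK4 tools SelectVariants and VariantFiltration is sketched below. Everything not in the slides is an assumption: the input name NA12878.raw.vcf (a genotyped VCF, for example produced from the gVCF with GenotypeGVCFs), the intermediate file names, the filter names, and the thresholds, which are illustrative hard-filter values rather than the ones used in this assignment. The script only prints the commands unless DRY_RUN is set to an empty value.

```shell
# Sketch of the Select/Filter steps. Assumed, not from the slides:
# RAW, the intermediate file names, filter names, and thresholds.
GATK_JAR=gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar
REF=ucsc.hg19.fasta
RAW=NA12878.raw.vcf          # hypothetical genotyped VCF
DRY_RUN=${DRY_RUN-1}         # default: print commands only

# Print a command; execute it only when DRY_RUN is empty.
run() {
    echo "+ $*"
    [ -n "$DRY_RUN" ] || "$@"
}

select_and_filter() {
    # $1 = variant type (SNP or INDEL), $2 = filter expression
    run java -jar "$GATK_JAR" SelectVariants -R "$REF" -V "$RAW" \
        --select-type-to-include "$1" -O "NA12878.raw${1}s.vcf"
    run java -jar "$GATK_JAR" VariantFiltration -R "$REF" \
        -V "NA12878.raw${1}s.vcf" \
        --filter-expression "$2" --filter-name "hard_filter_$1" \
        -O "NA12878.raw${1}s.Filtered.vcf"
}

select_and_filter SNP   'QD < 2.0 || FS > 60.0 || MQ < 40.0'
select_and_filter INDEL 'QD < 2.0 || FS > 200.0'
```

The output names NA12878.rawSNPs.Filtered.vcf and NA12878.rawINDELs.Filtered.vcf are chosen to match the inputs of the MergeVcfs command in Step 12.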
Sequence processing – Annotation
• Step 13. Annotate Variants
$ java -Xmx4g -jar snpEff/snpEff.jar -v hg19 NA12878.Filtered.Variants.vcf \
    > NA12878.Filtered.Variants.Annotated.vcf
$ java -Xmx4g -jar snpEff/SnpSift.jar annotate All_20150605.vcf.gz NA12878.Filtered.Variants.Annotated.vcf \
    > NA12878.Filtered.Variants.Annotated.dbsnp.vcf
$ java -Xmx4g -jar snpEff/SnpSift.jar annotate -name CLINVAR_ clinvar_20190520.vcf.gz \
    NA12878.Filtered.Variants.Annotated.dbsnp.vcf > NA12878.Filtered.Variants.Annotated.dbsnp.CLINVAR.vcf
To pick out the variants of interest from
the filtered variants, an annotation step is
needed: each detected variant is matched
against existing databases and assigned
its known ID.
Sequence processing – GATK
• Step 14. Take a look at the snpEff_summary.html
Sequence processing – GATK
• Step 15. Take a look at the snpEff_gene.txt with SNPedia
Lupus
Systemic lupus erythematosus (SLE) is a complex autoimmune
disease. The most studied genetic contributions to
SLE involve the major histocompatibility complex (MHC) region
on chromosome 6, which contains over 100 genes involved in
immune system function.
In the MHC, one allele in the class II region, and one SNP in the
class III region, have been associated with risk of
developing lupus. [PMID 17997607]
• The HLA-DRB1*0301 allele from the class II region
(see rs2187668)
• rs419788 in the intron of the class III SKIV2L gene
Thanks
