The ENCODE DCC 
https://www.encodeproject.org/ 
Eurie L. Hong, Ph.D. • Project Manager, ENCODE DCC 
PI: J. Michael Cherry, Ph.D. 
Department of Genetics • Stanford University School of Medicine
What is the ENCODE Consortium? 
Image credit: NHGRI
Role of the Data Coordination Center
Production labs and analysis groups
Role: Data generation
Tasks:
• Perform assays
• Perform analyses
• Validate data
• Submit data files
• Submit metadata
Submitted to the DCC: data files, metadata
DCC
Role: Data organization
Tasks:
• Define submission process
• Data processing & validation
• Data file storage
• Metadata curation
ENCODE portal (DCC), Genome Browser, integrative websites
Role: Data access
Tasks:
• Web-based searches
• Data downloads
Accessed by: Scientific community
DCC goals for implementation 
Transparency of methods 
• How was the experiment performed? 
• What software was used to analyze the data? 
Reproducibility of results 
• What files were used? 
• What software and parameters were used for the pipelines? 
Interoperability with other genomic projects 
• Can the pipeline software we use be used by other projects? 
• Can the metadata allow easy integration with other data?
Data volume: diversity of assays 
Modified from PLoS Biol 9:e1001046, 2011
(M. Pazin)
Approximately 30 different assays
Data volume: number of assays 
(includes mouse & human, from https://www.encodeproject.org/, 10/12/2014)
Transparency & reproducibility:
Capture the experimental design
Experiment
  Biological replicate 1
    Technical replicate 1
      Raw data file (fastq) → Software & pipelines → Processed file (bam)
  Biological replicate 2
    Technical replicate 1
      Raw data file (fastq) → Software & pipelines → Processed file (bam)
Control experiment
  Biological replicate 1
    Technical replicate 1
      Raw data file (fastq) → Software & pipelines → Processed file (bam)
Processed files (bam) → Software & pipelines → Processed file (peak calls)
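The experimental design above can be sketched as a small data model. This is only an illustration, not the DCC's actual metadata schema: the class names, field names, file names, and pipeline labels are all hypothetical. It shows how recording which files each processed file was derived from lets a peak-call file be traced back to the raw fastq files.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class DataFile:
    name: str
    file_format: str                  # e.g. "fastq", "bam", "peak calls"
    software: str = ""                # pipeline that produced the file (hypothetical label)
    derived_from: List["DataFile"] = field(default_factory=list)


@dataclass
class Replicate:
    biological: int
    technical: int
    files: List[DataFile] = field(default_factory=list)


@dataclass
class Experiment:
    name: str
    replicates: List[Replicate] = field(default_factory=list)


def raw_inputs(f: DataFile) -> Set[str]:
    """Walk derived_from links back to the files that have no parents."""
    if not f.derived_from:
        return {f.name}
    names: Set[str] = set()
    for parent in f.derived_from:
        names |= raw_inputs(parent)
    return names


def build_example() -> Tuple[Experiment, Experiment, DataFile]:
    """Build the design from the slide: two replicates, one control, one peak file."""
    def replicate(label: str, bio: int) -> Replicate:
        fastq = DataFile(f"{label}.fastq", "fastq")
        bam = DataFile(f"{label}.bam", "bam", "mapping pipeline", [fastq])
        return Replicate(bio, 1, [fastq, bam])

    experiment = Experiment("experiment", [replicate("rep1", 1), replicate("rep2", 2)])
    control = Experiment("control experiment", [replicate("control", 1)])
    bams = [r.files[1] for e in (experiment, control) for r in e.replicates]
    peaks = DataFile("peaks", "peak calls", "peak-calling pipeline", bams)
    return experiment, control, peaks
```

Calling `raw_inputs` on the peak-call file from `build_example()` returns the three fastq file names, which answers the "What files were used?" question for that result.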
Data interoperability: 
uniform processing pipelines 
(includes mouse & human, from https://www.encodeproject.org/, 10/12/2014)
Processing of TF ChIP-seq assays
Input: FASTQ (SE/PE) for replicates and controls
• Map Reads, Filter → BAM (replicates, controls)
• Pool → BAM (pooled reps)
• Subsample → pseudoreplicates (2 per replicate, 2 per pool)
• Signal Tracks → bigWig (replicates, pooled replicates)
• Call Peaks → peak (replicates, pseudoreplicates, pools)
• IDR → IDR-thresholded peak calls
Specification document (Anshul Kundaje):
https://docs.google.com/document/d/1lG_Rd7fnYgRpSIqrIfuVlAz2dW1VaSQThzk836Db99c/edit?usp=sharing
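The Pool and Subsample steps can be sketched as follows. This is a minimal illustration and not the pipeline's code: it assumes that pseudoreplicates are made by randomly shuffling the reads and splitting them into two halves, which the slide does not spell out, and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def pool(*replicates: Sequence[T]) -> List[T]:
    """Concatenate the reads of all replicates into one pooled set."""
    pooled: List[T] = []
    for rep in replicates:
        pooled.extend(rep)
    return pooled


def make_pseudoreplicates(reads: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle reads (assumed strategy) and split them into two pseudoreplicates."""
    shuffled = list(reads)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]
```

Applied to each replicate and to the pooled reads, this yields the "2 pseudoreplicates per replicate" and "2 pseudoreplicates per pool" shown on the slide. Fixing the seed keeps the split reproducible between runs.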
Relative CPU time for ChIP-seq (original)
[Chart: relative CPU time per step (Map, Signal Tracks, Subsample, Call Peaks, IDR)]
Relative CPU time per step for a typical transcription factor ChIP-seq experiment
IDR can take much longer if there are many regions, as in a typical histone ChIP
Nikhil Podduturi
Data volume: TF ChIP-seq
(includes mouse & human)
Performance Comparison:
IDR analysis CPU (re-engineered) vs GPU
[Bar chart: clock time in seconds, log10 scale (1 to 10000)]
CPU: 60 min
NVIDIA GPU: 30 sec
~120x Speed Increase
Nikhil Podduturi
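The ~120x figure follows directly from the two clock times on the slide. A short worked check (the function name is just for illustration):

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Ratio of the baseline clock time to the accelerated clock time."""
    return baseline_seconds / accelerated_seconds


cpu_seconds = 60 * 60  # 60 min on the CPU
gpu_seconds = 30       # 30 sec on the GPU
print(speedup(cpu_seconds, gpu_seconds))  # 120.0
```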
Impact on use for data processing 
Re-engineered 
• improved stability 
• tests! 
• ability to run on CPU or GPU 
Faster processing 
• recalculation of entire data corpus against new genome build 
• allow determination of data-based thresholds and cut-offs 
Public availability 
• Can be run on GPU instances available at AWS 
• GPU implementation of IDR: https://github.com/ENCODE-DCC/idr-GPU 
• TF ChIP-seq: https://github.com/ENCODE-DCC/tf_chipseq 
• Others available: https://github.com/ENCODE-DCC
Next Steps 
Data validation 
• GPU vs CPU results 
Pipeline release 
• Integration into ChIP-seq pipeline 
• Deployment via AWS instances and at DNAnexus 
Adapt additional software components 
• SPP: https://github.com/nikhilRP/spp-GPU 
• Hotspots: https://github.com/nikhilRP/hotspot-GPU
ENCODE DCC
Eurie Hong, Mike Cherry (PI), Jim Kent (co-PI), Ben Hitz
Data wranglers: Esther Chan, Jean Davidson, Venkat Malladi, Cricket Sloan, Seth Strattan
Software engineers: Nikhil Podduturi, Laurence Rowe, Forrest Tanaka
QA, sysadmins, admin, biocurator assistant: Brian Lee, Stuart Miyasato, Matt Simison, Zhenhua Wang, Marcus Ho
@encodedcc encode-help@lists.stanford.edu 
https://github.com/ENCODE-DCC/ 
The ENCODE DCC is funded by NHGRI Grant U41HG006992
ENCODE Uniform Processing Pipeline Work 
Ben Hitz, Seth Strattan, Nikhil Podduturi 
ChIP-seq against transcription factors: Anshul Kundaje 
ChIP-seq against histone marks: Anshul Kundaje 
RNA-seq: ENCODE RNA working group 
Whole genome bisulfite sequencing: Junko Tsuji, Zhiping Weng
DNase-seq: Alvin Qin, Shirley Liu
DNAnexus (PaaS): Brett Hannigan, Andrey Kislyuk, Mike Lin, Singer Ma, Ohad Rodeh
NVIDIA Corporation: NVIDIA Academic Hardware Donation Program (donation of two Kepler K40 GPUs); NVIDIA’s NVBIO framework

Implementation of GPU-based bioinformatic tools at the ENCODE DCC


Editor's Notes

  1. Can you identify a common data model? Can you select a set of metadata that can be