Conceptualizing And Prototyping A Scalable
Genomic Data Analysis Pipeline:
Using Project Glow And Apache Spark On Top Of AWS Databricks For
Analysing The Effect Of Each Gene Over A Period Of Time.
© Shadab Ali Khan
OUTLINE
• Problem Statement
• The Human Genome Project
• Human genome sequencing cost
• Next-generation human genome projects
• Human genomic data is a big data problem
• Existing bioinformatics tools
• Moore’s Law
• Distributed computing
• Cloud computing
• Apache Spark
• Amazon Web Services
• Databricks
• Project Glow
Problem Statement
Why is a new scalable genomic data analysis pipeline needed, one that supports time travel over historical genomic data, for studying the effect of each gene in causing a particular disease?
The Human Genome Project
• Proposed in 1987 by the US DOE
• Biology’s “Manhattan Project”
• Officially started in 1990
• Joint effort of the NIH and the DOE in the US
• Goals:
  • to sequence the 3 billion nucleotide base pairs in the human genome
  • to map and identify all the human genes present in the DNA sequence
  • at a cost of $1/base by 2005
• Working draft published in 2001 (project declared complete in 2003), with a reported cost of $1 per 700 bases
Human Genome Sequencing Cost
• The cost of sequencing a human genome in 2001 was about $100,000,000
• Until 2007 the cost fell at a steady rate, roughly in line with Moore’s Law
• Sudden and profound outpacing of Moore's Law beginning in January 2008
  • driven by the transition from Sanger to next-generation sequencing
• The cost of sequencing a genome today is about $1,000
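To put the comparison with Moore's Law in numbers, the sketch below projects what a genome would cost if the price had only halved at a fixed interval. The two-year halving period is an assumed reading of Moore's Law, and 2021 is an arbitrary comparison year; only the $100,000,000 starting figure comes from the slide.

```python
def moore_projection(start_cost, start_year, year, halving_years=2.0):
    """Cost in `year` if it halved every `halving_years` years from `start_year`."""
    elapsed = year - start_year
    return start_cost / (2 ** (elapsed / halving_years))


COST_2001 = 100_000_000  # figure quoted on the slide

if __name__ == "__main__":
    # 2007 is where the slide says the trend broke; 2021 is an arbitrary later year.
    for year in (2001, 2007, 2021):
        print(year, round(moore_projection(COST_2001, 2001, year)))
```

Under these assumptions a Moore's Law pace alone would still leave the price near $100,000 twenty years on, roughly a hundred times the $1,000 figure quoted above.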
Next Generation Human Genome Projects
• Next-generation sequencing technology
  • Exponential drop in the cost of human genome sequencing
• Start of next-gen human genome projects at biobank scale
• World’s largest human genome project, by the Regeneron Genetics Center
• GenomeAsia 100K
  • non-profit consortium
  • on a mission to sequence and analyze the genomes of 100,000 Asian individuals
• UK Biobank
  • a large-scale biomedical database of half a million UK participants
Next-Gen Human Genomic Data is a Big Data Problem
Existing Bioinformatics Tools
• Various bioinformatics tools are available
  • GATK (Genome Analysis Toolkit)
  • VCFtools
  • tabix
  • SnpSift
  • PLINK
  • awk
• Mastering even one of these tools takes effort
  • learning all the command-line options and sub-options
  • learning all the input and output file formats
  • it gets complex quickly
  • chances are that some functionality or file format is supported only by another tool
• Interoperability problem
• Most tools run on a single node
  • Each node has limited resources
  • Big genomic data cannot be stored and processed on a single node
Moore’s Law
Distributed Computing
• Tipping point of Moore’s Law
  • The performance of a single processor core can no longer be increased at the historical rate
  • Multiple smaller cores are coupled together to form multi-core processors
• Petabytes of genomic data cannot fit on a single server
• A new computing paradigm is required
• Distributed computing
  • Storage and computation capacity are distributed across multiple cheaper servers
  • The complete data set is split into chunks; each chunk is stored on a different node and processed there
  • Can be scaled cheaply in the cloud
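The split, process locally, then combine pattern described above can be sketched on one machine. This is a toy illustration rather than a real cluster: threads stand in for servers, and the variant records, function names, and node count are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Deal records round-robin into n_chunks lists, one per simulated node."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def count_by_chromosome(chunk):
    """Work done locally on one node: count variants per chromosome."""
    return Counter(chrom for chrom, _pos in chunk)


def distributed_count(records, n_nodes=3):
    chunks = split_into_chunks(records, n_nodes)
    # Threads stand in for separate servers; each one sees only its own chunk.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(count_by_chromosome, chunks))
    # Combine the small per-node results into the final answer.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Hypothetical (chromosome, position) variant records.
variants = [
    ("chr1", 101), ("chr2", 202), ("chr1", 303),
    ("chrX", 404), ("chr2", 505), ("chr1", 606),
]

if __name__ == "__main__":
    print(distributed_count(variants))
```

Only the small per-chunk counts travel back to be merged, which is why the approach scales: the bulk of the data never has to leave the node that stores it.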
Cloud Computing
• On-demand delivery of IT resources over the Internet
  • servers, storage, databases, networking, software, analytics, and intelligence
• Instead of buying, owning, and maintaining physical data centers and servers, you access these services on an as-needed basis
• Pay only for the cloud services you use
  • lower your operating costs
  • run your infrastructure more efficiently
  • scale as your business needs change
• Benefits of cloud computing
  • Cost savings
  • Global deployment in minutes
  • Elasticity
  • Reliability
  • Security
• Providers: AWS, GCP, Azure, Alibaba Cloud, etc.
Apache Spark
• Used for large-scale data processing and machine learning
• Uses distributed computing
  • Scales horizontally to thousands of nodes
  • Works on petabytes of data held in distributed storage
  • Able to process petabytes of data, often at interactive speeds
• Runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud
  • managed solutions are available as part of Amazon EMR, Google Cloud Dataproc, and Microsoft Azure HDInsight
• Accesses data in HDFS, Amazon S3, Google BigQuery, and other sources
• Can take advantage of cloud computing
  • rapid upscaling and downscaling of the Spark cluster
  • minimizes cluster operating cost
• Provides high-level APIs in Java, Scala, Python, and R
Amazon Web Services
• AWS is Amazon’s cloud computing platform
• AWS was first launched in July 2002
• AWS services were originally exposed through a service-oriented architecture (SOA) and later moved to microservices
• EC2 (Elastic Compute Cloud)
  • virtual machines in the cloud with OS-level control
  • launched in August 2006
• S3 (Simple Storage Service)
  • object storage service for files, folders, images, documents, songs, etc.
  • cannot be used to install an OS or software
  • launched on March 14, 2006
Amazon Web Services
• EMR (Elastic MapReduce)
  • managed big data cluster platform
  • simplifies running big data frameworks
  • decouples compute and storage
  • lets each scale independently
Databricks
• Highly optimised managed Spark compute clusters
• Uses popular cloud services in the backend for decoupled storage and processing
• Can run on top of AWS, Azure, or GCP
• A reported 30% performance gain over traditional Spark on AWS
Delta Lake
• Delta Lake was originally developed by Databricks
• It improves OLAP workload performance
  • combines the transactional reliability of databases with the horizontal scalability of data lakes
• Properties of Delta Lake
  • ACID guarantees
  • Scalable data and metadata handling
  • Audit history and time travel
  • Schema enforcement and schema evolution
  • Support for deletes, updates, and merges
  • Streaming and batch unification
• Fully integrated within the Apache Spark ecosystem
  • Brings ACID transactions to Apache Spark and big data workloads
• Supports APIs in SQL, Java, Scala, Python, etc.
Delta Lake
• Data in Delta Lake is stored in Delta tables.
• Delta tables can store data in file systems like HDFS and in cloud object stores like S3.
• Delta tables are designed to be written primarily by Spark applications.
• Delta tables can be read by many open-source data engines such as Spark SQL, Hive, and Presto, and by several enterprise products such as AWS Athena, Azure Synapse, and BigQuery.
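Time travel is the Delta Lake feature the problem statement leans on, so a minimal sketch of reading an older snapshot may help. It assumes an existing SparkSession with the Delta Lake package available; `versionAsOf` and `timestampAsOf` are Delta's reader options, while the function names and the table path in the usage note are hypothetical.

```python
def time_travel_options(version=None, timestamp=None):
    """Build the Delta reader option that pins a read to one table snapshot."""
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return {"versionAsOf": version}
    return {"timestampAsOf": timestamp}


def read_snapshot(spark, table_path, version=None, timestamp=None):
    """Load a Delta table as it existed at the given version or timestamp.

    `spark` is assumed to be a SparkSession configured for Delta Lake.
    """
    reader = spark.read.format("delta")
    for key, value in time_travel_options(version, timestamp).items():
        reader = reader.option(key, value)
    return reader.load(table_path)
```

For example, `read_snapshot(spark, "s3://example-bucket/variants_delta", version=0)` would return the table as first written, which can then be compared against the current contents to see how annotations for a gene changed over time.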
Project Glow
• Open-source toolkit for genomic analysis at biobank scale
• Built natively on Apache Spark
• Developed by Databricks in collaboration with the Regeneron Genetics Center
• Designed to be compatible with existing bioinformatics tools
• Works with common genomic file formats
  • .fasta, .fastq, .sam, .bam, .vcf, .gff, .bgen, etc.
• Data in the above file formats can be loaded into Spark DataFrames
• Provides functions for quality control and data manipulation
  • Variant normalization
• Integration with Spark ML libraries for population stratification
• Provides its API through the native Spark SQL APIs in Python, SQL, R, Java, and Scala
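Loading variants into a Spark DataFrame with Glow might look like the sketch below. It assumes the `glow` Python package and its Spark dependencies are installed on the cluster and that `spark` is an existing SparkSession; the function names are hypothetical, and `contigName` is the chromosome column in Glow's VCF schema.

```python
def load_variants(spark, vcf_path):
    """Read a VCF file into a Spark DataFrame using Glow's "vcf" data source."""
    import glow  # imported lazily so this module loads without Glow installed

    # Recent Glow releases return the registered session; fall back to the
    # original session in case a release returns nothing.
    registered = glow.register(spark)
    session = registered if registered is not None else spark
    return session.read.format("vcf").load(vcf_path)


def variants_per_contig(df):
    """Count variant rows per chromosome with ordinary Spark DataFrame calls."""
    return df.groupBy("contigName").count()
```

Once the data is a DataFrame, the usual Spark operations apply, so a call such as `variants_per_contig(load_variants(spark, path)).show()` needs no genomics-specific command-line tool.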