Scalable Genomic Data Analysis Pipeline Using Project Glow and Apache Spark on AWS Databricks

Conceptualizing And Prototyping A Scalable
Genomic Data Analysis Pipeline:
Using Project Glow And Apache Spark On Top Of AWS Databricks For
Analysing The Effect Of Each Gene Over A Period Of Time.
© Shadab Ali Khan

OUTLINE
 Problem Statement
 The Human Genome Project
 Human genome sequencing cost
 Next generation human genome projects
 Human genomic data is a big data problem
 Existing bioinformatics tools
 Moore’s Law
 Distributed computing
 Cloud computing
 Apache Spark
 Amazon Web Services
 Databricks
 Project Glow

Problem Statement
Why is there a need for a new scalable genomic
data analysis pipeline, with time travel over
historical genomic data, for studying the effect of
each gene in causing a particular disease?

The Human Genome Project
 Proposed in 1987 by the US DOE
 Biology’s “Manhattan project”
 Officially started in 1990
 Joint effort of the NIH and DOE in the US
 Goals were
 to sequence the 3 billion nucleotide base pairs in
the human genome
 to map and identify all the human genes
present in the DNA sequence
 at a cost of $1/base by 2005
 Draft sequence published in 2001, project completed in 2003
 Final cost of $1 per 700 bases

Human Genome Sequencing Cost
 The cost of sequencing a human genome in 2001 was
$100,000,000
 The cost fell steadily until 2007
 Roughly in line with Moore’s Law (an exponential
decline, which looks linear on a logarithmic scale)
 Sudden and profound outpacing of Moore's Law
beginning in January 2008
 Driven by the transition from Sanger to
next-generation sequencing
 The cost of sequencing a genome today is $1,000
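The comparison with Moore's Law can be checked with a little arithmetic. The sketch below assumes the common reading of Moore's Law as a halving of cost every two years, starts from the $100,000,000 figure for 2001, and uses 2021 purely as an illustrative target year, since the slide does not say which year "today" refers to. The function name and parameters are hypothetical.

```python
def moore_projection(start_cost: float, start_year: int, year: int,
                     halving_period_years: float = 2.0) -> float:
    """Cost in `year` if cost halved every `halving_period_years`.

    The two-year halving period is an assumed reading of Moore's Law.
    """
    halvings = (year - start_year) / halving_period_years
    return start_cost / (2 ** halvings)


START_COST = 100_000_000  # USD, the 2001 figure from the slide
QUOTED_COST = 1_000       # USD, the "today" figure from the slide

# 2021 is an assumed example year, not a value taken from the slides.
projected = moore_projection(START_COST, 2001, 2021)
print(f"Moore's Law projection for 2021: ${projected:,.2f}")
print(f"Projection is {projected / QUOTED_COST:.1f}x the quoted cost")
```

Under these assumptions the projected cost stays far above the quoted $1,000, which is the gap the slide describes as out-pacing Moore's Law.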

Next Generation Human Genome Projects
 Next-generation sequencing technology
 Exponential drop in the cost of human genome
sequencing
 Start of next-gen human genome projects
 biobank scale
 World’s largest human genome project, by the
Regeneron Genetics Center
 GenomeAsia 100K
 non-profit consortium
 on a mission to sequence and analyze the genomes of
100,000 Asian individuals
 UK Biobank
 a large-scale biomedical database of half
a million UK participants

Next-Gen Human Genomic Data is a Big Data Problem

Existing Bioinformatics Tools
 Various bioinformatics tools are available
 GATK (Genome Analysis Toolkit)
 VCFtools
 tabix
 SnpSift
 PLINK
 awk
 Mastering even one of these tools takes effort
 learning all the command-line options and
sub-options
 learning all the input and output file formats
 this gets complex quickly
 Chances are that some functionality or file format is
supported only by another tool
 Interoperability problem
 Most tools run on a single node
 Each node has limited resources
 Big genomic data cannot be stored and processed on a
single node

Moore’s Law

Distributed Computing
 Moore’s Law has reached a tipping point
 The performance of a single processor can’t be increased
much further
 Instead, multiple smaller cores are combined to
form multi-core processors
 Petabytes of genomic data can’t fit on a single
server
 A new computing paradigm is required
 Distributed computing
 Storage and computation capacity is distributed
across multiple cheaper servers
 Each node stores a chunk of the complete data
and processes it locally
 Can be scaled cheaply in the cloud
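The idea of storing a chunk of the data on each node and processing it there can be sketched in plain Python. This is only an illustration under stated assumptions: threads stand in for separate servers, the (chromosome, position) records are made up, and all function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Deal records round-robin into n_chunks lists, one per simulated node."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def count_by_chromosome(chunk):
    """Work done locally on one node: count variants per chromosome."""
    return Counter(chrom for chrom, _pos in chunk)


def distributed_count(records, n_nodes=3):
    """Process each chunk independently, then merge the partial results."""
    chunks = split_into_chunks(records, n_nodes)
    # Threads stand in for separate servers in this sketch.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(count_by_chromosome, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Made-up (chromosome, position) records, for illustration only.
variants = [("chr1", 100), ("chr2", 200), ("chr1", 300),
            ("chrX", 400), ("chr2", 500), ("chr1", 600)]
print(distributed_count(variants))
```

The merged result does not depend on how many nodes the data was split across, which is what lets capacity be added by adding servers.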

Cloud Computing
 On-demand delivery of IT resources over the Internet
 servers, storage, databases, networking,
software, analytics, and intelligence
 Instead of buying, owning, and maintaining physical
data centers and servers, you can access these
services on an as-needed basis
 Pay only for cloud services you use
 lower your operating costs
 run your infrastructure more efficiently
 scale as your business needs change
 Benefits of cloud computing
 Cost savings
 Global deployment in minutes
 Elasticity
 Reliability
 Security
 Providers include AWS, GCP, Azure, Alibaba Cloud, etc.

Apache Spark
 Used for large-scale data processing and machine learning
 Uses distributed computing
 Scales horizontally to thousands of nodes
 Capable of working with petabytes of data held in
distributed storage
 Able to process petabytes of data at interactive
speeds
 Runs on Hadoop, Apache Mesos, Kubernetes,
standalone, or in the cloud
 Managed solutions are available as part of Amazon
EMR, Google Cloud Dataproc, and Microsoft Azure
HDInsight
 Accesses data in HDFS, Amazon S3, Google BigQuery, etc.
 Can take advantage of cloud computing
 rapid upscaling and downscaling of the
Spark cluster
 minimizes cluster operational cost
 Provides high-level APIs in Java, Scala, Python, and R

Amazon Web Services
 AWS is Amazon’s cloud computing platform
 AWS was launched in July 2002
 AWS services were originally exposed through a
service-oriented architecture (SOA) and later moved to
microservices
 EC2 (Elastic Compute Cloud)
 virtual machines in the cloud with OS-level control
 launched in August 2006
 S3 (Simple Storage Service)
 storage service for objects such as files,
folders, images, documents, songs, etc.
 can’t be used to install an OS or software
 launched on March 14, 2006

Amazon Web Services
 EMR (Elastic MapReduce)
 managed big data cluster platform
 simplifies running big data frameworks
 decouples compute and storage
 so each can be scaled independently

Databricks
 Databricks
 highly optimised managed Spark compute
clusters
 uses popular cloud services in the backend for
decoupled storage and processing
 can run on top of AWS, Azure, and GCP
 30% performance gain over traditional Spark on AWS

Delta Lake
 Delta Lake is developed by Databricks
 It improves OLAP workload performance
 Combines the transactional reliability
of databases with the horizontal scalability of
data lakes
 Properties of Delta Lake
 ACID guarantees
 Scalable data and metadata handling
 Audit history and time travel
 Schema enforcement and schema evolution
 Support for deletes, updates, and merges
 Streaming and batch unification
 Fully integrated within the Apache Spark ecosystem
 Brings ACID transactions to Apache Spark and big
data workloads
 Supports APIs in SQL, Java, Scala, Python, etc.
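Time travel over a Delta table might look like the following PySpark sketch. It assumes a SparkSession with Delta Lake available (as on a Databricks cluster); the helper names and any table path passed to them are hypothetical, and only the standard `delta` format with its `versionAsOf` read option is used.

```python
def append_to_delta(df, table_path):
    """Append a DataFrame to a Delta table.

    Each successful write is recorded as a new version of the table.
    """
    df.write.format("delta").mode("append").save(table_path)


def read_delta_version(spark, table_path, version):
    """Time travel: read the table as it was at the given version number."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(table_path)
    )
```

For example, `read_delta_version(spark, table_path, 0)` would return the table as first written, which could then be compared with the latest version to see how the stored genomic data changed over time.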

Delta Lake
 Data in Delta Lake is stored in Delta tables.
 Delta tables can store data in file systems like HDFS
and cloud object stores like S3.
 Delta tables are designed to be written primarily by
Spark applications.
 Delta tables can be read by many open-source data
engines like Spark SQL, Hive, and Presto, and by several
enterprise products like AWS Athena, Azure
Synapse, BigQuery, etc.

Project Glow
 Open-source toolkit for genomic analysis at
biobank scale
 Built natively on Apache Spark
 Developed by Databricks in collaboration with the
Regeneron Genetics Center
 Backward compatible with existing bioinformatics tools
 Works with many kinds of file formats
 .fasta, .fastq, .sam, .bam, .vcf, .gff,
.bgen, etc.
 Data in the above file formats can be loaded into Spark
DataFrames
 Provides functions for performing quality control and
data manipulation
 Variant normalization
 Integration with Spark ML libraries for population
stratification
 Exposes its functionality through the native Spark SQL APIs in
Python, SQL, R, Java, and Scala.
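Loading variant data into a Spark DataFrame might look like the following PySpark sketch. It assumes a SparkSession on a cluster where Glow is installed, so that the `vcf` data source is available. The helper names are hypothetical, and the column names follow Glow's VCF schema as I understand it; they should be checked against the installed Glow version.

```python
def load_vcf(spark, vcf_path):
    """Read a VCF file into a Spark DataFrame through Glow's "vcf" data source."""
    return spark.read.format("vcf").load(vcf_path)


def variants_on_contig(variants_df, contig):
    """Keep the variants on one contig and select a few core columns.

    Column names are assumed from Glow's VCF schema.
    """
    return (
        variants_df
        .where(variants_df["contigName"] == contig)
        .select("contigName", "start", "referenceAllele", "alternateAlleles")
    )
```

Because the result is an ordinary DataFrame, it can be filtered, joined, and written out with the same Spark APIs used for any other data.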
