Cloud Computing and Innovations for Optimizing Life Sciences Research
Krittika Sasmal @ InterpretOmics
19th February 2014
Acknowledgement
• Organizers of this event
• Scientists and Researchers who work for others and make their research findings OPEN
• Entrepreneurs who translate this knowledge
• GNU, Open Software, NIH and other funding bodies that keep Biology & Medical information OPEN
4 Grand Social Challenges
Food Security
Health Security
Energy Security
Environmental Security
Common Thread is Biology
The 21st Century Biology – The Quantitative Biology
Will create a discovery engine able to tackle extremely complex biological and societal problems
Ref: A New Biology for the 21st Century, The National Academies
Decoding the Book of Life
– milestone for Quantitative Biology
A Milestone for Humanity – the Human genome
Human Genome working draft announced, June 26th, 2000
Journey is
- From Reduction to Integration
− Many diseases were researched and understood through the process of reduction
− However, as the understanding of diseases matures and the need for proactive medicine increases, researchers find that many diseases, including cancer, are due to somatic mutations that cannot be understood in the reduced space
− Understanding such diseases and discovering drugs for them will need the reverse operation: integration and systems biology
Ref: Hiroaki Kitano, Systems Biology: A Brief Overview, Science 295, 1662 (March 2002)
Translational Medicine
– Genomics + Clinical + Non-clinical Data to Discovery of Novel therapeutics
[Figure: Data → Information → Knowledge. Labels: Literature/Molecular Data, Clinical/Bedside Data, Disease/Drug Data, Target Data, Preprocessed Data, Transformed Data, Patterns, Medical Knowledge, iOmics]
Next Generation Sequence Data
• FASTQ (Illumina)
• SFF (454)
• CCS (PacBio)
• ...
• Microarray
[Figure: read layouts. Labels: Single End Sequences; Paired End or Mate-paired Sequences (Insert Size, Library Size); Overlapped reads; Random Order & Orientation; Long reads; Short reads; Fixed length reads; Variable length reads; Circular Consensus reads; DNA/RNA/miRNA; cDNA/mRNA; Hundreds to Billions Bases]
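To make the FASTQ format listed above concrete, here is a minimal sketch of reading its four-line records (header, sequence, separator, quality) and computing a mean quality per read. The names `FastqRecord`, `parse_fastq` and `mean_quality` are hypothetical, and the Phred offset of 33 is an assumption (Sanger / Illumina 1.8+ encoding); older Illumina files may use a different offset.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from 4-line FASTQ text (header, sequence, '+', quality)."""
    stripped = (line.strip() for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if (not header.startswith("@") or sequence is None
                or separator is None or quality is None
                or not separator.startswith("+")
                or len(sequence) != len(quality)):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def mean_quality(record: FastqRecord, offset: int = 33) -> float:
    """Mean Phred score of one read; offset 33 is an assumed encoding."""
    scores = [ord(ch) - offset for ch in record.quality]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Two made-up reads, for illustration only.
    text = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!!!\n"
    for rec in parse_fastq(text.splitlines()):
        print(rec.name, rec.sequence, mean_quality(rec))
```

This sketch only handles the common layout where each record occupies exactly four lines; multi-line sequences would need a more careful parser.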
Data domains and Challenges
Source: Clevergene Biocorp
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
1. Maximize insight into a data set
2. Uncover underlying structure
3. Extract important variables
4. Detect outliers and anomalies
5. Test underlying assumptions
6. Develop parsimonious models
7. Determine optimal factor settings
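As one illustration of item 4 (detect outliers and anomalies), the sketch below flags values whose modified z-score, based on the median absolute deviation (MAD), exceeds a cutoff. This is only one of many possible techniques and is not taken from the slides; the function name `find_outliers`, the sample values, and the cutoff of 3.5 (a commonly used rule of thumb) are assumptions.

```python
from statistics import median
from typing import List


def find_outliers(values: List[float], threshold: float = 3.5) -> List[float]:
    """Return values whose MAD-based modified z-score exceeds the threshold."""
    center = median(values)
    # Median absolute deviation: a robust measure of spread.
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread around the median, so nothing can be flagged
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # for normally distributed data.
    return [v for v in values if abs(0.6745 * (v - center) / mad) > threshold]


if __name__ == "__main__":
    # Hypothetical expression measurements with one suspicious value.
    measurements = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]
    print(find_outliers(measurements))  # prints [25.0]
```

Because the median and MAD are barely affected by the extreme value itself, this check tends to be more stable on small samples than a mean and standard deviation rule.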
Each of these is addressed through algorithms to solve a computational problem, typically based upon optimizing some mathematical criterion
Trends in Genomic Medicine
J. J. McCarthy et al., Sci Transl Med 2013;5:189sr4
Data Driven Healthcare
[Figure: progression from Health Care Today through Translational Medicine to Personalized Health Care. Labels: Evolutionary Practices; Revolutionary Technology; Digital Imaging; Episodic Treatment; Electronic Health Records; Artificial Expert Systems; Clinical Genomics; Genetic Predisposition Testing; Molecular Medicine; CA Diagnosis; Pre-symptomatic Treatment; Lifetime Treatment; Automated Systems; Non-specific (Treat Symptoms); Information Correlation; Organized (Error Reduction); Personalized (Disease Prevention); Data and Systems Integration; Distributed High-Throughput Analytics]
P6 Medicine
– Preventive, Predictive, Participatory, Personalized, Precision, and Pervasive
Personal Data
Population Data
[Figure labels: Registry, Claims Data, Clinical Trial, Drug reaction, Literature, Genomic Data, Survivability, Public Health Epidemiology]
Biomedical Data
[Figure labels: Clinical Repositories; Online Mendelian Inheritance in Man (OMIM); Medical Subject Headings (MeSH); Genome/Gene Annotations; University of Washington Digital Anatomist (UWDA); NCBI Taxonomies; Human Metabolome; RxNorm Drug; ICD10; Logical Observation Identifiers Names and Codes (LOINC); UMLS (Unified Medical Language System)]
7V's in Healthcare Big-data
• Vexing. Algorithms need to be designed so that data processing time stays close to linear. Many genomic analysis problems are NP-hard, so proper parallel algorithms are needed to access data in near real time.
• Volume. The physical volume of data that needs to be online, including structured and unstructured data. Storage is available; the challenge is to determine relevance within large data volumes and to use analytics to create value from the relevant data.
• Velocity. Data must be retrieved in a timely manner. In healthcare, many data sources are outside the enterprise. Reacting quickly enough to deal with data velocity is critical for most healthcare applications.
• Variety. Data today come in all types of formats: structured numeric data and unstructured data such as CT scans, MRI, ultrasound, and X-rays, in different forms. Also, most healthcare data is categorical, with or without any order. Managing, merging and governing different varieties of data is a challenge.
• Variability. Healthcare data are highly inconsistent, with periodic peaks. Cancer, for example, has four different types of variability: intratumoral, intermetastatic, intrametastatic, and interpatient. Discovery of the independent variables and the causal attributes is critical.
• Veracity. Quality, relevance, repeatability, quantification, meaningfulness, predictive value, reduction of error.
• Value. The final result, quantified as ROI, reduced readmissions, or reduced morbidity, is what will finally be measured.
Healthcare Analytics & Decision Support System
Analytics of 7Vs
Big Data in Life Sciences
The question is NOT whether we can do it, but HOW QUICKLY and HOW ACCURATELY we can do it, where Speed, Repeatability, Reliability, Predictability, and Precision matter!
The Cloud
You don't buy a COW when you need Milk!
Likewise, a Biologist, a Breeder, or a Clinician need not Worry about Data Analysis, Algorithms, Supercomputers, Pipelines, or even the Analytics and the Biomedical Informatics!
Cloud Computing Defined
Cloud computing is an emerging computing paradigm where data and applications reside in cyberspace; it allows users to access their data and information through any web-connected device, be it fixed or mobile.
Source: John B. Horrigan, Use of Cloud Computing Applications & Services, Data Memo, Pew Internet & American Life Project, September 2008
Cloud Computing User – I (Amir)
Cloud Computing User – II (Fakir)
Characteristics of Cloud Computing
• Virtual – Physical location and underlying infrastructure details are transparent to users
• Scalable – Able to break complex workloads into pieces to be served across an incrementally expandable infrastructure
• Efficient – Service Oriented Architecture for dynamic provisioning of shared compute resources
• Flexible – Can serve a variety of workload types, both consumer and commercial
Benefits of Cloud
Unlimited Resource
• Unlimited computing power
• Unlimited storage (filestore & online memory)
Users can use resources without owning anything – converting Capex to Opex
Helping Green computing by lending out idle resources through Cycle Scavenging
Pay as you go
Cloud Computing Stack
[Figure: cloud stack layers (Facilities, Hardware, Connectivity & delivery, API, Integration & Middleware, Data, Metadata, Content, Application, Presentation Modality, Presentation Platform) grouped into Infrastructure as a Service, Platform as a Service, and Software as a Service, with vertical labels QoE & QoS, Security, and Middleware. Roles: Original Cloud Provider, Cloud Vendor, Cloud User. Access: User/Customer/Device over Next Generation Network / Intranet]
Divide and Conquer: MapReduce
[Figure: MapReduce data flow. A Master coordinates Workers. Input Files are divided into Split 1 to Split 4; Map workers write Intermediate files of key:value pairs (k1:v1, k3:v2, k1:v3, k2:v4, k2:v5, k4:v6); Sort/Group collects values by key (k1:v1,v3; k2:v4,v5; k3:v2; k4:v6); Reduce workers write Output files (Output 1 to Output 3)]
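The map, sort/group and reduce stages in the figure can be sketched in a few lines of plain Python. This is a single-process illustration of the idea, not Hadoop or Twister code; the function names and the choice of counting bases in DNA fragments are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_split(split: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit one (key, value) pair per base in an input split."""
    for base in split:
        yield base, 1


def sort_and_group(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Sort/Group step: collect all values that share a key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_group(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine the grouped values for one key."""
    return key, sum(values)


def map_reduce(splits: List[str]) -> Dict[str, int]:
    # In a real cluster each split goes to a different worker;
    # here the "workers" simply run one after another.
    intermediate = [pair for split in splits for pair in map_split(split)]
    grouped = sort_and_group(intermediate)
    return dict(reduce_group(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    splits = ["ACGT", "AACC", "GGTA"]  # hypothetical input splits
    print(map_reduce(splits))  # prints {'A': 4, 'C': 3, 'G': 3, 'T': 2}
```

Because each split is mapped independently and each key is reduced independently, both stages can be spread across many machines, which is what the Master and Workers in the figure do.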
Open Source MapReduce
Hadoop
− Implemented in Java; enabled on Amazon
Twister
− Lightweight new arrival in town
Security in the Cloud
Security in the cloud needs to answer a few specific questions, such as:
1. How much trust do you have in the virtualized environment or the hypervisors in the cloud, as against your own physical hardware?
2. How much trust do you have in the cloud vendor versus your own infrastructure?
3. How do you address regulatory and compliance requirements when your application might be running on infrastructure in a foreign country?
New generation software in bioinformatics needs to:
• Be fast or very fast, with low memory consumption
• Handle and analyze terabytes of data
• Store data efficiently for querying
• Distribute computation, not data
• Be focused and useful in analysis
www.iomics.in
The Omics Lab in the Internet
Crop to Cancer
Analysis Pipelines in iOMICS
QA/QC of raw reads
SNP/InDel/CNV analysis
miRNA discovery
Exome sequencing
ChIP sequencing
... (New additions)
Meta-analysis
Visualization of results
Omics data
Systems Biology
SNP & CNV Analysis
Hierarchical Clustering
Gene Interaction/Enrichment
Cloud Computing Holds the Potential to Address the Challenges and Transform Biology and Healthcare
Krittika Sasmal
Email: “krittika” dot “sasmal” (at) “interpretomics” dot “co”