(HLS402) Getting into Your Genes: The Definitive Guide to Using Amazon EMR, Amazon ElastiCache, and Amazon S3 to Deliver High-Performance, Scientific Applications | AWS re:Invent 2014

Puneet Suri, Thermo Fisher Scientific
Shakila Pothini, Thermo Fisher Scientific
Sami Zuhuruddin, Amazon Web Services
November 12, 2014 | Las Vegas, NV

About me
Puneet Suri
Senior Director, Software Engineering
Life Sciences Group, Thermo Fisher Scientific
follow at: @psuriconnect at: puneet.suri@thermofisher.com
Envisionedanddeveloped the life sciences cloud platform for Thermo Fisher Scientific

Having an impact…
A person was set free after 35 years in prison because of a DNA test
Freeing the innocent
Surviving Cancer
A person survived pancreatic cancer thanks to a genetic approach that allowed an oncologist to focus on a specific cancer cell
Ebola

H1N1: Pandemic declared in April 2009

Need to enable this at larger scale & impact more lives

Customer needs…
store & manage large scientific data sets

Our offerings
desktop applications
challenges with upgrade cycle, versions etc.
limited storage and compute capacity to analyze complex & large data sets
no sharing & collaboration
no backup, archive & security

A better way… is to provide
STORAGE
COMPUTE
SCALABILITY
MEMORY

A day with the scientist
Get Insights
a project

Insights…
•what is causing cancer
•what drugs will work
•is therapy working

Customer pain points
•existing solutions cannot address the complexities
•Excel is used, painfully, to analyze data manually
•multiple tools are needed to reach the final insight
•it takes days to analyze the data
•some analysis workflows are not possible

Dimensions of complexity…
millions of records
thousands of users, projects
real time analysis of large datasets
2-3 seconds response time
project: storage, compute, performance, scalability

Our journey enabling complex customer workflows

Our iterative journey & challenges
0. start with reference architecture
1. identify scalable storage solution
2. identify scalable storage solution for large data items
3. identify solutions for real time response & queries
4. identify solutions for real time analysis of data

Reference web architecture
[Diagram: user client, Internet, Route 53 (DNS routing), CloudFront (CDN), load balancers, auto-scaling web servers, auto-scaling app servers, and an Amazon RDS master with synchronous replication to an RDS standby, laid out across two zones labeled A and B]

Why relational DB was not considered
•with the projected data and user growth over the years (hundreds of TBs), the required real-time query performance would be very hard to achieve
•needed managed scalability without sharding/re-sharding overhead and disruptions
•needed a loose schema to seamlessly enable new and cross-domain workflows

NoSQL was the way to go
•managed scalability
•near zero administration overhead
•query performance not impacted by table size; can add billions of rows
•simple and flexible schema: new domains can be supported
•extremely fast read/write performance

Architecture with DynamoDB
[Diagram: the reference web architecture with the database tier shown as Amazon DynamoDB: user client, Internet, Route 53 (DNS routing), CloudFront (CDN), load balancers, auto-scaling web servers, auto-scaling app servers, Amazon DynamoDB]

What worked well with DynamoDB
Managed service with flexible schema
Managed scalability
Extremely fast READ/WRITE access, on the order of milliseconds

Iteration 1
[Diagram: project data entities: Instrument Run (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions); sizes range from MBs to GBs; goal: Get Insights]
Storage Query
Performance ✔ ✔
Cost ✔ ✔

What were the gaps
our items (e.g. Instrument Run) can exceed 400KB
(DynamoDB item size limit: 400KB, raised from 64KB)
hot hash key & batch size limitations
•adding thousands of related records (e.g. Raw Signals) with a common hash key (e.g. Instrument Run) can be slow (10s of seconds)
•a large project can have ~1 million records (e.g. Raw Signals) that need to be read & written
for a large project, high read/write capacity (1000s) was needed
(increased cost due to high READ/WRITE capacity needs)
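
A rough sketch of the capacity arithmetic behind these gaps. It assumes DynamoDB's provisioned-throughput rule that one write capacity unit covers one write per second of an item up to 1 KB (larger items round up to the next 1 KB); the 2 KB record size and the 10-minute load window are hypothetical values chosen for illustration, not figures from the talk.

```python
import math

MAX_ITEM_BYTES = 400 * 1024  # item size limit cited on the slide


def fits_in_item(size_bytes: int) -> bool:
    """True if a record of this size can be stored as a single DynamoDB item."""
    return size_bytes <= MAX_ITEM_BYTES


def write_units_per_item(size_bytes: int) -> int:
    """Write capacity units consumed by one write (rounded up per 1 KB)."""
    return max(1, math.ceil(size_bytes / 1024))


def provisioned_writes_needed(record_count: int, size_bytes: int, seconds: int) -> int:
    """Write capacity units per second needed to load all records in the window."""
    total_units = record_count * write_units_per_item(size_bytes)
    return math.ceil(total_units / seconds)


# Hypothetical large project: ~1 million raw-signal records of 2 KB each,
# loaded within 10 minutes.
needed = provisioned_writes_needed(1_000_000, 2 * 1024, 600)
print(f"write capacity units needed: {needed}")
print(f"500 KB instrument run fits in one item: {fits_in_item(500 * 1024)}")
```

Under these assumptions the table would need thousands of provisioned write units just for the load window, which is consistent with the cost concern noted above.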

What we needed
A solution that
•can store a huge number of related objects
•is cost effective for reading/writing large data sets
•has no limitations on batch size or item size
•can query into the large number of records

0. start with reference architecture
1. identify scalable storage solution : DynamoDB
2. identify scalable storage solution for large data items
3. identify solutions for real time response & queries
4. identify solutions for real time analysis of data

Architecture with DynamoDB & S3
[Diagram: user client, Internet, Route 53 (DNS routing), CloudFront (CDN), load balancers, auto-scaling web servers, auto-scaling app servers, DynamoDB, Amazon S3]
Iteration 2
[Diagram: project data entities: Instrument Run (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions); sizes range from MBs to GBs; goal: Get Insights]
Storage Query
Performance ✔ ✔
Cost ✔ ✔

Architecture with DynamoDB & S3
•DynamoDB was used to store small unrelated objects (KBs)
•these will grow to a large number (e.g. Data Files)
•Amazon S3 was used to store larger related objects (e.g. Raw Signals & Analysis Results (GBs))
•stored as a single Amazon S3 object, serialized using Google Protocol Buffers (protobuf)
•Amazon S3 was cost effective for storing huge objects
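
The split described above can be sketched as follows. This is an illustration under stated assumptions, not the team's implementation: plain dictionaries stand in for the DynamoDB table and the S3 bucket, gzip-compressed JSON is swapped in for protobuf so the example has no third-party dependencies, and the class name, key layout, and field names are all hypothetical.

```python
import gzip
import json


class TieredStore:
    """Small pointer items in a table, related records as one large blob."""

    def __init__(self):
        self.table = {}   # stand-in for a DynamoDB table (small items)
        self.bucket = {}  # stand-in for an S3 bucket (large objects)

    def put_related(self, parent_id, records):
        """Serialize all related records into one object and store a pointer."""
        blob = gzip.compress(json.dumps(records).encode("utf-8"))
        key = f"projects/{parent_id}/records.json.gz"  # hypothetical key layout
        self.bucket[key] = blob
        # The table item stays small: it only points at the large object.
        self.table[parent_id] = {
            "s3_key": key,
            "record_count": len(records),
            "bytes": len(blob),
        }
        return key

    def get_related(self, parent_id):
        """Follow the pointer and deserialize the whole set of records."""
        pointer = self.table[parent_id]
        blob = self.bucket[pointer["s3_key"]]
        return json.loads(gzip.decompress(blob).decode("utf-8"))


store = TieredStore()
signals = [{"id": i, "signal": i * 0.5} for i in range(1000)]
store.put_related("run-001", signals)
print(store.table["run-001"]["record_count"], "records behind one pointer item")
```

Writing the records as one object avoids per-record write capacity, at the price of reading the whole object to reach any single record.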

Real time queries for complex visualizations

What we needed
•complex visualizations require gigabytes of data to be queried in 2-3 secs and presented to the user
•visualizations are very interactive and require constant updates of data; reads & writes need to be quick
•concurrent access must be supported without any degradation in query performance

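0. start with reference architecture
1. identify scalable storage solution : DynamoDB
2. identify scalable storage solution for large data items : DynamoDB + Amazon S3
3. identify solutions for fast real time response & queries
4. identify solutions for real time analysis of data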

Distributed in-memory storage was the way to go
reads/writes have to be quick to enable fast response times; reading & writing directly from Amazon S3 was not ideal.
•ElastiCache was used as IN-MEMORY storage on top of DynamoDB & Amazon S3.
•all related serialized objects in Amazon S3 that customers access are maintained in ElastiCache as individual records
•indexes are created in DynamoDB based on the query pattern so that data can be easily retrieved from ElastiCache
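
A minimal sketch of the read path this describes: look up record ids in an index, serve the records from the in-memory cache, and fall back to the serialized object only on a miss. Dictionaries stand in for ElastiCache, DynamoDB, and Amazon S3; the class name, key format, index shape, and sample records are hypothetical.

```python
import json


class CachedProject:
    """Index lookup, then cache read, with a one-time load from the blob."""

    def __init__(self, s3_objects, index_table):
        self.s3_objects = s3_objects    # stand-in S3: project id -> serialized records
        self.index_table = index_table  # stand-in DynamoDB index: (project, gene) -> ids
        self.cache = {}                 # stand-in ElastiCache: key -> individual record
        self.s3_reads = 0               # counts how often the slow path is taken

    def _warm(self, project_id):
        """Explode the serialized object into individual cache records."""
        self.s3_reads += 1
        for record in json.loads(self.s3_objects[project_id]):
            self.cache[f"{project_id}:{record['id']}"] = record

    def query(self, project_id, gene):
        """Return the records for one gene, reading S3 only on a cache miss."""
        record_ids = self.index_table.get((project_id, gene), [])
        keys = [f"{project_id}:{rid}" for rid in record_ids]
        if any(key not in self.cache for key in keys):
            self._warm(project_id)
        return [self.cache[key] for key in keys]


records = [
    {"id": "r1", "gene": "TP53", "signal": 1.5},
    {"id": "r2", "gene": "KRAS", "signal": 0.7},
    {"id": "r3", "gene": "TP53", "signal": 2.5},
]
project = CachedProject(
    s3_objects={"p1": json.dumps(records)},
    index_table={("p1", "TP53"): ["r1", "r3"], ("p1", "KRAS"): ["r2"]},
)
print(project.query("p1", "TP53"))
print(project.query("p1", "KRAS"), "S3 reads:", project.s3_reads)
```

Only the first query pays for the S3 read; later queries against the same project are served from memory, which is the property the 2-3 second response target depends on.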

Architecture with DynamoDB, Amazon S3 & ElastiCache
[Diagram: user client, Internet, Route 53 (DNS routing), CloudFront (CDN), load balancers, auto-scaling web servers, auto-scaling app servers, DynamoDB, Amazon S3, ElastiCache]

Iteration 3
[Diagram: project data entities: Instrument Run (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions), plus indexes; sizes range from MBs to GBs; goal: Get Insights]
Storage Query
Performance ✔ ✔
Cost ✔ ✔

Need for real time data analysis
•analyze huge projects containing thousands of patient samples in minutes instead of days
•a scalable solution is required to support analysis requests from thousands of users
•existing desktop algorithms used for this analysis are not optimized to exploit parallelism in the data

[Chart: Analysis solutions in desktop. x-axis: # of records (90,000; 180,000; 270,000; 360,000; 450,000; 675,000; 900,000); y-axis: minutes (0 to 350); desktop data labels: 8, 20, 40, 80, 120, 200, 320; annotation: crashes]





0. start with reference architecture
1. identify scalable storage solution : DynamoDB
2. identify scalable storage solution for large data items : DynamoDB + Amazon S3
3. identify solutions for fast real time response & queries : DynamoDB + Amazon S3 + ElastiCache
4. identify solutions for real time analysis of data

Amazon EMR was the way to go
1. Amazon EMR was used to perform real time analysis of huge data sets: results in minutes instead of days
2. all small jobs are analyzed in-memory, while big ones are sent to Amazon EMR
3. existing algorithms were overhauled to derive massive parallelism using the Hadoop MapReduce framework
4. as the large datasets were already in Amazon S3, Amazon S3 was used for input and output instead of HDFS; only intermediate map-reduce data is kept in HDFS
5. the Amazon EMR cluster is created on demand and shut down when done
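
To illustrate points 2 and 3, here is a single-process sketch of the map-reduce shape and of routing jobs by size. The talk does not describe the actual algorithms or the cut-over size, so the per-gene mean, the 100,000-record threshold, and all function names are assumptions made for the example; on a real cluster the grouping step would be the framework's shuffle.

```python
from collections import defaultdict
from statistics import mean

IN_MEMORY_MAX_RECORDS = 100_000  # hypothetical cut-over between the two paths


def choose_engine(record_count):
    """Small jobs stay in memory; big ones go to the cluster."""
    return "in-memory" if record_count <= IN_MEMORY_MAX_RECORDS else "emr"


def map_record(record):
    """Map step: emit (gene, signal) for one raw record."""
    yield record["gene"], record["signal"]


def reduce_gene(gene, signals):
    """Reduce step: collapse all signals for one gene into a single value."""
    return gene, mean(signals)


def run_map_reduce(records):
    """Run map, group by key, then reduce, all within one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            groups[key].append(value)
    return dict(reduce_gene(gene, signals) for gene, signals in groups.items())


sample = [
    {"gene": "TP53", "signal": 1.0},
    {"gene": "TP53", "signal": 3.0},
    {"gene": "KRAS", "signal": 5.0},
]
print(choose_engine(len(sample)), run_map_reduce(sample))
```

Because each map call looks at one record and each reduce call at one gene, the same two functions can be spread across many nodes without changing their logic, which is the kind of parallelism the overhaul was after.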

Architecture with EMR for real time analysis
[Diagram: user client, Internet, Route 53 (DNS routing), CloudFront (CDN), load balancers, auto-scaling web servers, auto-scaling app servers, DynamoDB, Amazon S3, ElastiCache, EMR]

Iteration 4
[Diagram: project data entities: Instrument Run (1000s), Patient Samples (1000s), Genes (1000s), Raw Signals (millions), Analysis Results (millions); sizes range from MBs to GBs; goal: Get Insights]
Storage Query Analysis
Performance ✔ ✔ ✔
Cost ✔ ✔ ✔

Performance for a project
[Chart: x-axis: # of records (90,000; 180,000; 270,000; 360,000; 450,000; 675,000; 900,000); y-axis: minutes (0 to 350); series: cloud (data labels 2, 4, 7, 11, 13, 20, 30) and desktop; annotations: >10x, crashes]
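
The ">10x" annotation can be checked with a few lines of arithmetic. This sketch assumes the desktop timings are the data labels from the earlier desktop-only chart (8 to 320 minutes) and that both charts share the same record counts; that pairing is an inference from the slide residue, not something the deck states.

```python
records = [90_000, 180_000, 270_000, 360_000, 450_000, 675_000, 900_000]
desktop_minutes = [8, 20, 40, 80, 120, 200, 320]  # assumed: labels from the desktop chart
cloud_minutes = [2, 4, 7, 11, 13, 20, 30]         # labels from this chart


def speedups(desktop, cloud):
    """Desktop time divided by cloud time for each project size."""
    return [round(d / c, 1) for d, c in zip(desktop, cloud)]


for count, ratio in zip(records, speedups(desktop_minutes, cloud_minutes)):
    print(f"{count:>7} records: {ratio}x faster")
```

Under that pairing the advantage grows with project size and reaches roughly 10x at the two largest sizes.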

Journey
0. start with reference architecture ✓
1. identify scalable storage solution : DynamoDB ✓
2. identify scalable storage solution for large data items : DynamoDB + Amazon S3 ✓
3. identify solutions for fast real time response & queries : DynamoDB + Amazon S3 + ElastiCache ✓
4. identify solutions for real time analysis of data : Amazon EMR ✓

Learnings

About me : Shakila Pothini
Senior Manager, Cloud Apps
Life Sciences Group, Thermo Fisher Scientific
Hiking is my ONLY stress buster
Entertain to Educate.
Cofounder of performing arts group (swaram.org)
Mostly left brained with occasional sense of creativity

How to get into your gene?
sequence the entire human transcriptome (30,000 genes)
identify significant genes (100+ genes)
validate & reconfirm (20+ genes)
do it on more samples & different populations
find the way the genes interplay in the pathway
understand cancer diversity, types of therapy, druggable genes

Demo summary
non cancerous sample
cancerous sample
difference in expression of genes

Customer feedback
“My initial Symphoni Suite evaluation experience was good, GUI/controls are intuitive and data upload/analysis was fast and user friendly”
UPENN
“I enjoy processing hundreds of open array plates with ease.”, “I appreciate the rapid access of the large number of amplification curves”
Sanofi
“I wanted to let you know that Symphoni has been working well for me. I have done analysis using as high as 500 files.”
ASU
“This I see value in... utilizing these features. I appreciate the speed of data processing and visuals.”
LUMC

Yearly checkup today
165 / 105
120
50 / 90
104

Is this really going to detect early stages of cancer?

A few years from now : every person
ATGCATGCTATCAATTGCCC
Sequence
melanoma
health risks
drug response
powered by AWS
lifecloud

Yearly check-ups a few years from now
ATGCATGC ATTGCCC
ATGCATGC ATTGCCC
TATCA
GCATG
lifecloud
Sequence

Yearly check-ups a few years from now (cont’d)
cancer
any clinical trial?
health risks
drug response
Sequence
lifecloud
prescribe the right drug

Puneet Suri
Senior Director, Software Engineering
puneet.suri@thermofisher.com
T: 650.266.5857 @psuri
Shakila Pothini
Senior Manager, Cloud Applications
shakila.pothini@thermofisher.com
Salil Kumar
Cloud Architect
T: 650.740.1646 @salilkum

[Diagram: from Data to Answers. Collect / Ingest: Kinesis. Store: S3, DynamoDB, RDS, Glacier. Process / Analyze: EMR, EC2, Redshift, Data Pipeline. Visualize / Report.]

Experiment 1: Data Access, Compute Time ✔
Experiment 2: Data Access, Compute Time ✔
Experiment 3: Data Access, Compute Time ✔

EMR Cluster
EC2 Instance
Data Temperature

Please give us your feedback on this session.
Complete session evaluations and earn re:Invent swag.
http://bit.ly/awsevals
Puneet Suri
Senior Director, Software engineering
puneet.suri@thermofisher.com
T: 650.266.5857 @psuri
Shakila Pothini
Senior Manager, Cloud Applications
shakila.pothini@thermofisher.com
T: 650.554.2190
Salil Kumar
Cloud Architect
T: 650.740.1646 @salilkum
