Thermo Fisher Scientific, a world leader in biotechnology, has built a new polymerase chain reaction (PCR) system for genetic analysis. Designed for low- to mid-level throughput laboratories that conduct real-time PCR experiments, the system runs on individual QuantStudio devices. These devices are connected to Thermo Fisher’s cloud computing platform, which is built on AWS using Amazon EC2, Amazon DynamoDB, and Amazon S3. With this single platform, applied and clinical researchers can learn, analyze, share, collaborate, and obtain support. Researchers worldwide can now collaborate online in real time and access their data wherever and whenever necessary. Laboratories can also share experimental conditions and results with their partners while providing a uniform experience for every user and helping to minimize training and errors. The net result is increased collaboration, faster time to market, fewer errors, and lower cost. We have architected a solution that uses Amazon EMR, DynamoDB, Amazon ElastiCache, and S3. In this presentation, we share our architecture, lessons learned, best design patterns for NoSQL, strategies for leveraging EMR with DynamoDB, and the flexible solution our scientists use. We also share the next step in our architecture’s evolution.
2. About Me
Puneet Suri
Senior Director, Software Engineering
Life Sciences Group, Thermo Fisher Scientific
follow at: @psuri · connect at: puneet.suri@thermofisher.com
Envisioned and developed the life sciences cloud platform for Thermo Fisher Scientific
3. 4.5 hours of TV
11 hours of online reading
5 hours of radio
2 hours of computer
5 hours of handheld
1.13 hours of phone
4. 37.2 trillion cells in our body
• DNA in a single cell: 6 feet long
• DNA from all cells: Earth to Moon 6,000 times; Earth to Sun 30 times
• Basepairs in DNA: 3.2 billion (ATGC), at 2 bytes/basepair
• Human genome: 200 gigabytes
• Population of 7 billion: 1.4 zettabytes
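The population-scale figure follows from simple arithmetic; a quick sanity check of the slide's numbers (assuming decimal units, 1 GB = 10^9 bytes, 1 ZB = 10^21 bytes):

```python
# Back-of-the-envelope check of the slide's numbers, assuming decimal
# units (1 GB = 1e9 bytes, 1 ZB = 1e21 bytes).
GENOME_BYTES = 200 * 10**9   # ~200 GB of data per human genome
POPULATION = 7 * 10**9       # 7 billion people

total_bytes = GENOME_BYTES * POPULATION
zettabytes = total_bytes / 10**21
print(zettabytes)            # 1.4
```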
8. Having an Impact…
• Freeing the innocent: a person was set free after 35 years in prison because of a DNA test
• H1N1 pandemic: CDC found that swine flu viruses in the U.S. and Mexico match
• Ebola
• 7500 instrument
9. What to Expect from the Session
• Refresh / overview of our architecture
• Evolution of the architecture
• Connected ecosystem
• Future possibilities
10. A Day with the Scientist
[Workflow graphic: a scientist's day, from a project to getting insights]
11. Insights
• What is causing cancer?
• Is the disease hereditary?
• What drugs will work?
• Is therapy working?
14. Our Offerings & Pain Points
Our offerings: desktop apps, multiple tools
Pain points: no backup, archive, or security; limited analysis; limited collaboration; slow analysis; unsupported workflows; Excel nightmare
15. Customer Needs…
• Real-time analysis
• Highly interactive visualizations
• 2-3 second response time
• Easy to share data & collaborate
• Store & manage large scientific data sets
16. A Better Way Is to Provide…
STORAGE
COMPUTE
SCALABILITY
MEMORY
20. Custom Storage Engine
To meet the challenges of storage, compute,
performance & scalability, a custom storage and query
engine had to be designed and implemented
21. Design Goals for Our Custom Storage Engine
• High availability (HA) out of the box – no config
needed to turn on HA
• Completely decentralized, so no SPOF
• Managed scalability such that no admin input
required to scale storage engine as data volume
and concurrent requests go up
• Manageable total cost of ownership (TCO)
22. Custom Storage Engine Design
• Amazon S3: redundant storage for big data sets
• DynamoDB: stores indexes and small objects
• ElastiCache: in-memory caching for faster access using index-based retrieval
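The three tiers suggest a simple size-based routing rule; a minimal sketch, assuming the 400 KB DynamoDB item-size limit discussed later in the deck (`choose_tier` is an illustrative helper, not the platform's actual code):

```python
# Sketch: route an object to a storage tier by size, assuming the
# 400 KB DynamoDB item-size ceiling mentioned later in the deck.
DYNAMODB_ITEM_LIMIT = 400 * 1024  # bytes

def choose_tier(payload: bytes) -> str:
    """Small objects and indexes go to DynamoDB; big data sets to S3."""
    return "dynamodb" if len(payload) <= DYNAMODB_ITEM_LIMIT else "s3"

print(choose_tier(b"x" * 1024))          # dynamodb
print(choose_tier(b"x" * 1024 * 1024))   # s3
```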
23. Our Iterative Journey & Challenges
0 Start with reference architecture
1 Identify scalable storage solution
2 Identify scalable storage solution for large data items
3 Identify solutions for real-time response & queries
4 Identify solutions for real-time analysis of data
24. NoSQL (DynamoDB)
• Managed scalability
• Near zero administration overhead
• Query performance not impacted by table size;
can add billions of rows
• Key value store allows for flexibility, so new
domains can be supported
• Read/write performance on the order of tens of ms
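The key-value flexibility above can be illustrated with a toy model of a table keyed by a hash key and a range key (the schema and names like "run-001" are illustrative, not the platform's actual design):

```python
# Toy model of the key-value design the slide describes: a hash key
# groups related records, a range key distinguishes them. Schema and
# key names are illustrative, not the actual platform's.
table = {}  # stand-in for a DynamoDB table: (hash_key, range_key) -> item

def put_item(hash_key, range_key, item):
    table[(hash_key, range_key)] = item

def query(hash_key):
    """Fetch all records sharing a hash key. In DynamoDB this lookup
    cost does not grow with table size; the scan here is a stand-in."""
    return [v for (h, r), v in sorted(table.items()) if h == hash_key]

put_item("run-001", "signal#0001", {"value": 0.42})
put_item("run-001", "signal#0002", {"value": 0.57})
put_item("run-002", "signal#0001", {"value": 0.13})
print(len(query("run-001")))  # 2
```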
25. Architecture with DynamoDB
[Diagram: the user's client connects over the Internet through Amazon Route 53 (DNS) and Amazon CloudFront (CDN) to load balancers, Auto Scaling web server and app server tiers, and Amazon DynamoDB.]
27. What Were the Gaps?
Item size
• Some of our items (e.g., an Instrument Run) exceed DynamoDB's item size limit of 400 KB
Hot hash key
• Adding thousands of related records (e.g., Raw Signals) with a common hash key (e.g., Instrument Run) can be slow (tens of seconds)
Provisioned throughput cost
• For a large project, high read/write capacity (thousands of units) was needed, increasing cost
Batch size limitation of 25
• A large project can have ~1 million records (e.g., Raw Signals) that need to be read & written at the same time
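The 25-item batch limit forces clients to chunk large record sets; a minimal pure-logic sketch (a real client, e.g. boto3's `batch_writer`, would also retry unprocessed items):

```python
# Sketch: working around DynamoDB's 25-item BatchWriteItem limit by
# chunking a large record set. Pure logic only; a real client would
# also handle retries of unprocessed items.
BATCH_LIMIT = 25

def chunk(records, size=BATCH_LIMIT):
    """Split records into batches no larger than the API limit."""
    return [records[i:i + size] for i in range(0, len(records), size)]

raw_signals = [{"id": i} for i in range(1000)]
batches = chunk(raw_signals)
print(len(batches), len(batches[0]))  # 40 25
```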
28. What We Needed…
…was a solution that
• can store a huge number of related objects
• is cost-effective for reading/writing large data sets
• has no limitations on batch size or item size
• can query into a large number of records
29. Our Iterative Journey & Challenges
0 Start with reference architecture
1 Identify scalable storage solution: DynamoDB
2 Identify scalable storage solution for large data items
3 Identify solutions for real-time response & queries
4 Identify solutions for real-time analysis of data
30. Architecture: DynamoDB & Amazon S3
[Diagram: same topology as slide 25, with Amazon S3 added alongside DynamoDB for large data items.]
33. Our Iterative Journey & Challenges
0 Start with reference architecture
1 Identify scalable storage solution: DynamoDB
2 Identify scalable storage solution for large data items: DynamoDB + Amazon S3
3 Identify solutions for fast, real-time response & queries
4 Identify solutions for real-time analysis of data
34. Distributed In-Memory Storage: ElastiCache
• Queries have to perform on the order of milliseconds, so reading & writing from S3 was not feasible for interactive use cases.
• ElastiCache was used as IN-MEMORY storage on top of DynamoDB & Amazon S3.
• All serialized objects in Amazon S3 that are part of a query access pattern are maintained in ElastiCache as individual records.
• Non-clustered indexes were created in DynamoDB based on the query pattern so that data could be efficiently retrieved from ElastiCache.
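The index-based retrieval pattern above can be sketched with dicts standing in for the three stores (all keys and object names are illustrative):

```python
# Sketch of index-based, cache-first retrieval: a DynamoDB index maps
# a query pattern to object keys; objects live durably in S3 and are
# kept warm in ElastiCache. Dicts stand in for all three stores.
dynamo_index = {"run-001": ["obj-1", "obj-2"]}   # query pattern -> keys
s3 = {"obj-1": b"signal-block-1", "obj-2": b"signal-block-2"}
cache = {}

def fetch(run_id):
    """Resolve object keys via the index, then read cache-first."""
    results = []
    for key in dynamo_index[run_id]:
        if key not in cache:          # miss: fall back to S3, warm cache
            cache[key] = s3[key]
        results.append(cache[key])
    return results

print(fetch("run-001"))
print("obj-1" in cache)  # True
```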
35. Architecture: DynamoDB, Amazon S3 & ElastiCache
[Diagram: same topology as slide 25, with ElastiCache added in front of DynamoDB and Amazon S3.]
37. Need for Real-time Data Analysis
• Analyze huge projects containing thousands of patient samples in
minutes instead of days
• A scalable solution is required to support analysis requests from
thousands of users
• Existing desktop algorithms used for this analysis are not optimized
for extracting parallelism in data
39. Our Iterative Journey & Challenges
0 Start with reference architecture
1 Identify scalable storage solution: DynamoDB
2 Identify scalable storage solution for large data items: DynamoDB + Amazon S3
3 Identify solutions for fast, real-time response & queries: DynamoDB + Amazon S3 + ElastiCache
4 Identify solutions for real-time analysis of data
40. Amazon EMR
• Amazon EMR was used to perform real-time analysis of huge data sets (analysis results) in minutes instead of days.
• Small jobs are analyzed in memory, while big ones are sent to Amazon EMR.
• Existing algorithms were overhauled to derive massive parallelism using the Hadoop MapReduce framework.
• As large data sets were already in Amazon S3, we used Amazon S3 for input and output instead of HDFS; only intermediate MapReduce data lives in HDFS.
• The Amazon EMR cluster is created on demand and shut down when done.
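The per-record decomposition behind the MapReduce rewrite can be shown with a toy in-memory version (the sample data and the mean-per-gene aggregation are illustrative, not the actual algorithms):

```python
# Sketch: a MapReduce-style decomposition, simulated in memory. Each
# mapper emits (gene, signal) independently -- the per-record
# parallelism Hadoop exploits -- and the reducer aggregates per key.
# Data and key names are illustrative.
from collections import defaultdict

samples = [("geneA", 1.0), ("geneB", 3.0), ("geneA", 2.0), ("geneB", 5.0)]

def map_phase(records):
    for gene, signal in records:
        yield gene, signal            # independent per record

def reduce_phase(pairs):
    grouped = defaultdict(list)
    for gene, signal in pairs:        # shuffle: group by key
        grouped[gene].append(signal)
    return {g: sum(v) / len(v) for g, v in grouped.items()}

print(reduce_phase(map_phase(samples)))  # {'geneA': 1.5, 'geneB': 4.0}
```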
41. Architecture: Amazon EMR for Real-time Analysis
[Diagram: same topology as slide 25, with an Amazon EMR cluster added alongside DynamoDB, Amazon S3, and ElastiCache.]
47. Road Blocks
• Unbalanced routing of analysis requests (asynchronous calls) by Elastic Load Balancing (ELB)
  Routing algorithms supported by ELB:
  o Round robin
  o Least outstanding requests
  Neither option routes based on memory or utilization
• ElastiCache cluster
  o Unreliable data eviction
  o Connection pool overhead
  o Need for a robust failover mechanism
48. Deployment Overview
[Diagram: clients reach the platform through Amazon Route 53 (DNS) and AWS CloudFront (CDN); Apache serves static content, while requests are routed to Application 1 servers, Application 2 servers, and dedicated analysis servers (requests matching /analysis/*).]
51. Analysis Services: Evolved Design with Queue
Analysis requests flow through ELB into per-server job queues.
ISSUES
• If an analysis server crashes, all of its queued jobs are lost
• Control over jobs is difficult:
  o managing them in a dead-letter queue
  o redirecting jobs to different servers can't be done
52. Analysis Services: Evolved Design ― Amazon SQS
• Analysis servers query the depth of the SQS queue
• They accept jobs based on available capacity and execute them
• The fleet scales up & down based on the number of jobs in SQS
• Can start with fewer analysis servers, since the fleet scales out under load
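The pull-based dispatch can be sketched with a local queue standing in for SQS (capacity numbers and names are illustrative):

```python
# Sketch: workers poll queue depth and claim only as many jobs as they
# have free capacity; a scaler sizes the fleet from the backlog.
# The deque stands in for SQS; all numbers are illustrative.
from collections import deque

queue = deque(f"job-{i}" for i in range(10))  # stand-in for SQS

def claim_jobs(free_slots):
    """A worker takes at most as many jobs as it can run right now."""
    return [queue.popleft() for _ in range(min(free_slots, len(queue)))]

def desired_workers(depth, jobs_per_worker=4, min_workers=1):
    """Scale the fleet from backlog depth (ceiling division)."""
    return max(min_workers, -(-depth // jobs_per_worker))

print(desired_workers(len(queue)))  # 3 workers for 10 queued jobs
taken = claim_jobs(free_slots=2)
print(taken, len(queue))            # ['job-0', 'job-1'] 8
```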
53. Amazon SQS Design
Scaling-down termination policies:
• Options: oldest instance, newest instance, oldest launch configuration, closest to next instance hour
• Default: oldest launch configuration, then closest to the next billing hour
Problem: termination can happen while an analysis is in progress.
Working with Amazon to enable:
• Scaling down by memory usage in the servers using lifecycle hooks
• This will prevent termination of instances while analysis is happening
55. Cache Cluster Design
[Diagram: a shared cache cluster/pool spanning Availability Zones 1 and 2 serves the Gene Expression app, the Genotyping app, and app N, backed by DynamoDB and Amazon S3.]
56. Reliable Cache Eviction: Initial Design
Auto eviction: OFF; time to live (TTL): 60 minutes
• Some expired objects were not evicted
• Objects built up in memory
• This led to out-of-memory exceptions
57. Reliable Cache: With CRON Jobs
A cron job walks the memcached slabs, collects keys older than 7 days, and touches each one with a get: memcached evicts an expired item lazily when it is accessed. (mc_get_slab_list, mc_get_keys_for_slab, mc_get_keys, and mc_gets are helper functions from the original script.)

1. Collect all keys older than 7 days from every slab:

getKeys() {
    local OLDER_THAN_DATE=${1}
    # For each slab, list the keys older than the cutoff
    mc_get_slab_list "$@" | while read -r i
    do
        mc_get_keys_for_slab -d "${OLDER_THAN_DATE}" -s "${i}"
    done
}

2. Delete the expired objects by touching them until none remain:

echo "Begin deleting all old keys from all slabs"
# A get on an expired key makes memcached evict it; loop until the
# count of remaining old keys drops to zero
while [ "$(mc_get_keys | cut -d ' ' -f1 | xargs mc_gets | grep -v END | wc -l)" -gt 0 ]
do
    mc_get_keys | cut -d ' ' -f1 | xargs mc_gets | grep -v END | wc -l
done
echo "Done with deleting all old keys from all slabs"
Thanks to AWS Support Team
58. Connection Pooling: Initial Design
[Diagram: many clients each open their own connections directly to the cache cluster/pool.]
• Multiple connections are open per client
• The cluster has to manage the connection overhead & memory
59. mcrouter Connection Pooling…
• Multiple clients connect to an mcrouter proxy and share its outgoing connections
• This reduces the number of open connections to the cache instances
61. Classification of Data in Cache with mcrouter
Prefix routing of data in cache: mcrouter routes keys to pools according to common key prefixes.
Keys with the “CRITICAL” prefix go to the sharded pool “CRITICAL”
• Contains very critical data that is not recoverable:
  o Interim instrument run data while uploading
  o Interim analysis results data while saving
  o Project lock information
Keys with the “NON-CRITICAL” prefix go to the sharded pool “NON-CRITICAL”
• Contains non-critical data that is recoverable from the DB:
  o Thumbnail images of results
  o Project permissions
  o Default settings for analysis
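The prefix rule can be illustrated with a tiny router (in a real deployment this lives in mcrouter's route configuration; the prefixes mirror the slide, the pool names are illustrative):

```python
# Sketch: prefix routing in the spirit of mcrouter's prefix routes.
# Prefixes mirror the slide; pool names are illustrative only.
POOLS = {"CRITICAL:": "critical-pool", "NON-CRITICAL:": "non-critical-pool"}

def route(key, default="non-critical-pool"):
    """Pick a cache pool from the key's prefix."""
    for prefix, pool in POOLS.items():
        if key.startswith(prefix):
            return pool
    return default

print(route("CRITICAL:project-lock:42"))     # critical-pool
print(route("NON-CRITICAL:thumbnail:42"))    # non-critical-pool
```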
62. Failover for CRITICAL Data Pool with mcrouter
[Diagram: the mcrouter proxy writes CRITICAL data to a primary cache and a replicated failover cache, each spanning Availability Zones 1 and 2; on a data miss, the request is re-routed to the failover cache.]
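The miss-failover read path can be sketched with dicts standing in for the two replicated pools (keys and values are illustrative):

```python
# Sketch: read from the primary pool; on a data miss, re-route the
# request to the replicated failover pool. Dicts stand in for the
# two cache pools; keys and values are illustrative.
primary = {"CRITICAL:lock:1": "held"}
failover = {"CRITICAL:lock:1": "held", "CRITICAL:lock:2": "free"}

def get(key):
    """Primary first; fall back to the failover cache on a miss."""
    if key in primary:
        return primary[key]
    return failover.get(key)          # miss -> failover pool

print(get("CRITICAL:lock:1"))  # held (primary hit)
print(get("CRITICAL:lock:2"))  # free (primary miss, failover hit)
```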
65. IoT in Life Sciences
• Connected things: instruments
• Connected things: products (RFID, NFC, etc.)
• Insightful analytics for faster discovery
• Operational efficiency
• Enhanced customer experience
Source: Cisco, http://blogs.cisco.com/diversity/the-internet-of-things-infographic
66. QuantStudio 3 & 5: Cloud-Connected Platform
• Connected laptop with QuantStudio® 3 & 5 Data Collection Software, linked over LAN or Wi-Fi
• Download instrument run files, download protocols, and upload run files
• Connect to the genetic analysis cloud software using any device
69. Press Coverage
• businesswire.com: Chris Linthwaite, President, Genetic Sciences, Thermo Fisher Scientific
• American Association of Cancer Research: QuantStudio 3 & 5 platform named one of the top 7 new technologies for cancer research @AACR
• genomeweb.com: CE instruments connect to CDC
71. Predictive Analysis/Deep Learning…
• Enhanced workflows & results through predictive analysis & machine learning
  Example: improve the diversity of the training data set to identify mutations, active genes, etc.
• Trending on cumulative data for a user
  Example: use for better quality control, catching operator error, etc.
• Internet of Things with connected instruments
  Example: predictive analysis of calibration issues, machine health, etc.
72. Empower Scientists: Social Connection
Examples:
• Connect and collaborate with scientists to share experiment info
• Share or crowd-source troubleshooting among users
• Make your research & publications visible
• Donate your genomic data to public genetic databases
73. Open Ecosystem
SaaS application platform:
• Open our ecosystem to third-party application developers
• Support analysis of data by different industry platforms and standardized file formats
75. Kala-azar, a.k.a. Visceral Leishmaniasis or Black Fever
• Infectious disease caused by protozoan parasites of the Leishmania genus (two strains: L. donovani, L. infantum), spread by the sand fly
• Symptoms: skin ulcers and swelling of the liver and spleen
• The second-largest parasitic killer in the world, after malaria
• Annual occurrence rate (WHO): 300,000 new cases/year in Sudan, Bangladesh, India, and Nepal
• ~30,000 deaths annually; fatality 90% if untreated
76. Customer Success Story: Connected Ecosystem
Dr. Eisei Noiri, Associate Professor, University of Tokyo Hospital
• Developed assays to identify different species of Leishmania for diagnosis
• Collaborates with partners in Bangladesh, who run patient samples on the QuantStudio 5 real-time PCR system
• Defined experiment run protocols & shared them for running the instrument
• Collaborators can share analysis results with Dr. Noiri