Thermo Fisher Scientific, a world leader in biotechnology, has built a new polymerase chain reaction (PCR) system for genetic analysis. Designed for low- to mid-level throughput laboratories that conduct real-time PCR experiments, the system runs on individual QuantStudio devices. These devices are connected to Thermo Fisher’s cloud computing platform, which is built on AWS using Amazon EC2, Amazon DynamoDB, and Amazon S3. With this single platform, applied and clinical researchers can learn, analyze, share, collaborate, and obtain support. Researchers worldwide can now collaborate online in real time and access their data wherever and whenever necessary. Laboratories can also share experimental conditions and results with their partners while providing a uniform experience for every user and helping to minimize training and errors. The net result is increased collaboration, faster time to market, fewer errors, and lower cost. We have architected a solution that uses Amazon EMR, DynamoDB, Amazon ElastiCache, and S3. In this presentation, we share our architecture, lessons learned, best design patterns for NoSQL, strategies for leveraging EMR with DynamoDB, and the flexible solution our scientists use. We also share the next step in our architecture’s evolution.
2. About Me
Puneet Suri
Senior Director, Software Engineering
Life Sciences Group, Thermo Fisher Scientific
follow at: @psuri · connect at: puneet.suri@thermofisher.com
Envisioned and developed the life sciences cloud platform for Thermo Fisher Scientific
3. 4.5 hours of TV
11 hours of online reading
5 hours of radio
2 hours of computer
5 hours of handheld
1.13 hours of phone
4. 37.2 trillion cells in our body
• DNA in a single cell: 6 feet long
• DNA from all cells: Earth to Moon 6,000 times; Earth to Sun 30 times
• Basepairs in DNA: 3.2 billion (ATGC), at 2 bytes/basepair
• Human genome: 200 gigabytes
• Population of 7 billion: 1.4 zettabytes
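The population-scale figure follows from simple arithmetic; a quick sanity check of the slide's numbers (assuming decimal units, 1 GB = 10^9 bytes, 1 ZB = 10^21 bytes):

```python
# Back-of-the-envelope check of the slide's numbers, assuming decimal
# units (1 GB = 1e9 bytes, 1 ZB = 1e21 bytes).
GENOME_BYTES = 200 * 10**9   # ~200 GB of data per human genome
POPULATION = 7 * 10**9       # 7 billion people

total_bytes = GENOME_BYTES * POPULATION
zettabytes = total_bytes / 10**21
print(zettabytes)            # 1.4
```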
8. Having an Impact…
• Freeing the innocent: a person was set free after 35 years in prison because of a DNA test
• H1N1 pandemic: CDC found that swine flu viruses in the U.S. and Mexico match
• Ebola
• 7500 instrument
9. What to Expect from the Session
• Refresh / overview of our architecture
• Evolution of the architecture
• Connected ecosystem
• Future possibilities
10. A Day with the Scientist
[Workflow graphic: a scientist's day, from a project to getting insights]
11. Insights
• What is causing cancer?
• Is the disease hereditary?
• What drugs will work?
• Is therapy working?
14. Our Offerings & Pain Points
Our offerings: desktop apps, multiple tools
Pain points: no backup, archive, or security; limited analysis; limited collaboration; slow analysis; unsupported workflows; Excel nightmare
15. Customer Needs…
• Real-time analysis
• Highly interactive visualizations
• 2-3 second response time
• Easy to share data & collaborate
• Store & manage large scientific data sets
16. A Better Way Is to Provide…
STORAGE
COMPUTE
SCALABILITY
MEMORY
20. Custom Storage Engine
To meet the challenges of storage, compute,
performance & scalability, a custom storage and query
engine had to be designed and implemented
21. Design Goals for Our Custom Storage Engine
• High availability (HA) out of the box – no config
needed to turn on HA
• Completely decentralized, so no SPOF
• Managed scalability such that no admin input
required to scale storage engine as data volume
and concurrent requests go up
• Manageable total cost of ownership (TCO)
22. Custom Storage Engine Design
• Amazon S3: redundant storage for big data sets
• DynamoDB: stores indexes and small objects
• ElastiCache: in-memory caching for faster access using index-based retrieval
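The three tiers suggest a simple size-based routing rule; a minimal sketch, assuming the 400 KB DynamoDB item-size limit discussed later in the deck (`choose_tier` is an illustrative helper, not the platform's actual code):

```python
# Sketch: route an object to a storage tier by size, assuming the
# 400 KB DynamoDB item-size ceiling mentioned later in the deck.
DYNAMODB_ITEM_LIMIT = 400 * 1024  # bytes

def choose_tier(payload: bytes) -> str:
    """Small objects and indexes go to DynamoDB; big data sets to S3."""
    return "dynamodb" if len(payload) <= DYNAMODB_ITEM_LIMIT else "s3"

print(choose_tier(b"x" * 1024))          # dynamodb
print(choose_tier(b"x" * 1024 * 1024))   # s3
```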
23. Our Iterative Journey & Challenges
0 Start with reference architecture
1 Identify scalable storage solution
2 Identify scalable storage solution for large data items
3 Identify solutions for real-time response & queries
4 Identify solutions for real-time analysis of data
24. NoSQL (DynamoDB)
• Managed scalability
• Near zero administration overhead
• Query performance not impacted by table size;
can add billions of rows
• Key value store allows for flexibility, so new
domains can be supported
• Read/write performance on the order of tens of ms
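The key-value flexibility above can be illustrated with a toy model of a table keyed by a hash key and a range key (the schema and names like "run-001" are illustrative, not the platform's actual design):

```python
# Toy model of the key-value design the slide describes: a hash key
# groups related records, a range key distinguishes them. Schema and
# key names are illustrative, not the actual platform's.
table = {}  # stand-in for a DynamoDB table: (hash_key, range_key) -> item

def put_item(hash_key, range_key, item):
    table[(hash_key, range_key)] = item

def query(hash_key):
    """Fetch all records sharing a hash key. In DynamoDB this lookup
    cost does not grow with table size; the scan here is a stand-in."""
    return [v for (h, r), v in sorted(table.items()) if h == hash_key]

put_item("run-001", "signal#0001", {"value": 0.42})
put_item("run-001", "signal#0002", {"value": 0.57})
put_item("run-002", "signal#0001", {"value": 0.13})
print(len(query("run-001")))  # 2
```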
25. Architecture with DynamoDB
[Diagram: the user's client connects over the Internet through Amazon Route 53 (DNS) and Amazon CloudFront (CDN) to load balancers, Auto Scaling web server and app server tiers, and Amazon DynamoDB.]
27. What Were the Gaps?
Item size
• Some of our items (e.g., an Instrument Run) exceed DynamoDB's item size limit of 400 KB
Hot hash key
• Adding thousands of related records (e.g., Raw Signals) with a common hash key (e.g., Instrument Run) can be slow (tens of seconds)
Provisioned throughput cost
• For a large project, high read/write capacity (thousands of units) was needed, increasing cost
Batch size limitation of 25
• A large project can have ~1 million records (e.g., Raw Signals) that need to be read & written at the same time
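The 25-item batch limit forces clients to chunk large record sets; a minimal pure-logic sketch (a real client, e.g. boto3's `batch_writer`, would also retry unprocessed items):

```python
# Sketch: working around DynamoDB's 25-item BatchWriteItem limit by
# chunking a large record set. Pure logic only; a real client would
# also handle retries of unprocessed items.
BATCH_LIMIT = 25

def chunk(records, size=BATCH_LIMIT):
    """Split records into batches no larger than the API limit."""
    return [records[i:i + size] for i in range(0, len(records), size)]

raw_signals = [{"id": i} for i in range(1000)]
batches = chunk(raw_signals)
print(len(batches), len(batches[0]))  # 40 25
```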
28. What We Needed…
…was a solution that
• can store a huge number of related objects
• is cost-effective for reading/writing large data sets
• has no limitations on batch size or item size
• can query into a large number of records
29. Our Iterative Journey & Challenges
0 Start with reference architecture
1 Identify scalable storage solution: DynamoDB
2 Identify scalable storage solution for large data items
3 Identify solutions for real-time response & queries
4 Identify solutions for real-time analysis of data
30. Architecture: DynamoDB & Amazon S3
[Diagram: same topology as slide 25, with Amazon S3 added alongside DynamoDB for large data items.]
33. Our Iterative Journey & Challenges
0 Start with reference architecture
1 Identify scalable storage solution: DynamoDB
2 Identify scalable storage solution for large data items: DynamoDB + Amazon S3
3 Identify solutions for fast, real-time response & queries
4 Identify solutions for real-time analysis of data
34. Distributed In-Memory Storage: ElastiCache
• Queries have to perform on the order of milliseconds, so reading & writing from S3 was not feasible for interactive use cases.
• ElastiCache was used as IN-MEMORY storage on top of DynamoDB & Amazon S3.
• All serialized objects in Amazon S3 that are part of a query access pattern are maintained in ElastiCache as individual records.
• Non-clustered indexes were created in DynamoDB based on the query pattern so that data could be efficiently retrieved from ElastiCache.
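The index-based retrieval pattern above can be sketched with dicts standing in for the three stores (all keys and object names are illustrative):

```python
# Sketch of index-based, cache-first retrieval: a DynamoDB index maps
# a query pattern to object keys; objects live durably in S3 and are
# kept warm in ElastiCache. Dicts stand in for all three stores.
dynamo_index = {"run-001": ["obj-1", "obj-2"]}   # query pattern -> keys
s3 = {"obj-1": b"signal-block-1", "obj-2": b"signal-block-2"}
cache = {}

def fetch(run_id):
    """Resolve object keys via the index, then read cache-first."""
    results = []
    for key in dynamo_index[run_id]:
        if key not in cache:          # miss: fall back to S3, warm cache
            cache[key] = s3[key]
        results.append(cache[key])
    return results

print(fetch("run-001"))
print("obj-1" in cache)  # True
```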
35. Architecture: DynamoDB, Amazon S3 & ElastiCache
[Diagram: same topology as slide 25, with ElastiCache added in front of DynamoDB and Amazon S3.]
37. Need for Real-time Data Analysis
• Analyze huge projects containing thousands of patient samples in
minutes instead of days
• A scalable solution is required to support analysis requests from
thousands of users
• Existing desktop algorithms used for this analysis are not optimized
for extracting parallelism in data
39. Our Iterative Journey & Challenges
0 Start with reference architecture
1 Identify scalable storage solution: DynamoDB
2 Identify scalable storage solution for large data items: DynamoDB + Amazon S3
3 Identify solutions for fast, real-time response & queries: DynamoDB + Amazon S3 + ElastiCache
4 Identify solutions for real-time analysis of data
40. Amazon EMR
• Amazon EMR was used to perform real-time analysis of huge data sets (analysis results) in minutes instead of days.
• Small jobs are analyzed in memory, while big ones are sent to Amazon EMR.
• Existing algorithms were overhauled to derive massive parallelism using the Hadoop MapReduce framework.
• As large data sets were already in Amazon S3, we used Amazon S3 for input and output instead of HDFS; only intermediate MapReduce data lives in HDFS.
• The Amazon EMR cluster is created on demand and shut down when done.
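The per-record decomposition behind the MapReduce rewrite can be shown with a toy in-memory version (the sample data and the mean-per-gene aggregation are illustrative, not the actual algorithms):

```python
# Sketch: a MapReduce-style decomposition, simulated in memory. Each
# mapper emits (gene, signal) independently -- the per-record
# parallelism Hadoop exploits -- and the reducer aggregates per key.
# Data and key names are illustrative.
from collections import defaultdict

samples = [("geneA", 1.0), ("geneB", 3.0), ("geneA", 2.0), ("geneB", 5.0)]

def map_phase(records):
    for gene, signal in records:
        yield gene, signal            # independent per record

def reduce_phase(pairs):
    grouped = defaultdict(list)
    for gene, signal in pairs:        # shuffle: group by key
        grouped[gene].append(signal)
    return {g: sum(v) / len(v) for g, v in grouped.items()}

print(reduce_phase(map_phase(samples)))  # {'geneA': 1.5, 'geneB': 4.0}
```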
41. Architecture: Amazon EMR for Real-time Analysis
[Diagram: same topology as slide 25, with an Amazon EMR cluster added alongside DynamoDB, Amazon S3, and ElastiCache.]
47. Road Blocks
• Unbalanced routing of analysis requests (asynchronous calls) by Elastic Load Balancing (ELB)
  Routing algorithms supported by ELB:
  o Round robin
  o Least outstanding requests
  Neither option routes based on memory or utilization
• ElastiCache cluster
  o Unreliable data eviction
  o Connection pool overhead
  o Need for a robust failover mechanism
48. Deployment Overview
[Diagram: clients reach the platform through Amazon Route 53 (DNS) and AWS CloudFront (CDN); Apache serves static content, while requests are routed to Application 1 servers, Application 2 servers, and dedicated analysis servers (requests matching /analysis/*).]
51. Analysis Services: Evolved Design with Queue
Analysis requests flow through ELB into per-server job queues.
ISSUES
• If an analysis server crashes, all of its queued jobs are lost
• Control over jobs is difficult:
  o managing them in a dead-letter queue
  o redirecting jobs to different servers can't be done
52. Analysis Services: Evolved Design ― Amazon SQS
• Analysis servers query the depth of the SQS queue
• They accept jobs based on available capacity and execute them
• The fleet scales up & down based on the number of jobs in SQS
• Can start with fewer analysis servers, since the fleet scales out under load
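The pull-based dispatch can be sketched with a local queue standing in for SQS (capacity numbers and names are illustrative):

```python
# Sketch: workers poll queue depth and claim only as many jobs as they
# have free capacity; a scaler sizes the fleet from the backlog.
# The deque stands in for SQS; all numbers are illustrative.
from collections import deque

queue = deque(f"job-{i}" for i in range(10))  # stand-in for SQS

def claim_jobs(free_slots):
    """A worker takes at most as many jobs as it can run right now."""
    return [queue.popleft() for _ in range(min(free_slots, len(queue)))]

def desired_workers(depth, jobs_per_worker=4, min_workers=1):
    """Scale the fleet from backlog depth (ceiling division)."""
    return max(min_workers, -(-depth // jobs_per_worker))

print(desired_workers(len(queue)))  # 3 workers for 10 queued jobs
taken = claim_jobs(free_slots=2)
print(taken, len(queue))            # ['job-0', 'job-1'] 8
```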
53. Amazon SQS Design
Scaling-down termination policies:
• Options: oldest instance, newest instance, oldest launch configuration, closest to next instance hour
• Default: oldest launch configuration, then closest to the next billing hour
Problem: termination can happen while an analysis is in progress.
Working with Amazon to enable:
• Scaling down by memory usage in the servers using lifecycle hooks
• This will prevent termination of instances while analysis is happening
55. Cache Cluster Design
[Diagram: a shared cache cluster/pool spanning Availability Zones 1 and 2 serves the Gene Expression app, the Genotyping app, and app N, backed by DynamoDB and Amazon S3.]
56. Reliable Cache Eviction: Initial Design
Auto eviction: OFF; time to live (TTL): 60 minutes
• Some expired objects were not evicted
• Objects built up in memory
• This led to out-of-memory exceptions
57. Reliable Cache: With CRON Jobs
A cron job walks the memcached slabs, collects keys older than 7 days, and touches each one with a get: memcached evicts an expired item lazily when it is accessed. (mc_get_slab_list, mc_get_keys_for_slab, mc_get_keys, and mc_gets are helper functions from the original script.)

1. Collect all keys older than 7 days from every slab:

getKeys() {
    local OLDER_THAN_DATE=${1}
    # For each slab, list the keys older than the cutoff
    mc_get_slab_list "$@" | while read -r i
    do
        mc_get_keys_for_slab -d "${OLDER_THAN_DATE}" -s "${i}"
    done
}

2. Delete the expired objects by touching them until none remain:

echo "Begin deleting all old keys from all slabs"
# A get on an expired key makes memcached evict it; loop until the
# count of remaining old keys drops to zero
while [ "$(mc_get_keys | cut -d ' ' -f1 | xargs mc_gets | grep -v END | wc -l)" -gt 0 ]
do
    mc_get_keys | cut -d ' ' -f1 | xargs mc_gets | grep -v END | wc -l
done
echo "Done with deleting all old keys from all slabs"
Thanks to AWS Support Team
58. Connection Pooling: Initial Design
[Diagram: many clients each open their own connections directly to the cache cluster/pool.]
• Multiple connections are open per client
• The cluster has to manage the connection overhead & memory
59. mcrouter Connection Pooling…
• Multiple clients connect to an mcrouter proxy and share its outgoing connections
• This reduces the number of open connections to the cache instances
61. Classification of Data in Cache with mcrouter
Prefix routing of data in cache: mcrouter routes keys to pools according to common key prefixes.
Keys with the “CRITICAL” prefix go to the sharded pool “CRITICAL”
• Contains very critical data that is not recoverable:
  o Interim instrument run data while uploading
  o Interim analysis results data while saving
  o Project lock information
Keys with the “NON-CRITICAL” prefix go to the sharded pool “NON-CRITICAL”
• Contains non-critical data that is recoverable from the DB:
  o Thumbnail images of results
  o Project permissions
  o Default settings for analysis
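The prefix rule can be illustrated with a tiny router (in a real deployment this lives in mcrouter's route configuration; the prefixes mirror the slide, the pool names are illustrative):

```python
# Sketch: prefix routing in the spirit of mcrouter's prefix routes.
# Prefixes mirror the slide; pool names are illustrative only.
POOLS = {"CRITICAL:": "critical-pool", "NON-CRITICAL:": "non-critical-pool"}

def route(key, default="non-critical-pool"):
    """Pick a cache pool from the key's prefix."""
    for prefix, pool in POOLS.items():
        if key.startswith(prefix):
            return pool
    return default

print(route("CRITICAL:project-lock:42"))     # critical-pool
print(route("NON-CRITICAL:thumbnail:42"))    # non-critical-pool
```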
62. Failover for CRITICAL Data Pool with mcrouter
[Diagram: the mcrouter proxy writes CRITICAL data to a primary cache and a replicated failover cache, each spanning Availability Zones 1 and 2; on a data miss, the request is re-routed to the failover cache.]
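The miss-failover read path can be sketched with dicts standing in for the two replicated pools (keys and values are illustrative):

```python
# Sketch: read from the primary pool; on a data miss, re-route the
# request to the replicated failover pool. Dicts stand in for the
# two cache pools; keys and values are illustrative.
primary = {"CRITICAL:lock:1": "held"}
failover = {"CRITICAL:lock:1": "held", "CRITICAL:lock:2": "free"}

def get(key):
    """Primary first; fall back to the failover cache on a miss."""
    if key in primary:
        return primary[key]
    return failover.get(key)          # miss -> failover pool

print(get("CRITICAL:lock:1"))  # held (primary hit)
print(get("CRITICAL:lock:2"))  # free (primary miss, failover hit)
```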
65. IoT in Life Sciences
• Connected things: instruments
• Connected things: products (RFID, NFC, etc.)
• Insightful analytics for faster discovery
• Operational efficiency
• Enhanced customer experience
Source: Cisco, http://blogs.cisco.com/diversity/the-internet-of-things-infographic
66. QuantStudio 3 & 5: Cloud-Connected Platform
• Connected laptop with QuantStudio® 3 & 5 Data Collection Software, linked over LAN or Wi-Fi
• Download instrument run files, download protocols, and upload run files
• Connect to the genetic analysis cloud software using any device
69. Press Coverage
• businesswire.com: Chris Linthwaite, President, Genetic Sciences, Thermo Fisher Scientific
• American Association of Cancer Research: QuantStudio 3 & 5 platform named one of the top 7 new technologies for cancer research @AACR
• genomeweb.com: CE instruments connect to CDC
71. Predictive Analysis/Deep Learning…
• Enhanced workflows & results through predictive analysis & machine learning
  Example: improve the diversity of the training data set to identify mutations, active genes, etc.
• Trending on cumulative data for a user
  Example: use for better quality control, catching operator error, etc.
• Internet of Things with connected instruments
  Example: predictive analysis of calibration issues, machine health, etc.
72. Empower Scientists: Social Connection
Examples:
• Connect and collaborate with scientists to share experiment info
• Share or crowd-source troubleshooting among users
• Make your research & publications visible
• Donate your genomic data to public genetic databases
73. Open Ecosystem
SaaS application platform:
• Open our ecosystem to third-party application developers
• Support analysis of data by different industry platforms and standardized file formats
75. Kala-azar, a.k.a. Visceral Leishmaniasis or Black Fever
• Infectious disease caused by protozoan parasites of the Leishmania genus (two strains: L. donovani, L. infantum), spread by the sand fly
• Symptoms: skin ulcers and swelling of the liver and spleen
• The second-largest parasitic killer in the world, after malaria
• Annual occurrence rate (WHO): 300,000 new cases/year in Sudan, Bangladesh, India, and Nepal
• ~30,000 deaths annually; fatality 90% if untreated
76. Customer Success Story: Connected Ecosystem
Dr. Eisei Noiri, Associate Professor, University of Tokyo Hospital
• Developed assays to identify different species of Leishmania for diagnosis
• Collaborates with partners in Bangladesh, who run patient samples on the QuantStudio 5 real-time PCR system
• Defined experiment run protocols & shared them for running the instrument
• Collaborators can share analysis results with Dr. Noiri