(HLS402) Getting into Your Genes: The Definitive Guide to Using Amazon EMR, Amazon ElastiCache, and Amazon S3 to Deliver High-Performance, Scientific Applications | AWS re:Invent 2014
The document describes Thermo Fisher Scientific's iterative journey to build a scalable cloud platform using Amazon Web Services for storing and analyzing large scientific datasets. They started with DynamoDB for storage but added S3 to store larger objects and ElastiCache for faster queries. They also implemented Amazon EMR for real-time analysis, improving performance over 10x compared to desktop tools. The platform now enables analyzing millions of records within minutes to provide insights for scientific applications.
1. Puneet Suri, Thermo Fisher Scientific
Shakila Pothini, Thermo Fisher Scientific
Sami Zuhuruddin, Amazon Web Services
November 12, 2014 | Las Vegas, NV
HLS402
Getting into Your Genes: The Definitive Guide to Using Amazon EMR, Amazon ElastiCache, and Amazon S3 to Deliver High-Performance Scientific Applications
2. About me
Puneet Suri
Senior Director, Software Engineering
Life Sciences Group, Thermo Fisher Scientific
follow at: @psuri | connect at: puneet.suri@thermofisher.com
Envisioned and developed the life sciences cloud platform for Thermo Fisher Scientific
4. Having an impact…
Freeing the innocent: a person was set free after 35 years in prison because of a DNA test
Surviving cancer: a person survived pancreatic cancer thanks to a genetic approach that allowed an oncologist to focus on a specific cancer cell
Ebola
9. Our offerings
desktop applications:
• challenges with upgrade cycles, versions, etc.
• limited storage and compute capacity to analyze complex & large data sets
• no sharing & collaboration
• no backup, archive & security
10. A better way… is to provide
STORAGE, COMPUTE, SCALABILITY, MEMORY
13. A day with the scientist
[workflow diagram: create a project … get insights]
14. Insights…
• what is causing the cancer?
• what drugs will work?
• is the therapy working?
15. Customer pain points
• existing solutions cannot address the complexities
• Excel is used, painfully, to manually analyze data
• multiple tools are used to get to the final insight
• it takes days to analyze the data
• some of the analysis workflows are not possible
16. Dimensions of complexity…
• millions of records per project
• thousands of users and projects
• real-time analysis of large datasets, with 2-3 second response times
across storage, compute, performance, and scalability
18. Our iterative journey & challenges
0. start with reference architecture
1. identify scalable storage solution
2. identify scalable storage solution for large data items
3. identify solutions for real-time response & queries
4. identify solutions for real-time analysis of data
19. Reference web architecture
[architecture diagram: user/client traffic arrives over the Internet; Amazon Route 53 handles DNS routing and Amazon CloudFront serves as the CDN. Load balancers feed Auto Scaling groups of web servers and app servers, duplicated across two zones (A and B). The data tier is Amazon RDS in a master/standby pair with synchronous replication.]
20. Why a relational DB was not considered
• based on projected data and user growth over the years (hundreds of TBs), the required real-time query performance would be very hard to achieve
• needed managed scalability without sharding/re-sharding overhead and disruptions
• needed a loose schema to seamlessly enable new and cross-domain workflows
21. Our iterative journey & challenges
0. start with reference architecture
1. identify scalable storage solution
2. identify scalable storage solution for large data items
3. identify solutions for real-time response & queries
4. identify solutions for real-time analysis of data
22. NoSQL was the way to go
• managed scalability
• near-zero administration overhead
• query performance not impacted by table size: can add billions of rows
• simple and flexible schema: new domains can be supported
• extremely fast read/write performance
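The flexible-schema point can be sketched in a few lines. This is a minimal illustration, not the team's code: a plain dict stands in for a DynamoDB table, and all key and attribute names (`instrument_run`, `cycles`, etc.) are hypothetical; a real implementation would go through an AWS SDK such as boto3.

```python
# Sketch of a flexible schema: items in one "table" share a key
# structure but need not share attributes, so a new scientific
# domain can add fields without a schema migration.
table = {}  # keyed by (hash_key, range_key), like a DynamoDB table

def put_item(hash_key, range_key, **attributes):
    """Store an item; any set of attributes is accepted."""
    table[(hash_key, range_key)] = attributes

def get_item(hash_key, range_key):
    """Fetch an item by its full key, or None if absent."""
    return table.get((hash_key, range_key))

# Two items in the same "table" with different attribute sets:
put_item("run#001", "sample#A", instrument="QuantStudio", cycles=40)
put_item("run#001", "sample#B", expression_level=7.2, gene="BRCA1")

print(get_item("run#001", "sample#B")["gene"])  # -> BRCA1
```

The point of the sketch is the absence of any `CREATE TABLE`-style column list: each `put_item` call defines its own attributes.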
23. Architecture with DynamoDB
[architecture diagram: same front end as the reference architecture, with Route 53 (DNS routing), CloudFront (CDN), load balancers, and Auto Scaling groups of web and app servers across two zones (A and B); Amazon DynamoDB (auto scaling) replaces RDS as the data tier.]
24. What worked well with DynamoDB
• managed service with a flexible schema
• managed scalability
• extremely fast read/write access, on the order of milliseconds
26. What were the gaps
• our item attribute sizes (e.g. Instrument Run) range above 400KB (item attribute size limitation of 64KB, since raised to 400KB)
• hot hash key & batch size limitations:
• adding thousands of related records (e.g. Raw Signals) with a common hash key (e.g. Instrument Run) can be slow (tens of seconds)
• a large project can have ~1 million records (e.g. Raw Signals) that need to be read & written
• for a large project, high read/write capacity (thousands of units) was needed, increasing cost
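The batch size limitation above is concrete: DynamoDB's BatchWriteItem API accepts at most 25 items per request, so ~1 million related records means tens of thousands of round trips. A minimal sketch of the required chunking, using generic Python (the record shape is invented, and a real writer would pass each chunk to an SDK call):

```python
# Sketch: split a large record list into DynamoDB-sized batches.
# BatchWriteItem takes at most 25 items per request, so every
# bulk load has to be chunked like this before being sent.
def chunks(items, size=25):
    """Yield successive fixed-size chunks of a record list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

records = [{"signal_id": i} for i in range(103)]  # toy records
batches = list(chunks(records))

print(len(batches))      # 5 batches: 4 full + 1 partial
print(len(batches[-1]))  # 3 records in the final batch
```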
27. What we needed
A solution that:
• can store a huge number of related objects
• is cost-effective for reading/writing large data sets
• has no limitations on batch size or item size
• allows querying into the large number of records
28. Our iterative journey & challenges
0. start with reference architecture
1. identify scalable storage solution: DynamoDB
2. identify scalable storage solution for large data items
3. identify solutions for real-time response & queries
4. identify solutions for real-time analysis of data
29. Architecture with DynamoDB & S3
[architecture diagram: same front end, with Route 53 (DNS routing), CloudFront (CDN), load balancers, and Auto Scaling groups of web and app servers across two zones (A and B); the data tier is now DynamoDB (auto scaling) plus Amazon S3.]
31. Architecture with DynamoDB & S3
• DynamoDB was used to store small unrelated objects (KBs) that will grow to a large number (e.g. Data Files)
• Amazon S3 was used to store related, larger objects (e.g. Raw Signals & Analysis Results (GBs)), stored as a single Amazon S3 object serialized using Google protobuf
• Amazon S3 was cost-effective for storing huge objects
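The split above is a classic pointer pattern: the large serialized blob lives in S3, while a small item in DynamoDB keeps the metadata and the blob's key. A runnable sketch under stated assumptions: dicts stand in for both services, `pickle` stands in for the protobuf serialization the slides mention, and the key layout (`projects/<id>/raw_signals`) is invented.

```python
import pickle  # stand-in here for the protobuf serialization used in the talk

s3 = {}      # "bucket": object key -> serialized bytes
dynamo = {}  # "table": project id -> small pointer item

def store_results(project_id, raw_signals):
    """Serialize the large record set once and keep only a pointer item."""
    blob = pickle.dumps(raw_signals)            # one large serialized object
    s3_key = f"projects/{project_id}/raw_signals"
    s3[s3_key] = blob                           # large blob goes to "S3"
    dynamo[project_id] = {                      # small item goes to "DynamoDB"
        "s3_key": s3_key,
        "record_count": len(raw_signals),
    }

def load_results(project_id):
    """Follow the pointer item to fetch and deserialize the blob."""
    item = dynamo[project_id]
    return pickle.loads(s3[item["s3_key"]])

store_results("proj-42", [{"signal": i * 0.5} for i in range(1000)])
print(dynamo["proj-42"]["record_count"])  # 1000
print(len(load_results("proj-42")))       # 1000
```

Serializing the thousands of related records as one object is what sidesteps both the per-item size limit and the per-batch write limit.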
33. What we needed
• complex visualizations require gigabytes of data to be queried in 2-3 seconds and presented to the user
• visualizations are highly interactive and require constant updates of data: quick reads & writes are needed
• support for concurrent access without any degradation in query performance
34. Our iterative journey & challenges
0. start with reference architecture
1. identify scalable storage solution: DynamoDB
2. identify scalable storage solution for large data items: DynamoDB + Amazon S3
3. identify solutions for fast real-time response & queries
4. identify solutions for real-time analysis of data
35. Distributed in-memory storage was the way to go
reads/writes have to be quick to enable fast response times; reading & writing from Amazon S3 was not ideal
• ElastiCache was used as in-memory storage on top of DynamoDB & Amazon S3
• all related serialized objects in Amazon S3 accessed by customers are maintained in ElastiCache as individual records
• indexes are created in DynamoDB based on the query patterns so that data can be easily retrieved from ElastiCache
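The in-memory layer described above amounts to a read-through cache in front of slower storage. A minimal sketch, with dicts standing in for ElastiCache and S3 (in production this would be memcached or Redis via a client library; all names and the sample record are invented):

```python
# Sketch of a read-through cache: check the fast in-memory tier
# first, and only fall back to the slow storage tier on a miss,
# populating the cache so repeat queries stay fast.
s3_store = {"record:1": {"gene": "TP53", "level": 3.1}}  # slow tier
cache = {}                                               # fast tier
misses = 0

def get_record(key):
    """Return a record, populating the cache on a miss."""
    global misses
    if key in cache:          # fast path: served from memory
        return cache[key]
    misses += 1
    value = s3_store[key]     # slow path: fetch from storage
    cache[key] = value        # keep it hot for the next query
    return value

get_record("record:1")  # miss: loaded from the backing store
get_record("record:1")  # hit: served from memory
print(misses)           # 1
```

The DynamoDB indexes mentioned in the slide play the role of the lookup that turns a user query into the right cache keys.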
36. Architecture with DynamoDB, Amazon S3 & ElastiCache
[architecture diagram: same front end, with Route 53 (DNS routing), CloudFront (CDN), load balancers, and Auto Scaling groups of web and app servers across two zones (A and B); the data tier is now DynamoDB (auto scaling), Amazon S3, and ElastiCache.]
38. Need for real-time data analysis
• analyze huge projects containing thousands of patient samples in minutes instead of days
• a scalable solution is required to support analysis requests from thousands of users
• the existing desktop algorithms used for this analysis were not optimized for extracting parallelism in the data
41. Our iterative journey & challenges
0. start with reference architecture
1. identify scalable storage solution: DynamoDB
2. identify scalable storage solution for large data items: DynamoDB + Amazon S3
3. identify solutions for fast real-time response & queries: DynamoDB + Amazon S3 + ElastiCache
4. identify solutions for real-time analysis of data
42. Amazon EMR was the way to go
1. EMR was used to perform real-time analysis of huge data sets: results in minutes instead of days
2. all small jobs are analyzed in-memory, while big ones are sent to Amazon EMR
3. existing algorithms were overhauled to derive massive parallelism using the Hadoop MapReduce framework
4. as the large datasets are already in Amazon S3, Amazon S3 was used for input and output instead of HDFS; only intermediate MapReduce data lives in HDFS
5. the Amazon EMR cluster is created on demand and shut down when done
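The reshaping in point 3 means turning a serial per-record loop into map and reduce steps that Hadoop can spread across an EMR cluster. A single-process sketch of that shape (the sample data and the mean-expression aggregate are invented for illustration, not the team's algorithm):

```python
# Sketch of the MapReduce shape: map emits (key, value) pairs per
# record, a shuffle groups them by key, and reduce aggregates each
# group. On EMR, Hadoop runs the map and reduce phases in parallel
# across the cluster; here everything runs in one process.
from collections import defaultdict

samples = [("gene_a", 2.0), ("gene_b", 4.0), ("gene_a", 6.0)]

def map_phase(records):
    # map: emit (key, value) pairs, independently per record
    for gene, level in records:
        yield gene, level

def reduce_phase(pairs):
    # shuffle: group values by key, as the framework would
    grouped = defaultdict(list)
    for gene, level in pairs:
        grouped[gene].append(level)
    # reduce: one aggregate per key (here, mean expression level)
    return {gene: sum(v) / len(v) for gene, v in grouped.items()}

result = reduce_phase(map_phase(samples))
print(result)  # {'gene_a': 4.0, 'gene_b': 4.0}
```

Because each map call touches only one record, the work partitions cleanly across nodes, which is what the desktop algorithms lacked.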
43. Architecture with EMR for real-time analysis
[architecture diagram: same front end, with Route 53 (DNS routing), CloudFront (CDN), load balancers, and Auto Scaling groups of web and app servers across two zones (A and B); the data and analysis tier is DynamoDB (auto scaling), Amazon S3, ElastiCache, and Amazon EMR.]
48. About me: Shakila Pothini
Senior Manager, Cloud Apps
Life Sciences Group, Thermo Fisher Scientific
Hiking is my ONLY stress buster
Entertain to educate: cofounder of a performing arts group (swaram.org)
Mostly left-brained, with an occasional sense of creativity
49. How to get into your genes?
1. sequence the entire human transcriptome (30,000 genes)
2. identify significant genes (100+ genes)
3. validate & reconfirm (20+ genes)
4. do it on more samples & different populations
5. find the way the genes interplay in the pathway
6. understand cancer diversity, types of therapy, drug-able genes
51. Demo summary
compare a non-cancerous sample with a cancerous sample to see the difference in the expression of genes
52. Customer feedback
“My initial Symphoni Suite evaluation experience was good; GUI/controls are intuitive and data upload/analysis was fast and user friendly”
UPENN
“I enjoy processing hundreds of open array plates with ease.” “I appreciate the rapid access of the large number of amplification curves”
Sanofi
“I wanted to let you know that Symphoni has been working well for me. I have done analysis using as high as 500 files.”
ASU
“This I see value in... utilizing these features. I appreciate the speed of data processing and visuals.”
LUMC
55. A few years from now: every person
[diagram: a person's sequence (ATGCATGCTATCAATTGCCC) goes into lifecloud, powered by AWS, and is analyzed for melanoma, health risks, and drug response]
56. Yearly check-ups a few years from now
[diagram: sequence fragments (ATGCATGC…ATTGCCC, TATCA, GCATG) are assembled in lifecloud into the full sequence ATGCATGCTATCAATTGCCC]
57. Yearly check-ups a few years from now (cont’d)
[diagram: the sequence (ATGCATGCTATCAATTGCCC) is analyzed in lifecloud for cancer, health risks, and drug response; is there any clinical trial? prescribe the right drug]