(HLS402) Getting into Your Genes: The Definitive Guide to Using Amazon EMR, Amazon ElastiCache, and Amazon S3 to Deliver High-Performance, Scientific Applications | AWS re:Invent 2014

The key to fighting cancer through better therapeutics is a deep understanding of the basic biology of this disease at a cellular and molecular level. Comprehensive analysis of cancer mutations in specific tumors or cancer cell lines by using Life Technologies sequencing and real-time PCR systems generates gigabytes to terabytes of data every day. Our customers bring together this data in studies that seek to discover the genetic fingerprint of cancer. The data typically translates to millions of records in databases that require complex algorithmic processing, cross-application analysis, and interactive visualizations with real-time response (2-3 seconds) to enable users to consume large volumes of complex scientific information.
We have chosen the AWS platform to bring this new era of data analysis power to our customers by using technologies such as Amazon S3, ElastiCache, and DynamoDB for storage and fast access and Amazon EMR for parallelizing complex computations. Our talk tells the story with rich details about challenges and roadblocks in building data-intense, highly interactive applications in the cloud. We also highlight enhanced customer workflows and highly optimized applications with orders of magnitude improvement in performance and scalability.

Published in: Technology

  1. 1. Puneet Suri, Thermo Fisher Scientific; Shakila Pothini, Thermo Fisher Scientific; Sami Zuhuruddin, Amazon Web Services. November 12, 2014 | Las Vegas, NV. HLS402 Getting into Your Genes: The Definitive Guide to Using Amazon EMR, Amazon ElastiCache, and Amazon S3 to Deliver High-Performance Scientific Applications
  2. 2. About me: Puneet Suri, Senior Director, Software Engineering, Life Sciences Group, Thermo Fisher Scientific. Follow at: @psuri; connect at: puneet.suri@thermofisher.com. Envisioned and developed the life sciences cloud platform for Thermo Fisher Scientific
  3. 3. This is why we are here…
  4. 4. Having an impact… Freeing the innocent: a person was set free after 35 years in prison because of a DNA test. Surviving cancer: a person survived pancreatic cancer thanks to a genetic approach that allowed an oncologist to focus on a specific cancer cell. Ebola
  5. 5. H1N1: Pandemic declared in April 2009
  6. 6. Need to enable this at larger scale & impact more lives
  7. 7. Customer needs… store & manage large scientific data sets
  8. 8. A few years back
  9. 9. Our offerings: desktop applications. Challenges with upgrade cycle, versions etc.; limited storage and compute capacity to analyze complex & large data sets; no sharing & collaboration; no backup, archive & security
  10. 10. A better way… is to provide STORAGE, COMPUTE, SCALABILITY, MEMORY
  11. 11. Our vision
  12. 12. A deep dive into our story
  13. 13. A day with the scientist: Get Insights; a project
  14. 14. Insights… •what is causing cancer •what drugs will work •is therapy working
  15. 15. Customer pain points •existing solutions cannot address the complexities •Excel is used painfully to manually analyze data •multiple tools are used to get the final insight •it takes days to analyze the data •some analysis workflows are not possible
  16. 16. Dimensions of complexity… millions of records thousands of users, projects real time analysis of large datasets 2-3 seconds response time project storage compute performance scalability
  17. 17. Our journey enabling complex customer workflows
  18. 18. Our iterative journey & challenges: (0) start with reference architecture; (1) identify scalable storage solution; (2) identify scalable storage solution for large data items; (3) identify solutions for real time response & queries; (4) identify solutions for real time analysis of data
  19. 19. Reference web architecture A B User Client Internet DNS Routing : Route 53 AUTO SCALING WEB SERVERS AUTO SCALING APP SERVERS Amazon RDS MASTER Amazon RDS STANDBY Synchronous Replication Load Balancers Load Balancers WEB SERVERS CDN: CloudFront APP SERVERS
  20. 20. Why relational DB was not considered •based on projected data and user growth over the years (hundreds of TBs), the required real-time query performance would be very hard to achieve •needed managed scalability without sharding/re-sharding overhead and disruptions •needed a loose schema to seamlessly enable new and cross domain workflows
  21. 21. Our iterative journey & challenges: (0) start with reference architecture; (1) identify scalable storage solution; (2) identify scalable storage solution for large data items; (3) identify solutions for real time response & queries; (4) identify solutions for real time analysis of data
  22. 22. NoSQL was the way to go •managed scalability •near zero administration overhead •query performance not impacted by table size; can add billions of rows •simple and flexible schema: new domains can be supported •extremely fast read/write performance
  23. 23. Architecture with DynamoDB A B User Client Internet DNS Routing : Route 53 AUTO SCALING WEB SERVERS AUTO SCALING APP SERVERS Load Balancers Load Balancers WEB SERVERS CDN: CloudFront APP SERVERS Auto Scaling Amazon DynamoDB
  24. 24. What worked well with DynamoDB Managed Service with flexible schema Managed Scalability Extremely fast access in order of milliseconds READ/WRITE
  25. 25. Iteration 1 GBs GBs MBs MBs Instrument Run (1000s) Patient Samples (1000s) Genes (1000s) Raw Signals (millions) Analysis Results (millions) Storage Query Performance ✔ ✔ Cost ✔ ✔ Get Insights     project
  26. 26. What were the gaps: our item attribute (e.g. Instrument Run) size range > 400 KB (item attribute size limitation of 64 KB, since raised to 400 KB); hot hash key & batch size limitations •adding thousands of related records (e.g. Raw Signals) with a common hash key (e.g. Instrument Run) can be slow (10s of seconds) •a large project can have ~1 million records (e.g. Raw Signals) that need to be read & written; for a large project, high read/write capacity (1000s) was needed (increased cost due to high READ/WRITE capacity needs)
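The batch-size and capacity gaps on this slide can be made concrete with a little arithmetic. The sketch below is a back-of-envelope illustration, not the speakers' tooling: it assumes DynamoDB's documented `BatchWriteItem` ceiling of 25 put/delete requests per call, the 400 KB item size limit, and the rule that a standard write consumes one write capacity unit per 1 KB of item size. The function names and the 1 KB record size are hypothetical.

```python
import math

# Documented DynamoDB limits, assumed here for the arithmetic.
MAX_BATCH_WRITE_ITEMS = 25          # BatchWriteItem: at most 25 put/delete requests per call
MAX_ITEM_SIZE_BYTES = 400 * 1024    # maximum size of a single item: 400 KB


def batch_calls_needed(record_count: int) -> int:
    """Number of BatchWriteItem calls needed to write record_count items."""
    return math.ceil(record_count / MAX_BATCH_WRITE_ITEMS)


def write_units_needed(record_count: int, record_size_bytes: int) -> int:
    """Write capacity units consumed: one unit per started 1 KB of each item."""
    units_per_item = math.ceil(record_size_bytes / 1024)
    return record_count * units_per_item


def fits_in_one_item(size_bytes: int) -> bool:
    """True if an object of this size can be stored as a single DynamoDB item."""
    return size_bytes <= MAX_ITEM_SIZE_BYTES


if __name__ == "__main__":
    records = 1_000_000  # "~1 million records" for a large project, per the slide
    print(batch_calls_needed(records), "batch calls")
    print(write_units_needed(records, 1024), "write units at a hypothetical 1 KB per record")
    print(fits_in_one_item(500 * 1024))
```

Under these assumptions, a one-million-record project needs 40,000 batch calls, and writing it within seconds rather than minutes requires provisioning thousands of write units, which is the cost pressure the slide describes.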
  27. 27. What we needed: a solution that •can store a huge number of related objects •is cost effective to read/write large data sets •has no limitations on batch size or item size •can query into the large number of records
  28. 28. Our iterative journey & challenges: (0) start with reference architecture; (1) identify scalable storage solution: DynamoDB; (2) identify scalable storage solution for large data items; (3) identify solutions for real time response & queries; (4) identify solutions for real time analysis of data
  29. 29. Architecture with DynamoDB & S3 A B User Client Internet DNS Routing : Route 53 AUTO SCALING WEB SERVERS AUTO SCALING APP SERVERS Load Balancers Load Balancers WEB SERVERS CDN: CloudFront APP SERVERS Auto Scaling DynamoDB Amazon S3
  30. 30. MBs MBs GBs Iteration 2 GBs Instrument Run (1000s) Patient Samples (1000s) Genes (1000s) Raw Signals (millions) Analysis Results (millions) Storage Query Performance ✔ ✔ Cost ✔ ✔ Get Insights    
  31. 31. Architecture with DynamoDB & S3 •DynamoDB was used to store small unrelated objects (KB) •will grow to a large number (e.g. Data Files) •Amazon S3 was used to store related larger objects (e.g. Raw Signals & Analysis Results (GB)) •stored as a single Amazon S3 object serialized using Google protobuf •Amazon S3 was cost effective for storing huge objects
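The split described on this slide (small objects in DynamoDB, large related objects as one serialized blob in Amazon S3) might be sketched as follows. This is a minimal illustration under stated assumptions, not the speakers' implementation: plain dicts stand in for the DynamoDB table and the S3 bucket, JSON stands in for Google protobuf to keep the example dependency-free, and the `TieredStore` class, the 400 KB routing threshold, and the pointer item kept in the table for large objects are all hypothetical.

```python
import json

# Assumption: route by DynamoDB's 400 KB item size limit.
SMALL_OBJECT_LIMIT = 400 * 1024


class TieredStore:
    """Hypothetical router; dicts stand in for a DynamoDB table and an S3 bucket."""

    def __init__(self):
        self.table = {}    # stand-in for DynamoDB: key -> small item
        self.bucket = {}   # stand-in for S3: object key -> serialized bytes

    def put(self, key, records):
        # The talk serializes with Google protobuf; JSON is used here only
        # so the sketch runs without extra packages.
        blob = json.dumps(records).encode("utf-8")
        if len(blob) <= SMALL_OBJECT_LIMIT:
            # Small object: store it inline in the table.
            self.table[key] = {"inline": blob}
        else:
            # Large object: one blob in the bucket, a small pointer in the table.
            object_key = f"blobs/{key}"
            self.bucket[object_key] = blob
            self.table[key] = {"s3_key": object_key, "size": len(blob)}

    def get(self, key):
        item = self.table[key]
        blob = item["inline"] if "inline" in item else self.bucket[item["s3_key"]]
        return json.loads(blob.decode("utf-8"))


if __name__ == "__main__":
    store = TieredStore()
    store.put("run-1", {"name": "Instrument Run 1"})
    store.put("signals-1", [{"signal": i} for i in range(100_000)])
    print(sorted(store.table["run-1"]), sorted(store.table["signals-1"]))
```

Keeping only a pointer in the table means the key-value store stays small and cheap to query, while the bulk bytes are paid for at object-storage rates.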
  32. 32. Real time queries for complex visualizations
  33. 33. What we needed •complex visualizations require gigabytes of data to be queried in 2-3 secs and presented to the user •visualizations are highly interactive and require constant data updates, so reads & writes must be quick •support concurrent access without any degradation in query performance
  34. 34. Our iterative journey & challenges: (0) start with reference architecture; (1) identify scalable storage solution: DynamoDB; (2) identify scalable storage solution for large data items: DynamoDB + Amazon S3; (3) identify solutions for fast real time response & queries; (4) identify solutions for real time analysis of data
  35. 35. Distributed in-memory storage was the way to go. Reads/writes have to be quick to enable fast response times; reading & writing from Amazon S3 was not ideal. •ElastiCache was used as IN-MEMORY storage on top of DynamoDB & Amazon S3. •All related serialized objects in Amazon S3 accessed by customers are maintained in ElastiCache as individual records. •Indexes are created in DynamoDB based on the query pattern so that data can be easily retrieved from ElastiCache.
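The layering on this slide resembles a cache-aside pattern: read the serialized blob from S3 once, keep its records individually in memory, and use an index to find the records a query needs. The sketch below illustrates that idea only; dicts stand in for S3, ElastiCache, and the DynamoDB index, JSON stands in for protobuf, and the `CachedProjectData` class, the `gene` field, and the key format are hypothetical.

```python
import json


class CachedProjectData:
    """Cache-aside sketch; dicts stand in for S3, ElastiCache and a DynamoDB index."""

    def __init__(self, bucket):
        self.bucket = bucket   # stand-in for S3: object key -> serialized blob
        self.cache = {}        # stand-in for ElastiCache: record key -> record
        self.index = {}        # stand-in for DynamoDB index: (object key, gene) -> record keys
        self.loaded = set()    # object keys whose records are already cached
        self.s3_reads = 0      # counts how often the slow path is taken

    def _load(self, object_key):
        """On a miss, read the whole blob once and cache each record individually."""
        self.s3_reads += 1
        records = json.loads(self.bucket[object_key])
        for i, record in enumerate(records):
            record_key = f"{object_key}#{i}"
            self.cache[record_key] = record
            self.index.setdefault((object_key, record["gene"]), []).append(record_key)
        self.loaded.add(object_key)

    def query(self, object_key, gene):
        """Return all cached records of one object that belong to the given gene."""
        if object_key not in self.loaded:
            self._load(object_key)
        record_keys = self.index.get((object_key, gene), [])
        return [self.cache[k] for k in record_keys]


if __name__ == "__main__":
    blob = json.dumps([
        {"gene": "TP53", "signal": 2.0},
        {"gene": "KRAS", "signal": 1.0},
        {"gene": "TP53", "signal": 4.0},
    ])
    data = CachedProjectData({"project-1/signals": blob})
    print(data.query("project-1/signals", "TP53"))
    print(data.query("project-1/signals", "KRAS"), "S3 reads:", data.s3_reads)
```

Only the first query pays for the S3 read; later queries touch the index and the in-memory records, which is what makes a 2-3 second interactive response plausible.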
  36. 36. Architecture with DynamoDB, Amazon S3 & ElastiCache A B User Client Internet DNS Routing : Route 53 AUTO SCALING WEB SERVERS AUTO SCALING APP SERVERS Load Balancers Load Balancers WEB SERVERS CDN: CloudFront APP SERVERS Auto Scaling DynamoDB Amazon S3 ElastiCache
  37. 37. Iteration 3 MBs MBs GBs GBs Instrument Run (1000s) Patient Samples (1000s) Genes (1000s) Raw Signals (millions) Analysis Results (millions) Storage Query Performance ✔ ✔ Cost ✔ ✔ indexes Get Insights    
  38. 38. Need for real time data analysis •analyze huge projects containing thousands of patient samples in minutes instead of days •a scalable solution is required to support analysis requests from thousands of users •existing desktop algorithms used for this analysis not optimized for extracting parallelism in data
  39. 39. Analysis solutions in desktop [chart of minutes (axis 0 to 350) against # of records (90000, 180000, 270000, 360000, 450000, 675000, 900000); desktop series: 8, 20, 40, 80, 120, 200, 320; annotation: desktop crashes] Get Insights
  40. 40. Excel nightmare
  41. 41. Our iterative journey & challenges: (0) start with reference architecture; (1) identify scalable storage solution: DynamoDB; (2) identify scalable storage solution for large data items: DynamoDB + Amazon S3; (3) identify solutions for fast real time response & queries: DynamoDB + Amazon S3 + ElastiCache; (4) identify solutions for real time analysis of data
  42. 42. Amazon EMR was the way to go: 1. EMR was used to perform real time analysis of huge data sets, with results in minutes instead of days. 2. All small jobs are analyzed in-memory while big ones are sent to Amazon EMR. 3. Existing algorithms were overhauled to derive massive parallelism using the Hadoop map-reduce framework. 4. As large datasets were already in Amazon S3, Amazon S3 was used for input and output instead of HDFS; only intermediate map-reduce data is kept in HDFS. 5. The Amazon EMR cluster is created on demand and shut down when done.
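Points 2 and 3 above (route small jobs in-memory, express big ones as map-reduce) can be illustrated with a toy example. This is a sketch of the map-reduce shape only, run in a single process; it is not the speakers' algorithm or a Hadoop job. The record-count cut-off, the function names, and the sample computation (mean signal per gene) are all hypothetical, since the talk does not give them.

```python
from collections import defaultdict

# Hypothetical cut-off; the talk does not say where small jobs end.
SMALL_JOB_MAX_RECORDS = 100_000


def choose_engine(record_count):
    """Small jobs stay in memory; large ones would be submitted to a cluster."""
    return "in-memory" if record_count <= SMALL_JOB_MAX_RECORDS else "emr"


def map_phase(record):
    """Map: emit one (gene, signal) pair per raw record."""
    yield record["gene"], record["signal"]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(gene, signals):
    """Reduce: collapse one gene's signals into their mean."""
    return gene, sum(signals) / len(signals)


def mean_signal_per_gene(records):
    """Run map, shuffle and reduce in-process over an iterable of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(gene, values) for gene, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = [
        {"gene": "TP53", "signal": 2.0},
        {"gene": "TP53", "signal": 4.0},
        {"gene": "KRAS", "signal": 1.0},
    ]
    print(choose_engine(len(sample)), mean_signal_per_gene(sample))
```

Because each map call sees one record and each reduce call sees one key, both phases can be spread across many machines, which is the parallelism the overhauled algorithms rely on.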
  43. 43. Architecture with EMR for real time analysis A B User Client Internet DNS Routing : Route 53 AUTO SCALING WEB SERVERS AUTO SCALING APP SERVERS Load Balancers Load Balancers WEB SERVERS CDN: CloudFront APP SERVERS Auto Scaling DynamoDB Amazon S3 ElastiCache EMR
  44. 44. Iteration 4 MBs MBs GBs GBs Instrument Run (1000s) Patient Samples (1000s) Genes (1000s) Raw Signals (millions) Analysis Results (millions) Storage Query Analysis Performance ✔ ✔ ✔ Cost ✔ ✔ ✔ Get Insights    
  45. 45. Performance for a project [chart of minutes (axis 0 to 350) against # of records (90000, 180000, 270000, 360000, 450000, 675000, 900000); cloud series: 2, 4, 7, 11, 13, 20, 30; desktop series shown for comparison; annotations: >10x, crashes]
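The ">10x" label can be checked roughly against the numbers that survive from the two charts. The sketch below assumes that the seven desktop values on slide 39 and the seven cloud values on slide 45 line up, in order, with the seven record counts on the x-axis; that pairing is an inference from the extracted text, not something the slides state.

```python
# Values as extracted from slides 39 and 45; pairing them by position is an assumption.
records = [90_000, 180_000, 270_000, 360_000, 450_000, 675_000, 900_000]
desktop_minutes = [8, 20, 40, 80, 120, 200, 320]
cloud_minutes = [2, 4, 7, 11, 13, 20, 30]

# Speedup of cloud over desktop at each project size.
speedups = [d / c for d, c in zip(desktop_minutes, cloud_minutes)]

if __name__ == "__main__":
    for n, s in zip(records, speedups):
        print(f"{n:>7} records: {s:.1f}x")
```

Under that pairing the ratio grows with project size, from 4x at the smallest project to a little over 10x at the largest, which is consistent with the ">10x" annotation.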
  46. 46. Journey: (0) start with reference architecture; (1) identify scalable storage solution: DynamoDB; (2) identify scalable storage solution for large data items: DynamoDB + Amazon S3; (3) identify solutions for fast real time response & queries: DynamoDB + Amazon S3 + ElastiCache; (4) identify solutions for real time analysis of data: Amazon EMR ✓ ✓ ✓ ✓ ✓
  47. 47. Learnings
  48. 48. About me: Shakila Pothini, Senior Manager, Cloud Apps, Life Sciences Group, Thermo Fisher Scientific. Hiking is my ONLY stress buster. Entertain to Educate: cofounder of performing arts group (swaram.org). Mostly left brained with occasional sense of creativity
  49. 49. How to get into your gene? Sequence the entire human transcriptome (30,000 genes); identify significant genes (100+ genes); validate & reconfirm them (20+ genes); do it on more samples & different populations; find the way the genes interplay in the pathway; understand cancer diversity, types of therapy, drug-able genes.
  50. 50. Demo
  51. 51. Demo summary non cancerous sample cancerous sample difference in expression of genes
  52. 52. Customer feedback: “My initial SymphoniSuite evaluation experience was good, GUI/controls are intuitive and data upload/analysis was fast and user friendly” (UPENN). “I enjoy processing hundreds of open array plates with ease.”, “I appreciate the rapid access of the large number of amplification curves” (Sanofi). “I wanted to let you know that Symphoni has been working well for me. I have done analysis using as high as 500 files.” (ASU). “This I see value in... utilizing these features. I appreciate the speed of data processing and visuals.” (LUMC)
  53. 53. Yearly checkup today 165 / 105 120 50 / 90 104
  54. 54. Is this really going to detect early stages of cancer?
  55. 55. A few years from now : every person ATGCATGCTATCAATTGCCC Sequence melanoma health risks drug response powered by AWS lifecloud
  56. 56. Yearly check-ups a few years from now ATGCATGC ATTGCCC ATGCATGC ATTGCCC TATCA GCATG lifecloud ATGCATGCTATCAATTGCCC Sequence
  57. 57. Yearly check-ups a few years from now (cont’d) cancer any clinical trial? health risks drug response ATGCATGCTATCAATTGCCC Sequence lifecloud prescribe the right drug
  58. 58. Puneet Suri, Senior Director, Software Engineering, puneet.suri@thermofisher.com, T: 650.266.5857, @psuri; Shakila Pothini, Senior Manager, Cloud Applications, shakila.pothini@thermofisher.com; Salil Kumar, Cloud Architect, T: 650.740.1646, @salilkum
  59. 59. Collect / Ingest Kinesis Process / Analyze EMR EC2 Redshift Data Pipeline Visualize / Report Glacier S3 DynamoDB Store RDS Data Answers
  60. 60. Experiment 1 Data Access Compute Time Experiment 2 Experiment 3 Data Access Compute Time Data Access Compute Time ✔ ✔ ✔
  61. 61. EMR Cluster EC2 Instance Data Temperature
  62. 62. Please give us your feedback on this session. Complete session evaluations and earn re:Invent swag. http://bit.ly/awsevals Puneet Suri Senior Director, Software engineering puneet.suri@thermofisher.com T: 650.266.5857 @psuri Shakila Pothini Senior Manager, Cloud Applications shakila.pothini@thermofisher.com T: 650.554.2190 Salil Kumar Cloud Architect T: 650.740.1646 @salilkum
