Scientific Computing with Amazon Web Services
Deepak Singh
Cloud Futures 2010
Via Reavel under a CC-BY-NC-ND license
life science industry
Credit: Bosco Ho
By ~Prescott under a CC-BY-NC license
data
Image: Wikipedia
Image via image editor under a CC-BY License
Image: Matt Wood
Image: NOAA
gigabytes
terabytes
petabytes
petabytes
exabytes
petabytes?
really fast
Image: http://www.broadinstitute.org/~apleite/photos.html
data management
data processing
data sharing
Image: Chris Dagdigian
compute & storage limited
amazon web services
the cloud
has_many :definitions
infrastructure as a service
[Diagram: AWS service stack, built up layer by layer]
Compute, Storage, Database
Messaging, Payments, On-Demand Workforce, Parallel Processing, Content Delivery
Tools, Isolated Networks, Monitoring, Management
Your Custom Applications and Services
elasticity
3000 CPUs for one firm’s risk management application
[Chart: Number of EC2 Instances]
scalability
> 1PB of data in S3
highly available
Image: Chris Dagdigian
“Everything fails, all the time”
                   -- Werner Vogels
“Things will crash. Deal with it”
                        -- Jeff Dean
2-4% of servers
                                will die annually



Source: Jeff Dean, LADIS 2009
1-5% of disk drives
                                 will die every year



Source: Jeff Dean, LADIS 2009
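To make the two failure-rate slides concrete, here is a minimal arithmetic sketch. The rates (2-4% of servers, 1-5% of disk drives per year) are the ones quoted from Jeff Dean; the fleet sizes `SERVERS` and `DISKS` are hypothetical numbers chosen only for illustration.

```python
def expected_failures(fleet_size, low_rate, high_rate):
    """Return the (low, high) expected number of failures per year."""
    return fleet_size * low_rate, fleet_size * high_rate


# Hypothetical fleet sizes, not taken from the talk.
SERVERS = 1_000
DISKS = 4_000

# Annual failure rates quoted on the slides (Jeff Dean, LADIS 2009).
server_low, server_high = expected_failures(SERVERS, 0.02, 0.04)
disk_low, disk_high = expected_failures(DISKS, 0.01, 0.05)

print(f"servers lost per year: {server_low:.0f} to {server_high:.0f}")
print(f"disks lost per year:   {disk_low:.0f} to {disk_high:.0f}")
```

Even at this modest, assumed scale, hardware failure is a routine event rather than an exception, which is the point the slides are making.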
human errors
human errors
             ~20% admin issues have unintended consequences




Source: James Hamilton
scalable & available
assume sw/hw failure
design apps to be resilient
automation & alarming
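One common way to "assume failure" in application code is to retry transient errors with exponential backoff. The sketch below is an illustration of that general pattern, not something shown in the talk; `call_with_retries` and `make_flaky` are hypothetical helper names, and the simulated failure stands in for any unreliable remote call.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            delay = base_delay * 2 ** (attempt - 1)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay))


def make_flaky(failures_before_success):
    """Build a stand-in operation that fails a fixed number of times first."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise ConnectionError("simulated transient failure")
        return f"ok after {state['calls']} calls"

    return operation


# sleep is replaced with a no-op so the demonstration runs instantly.
result = call_with_retries(make_flaky(2), sleep=lambda seconds: None)
print(result)
```

Passing `sleep` as a parameter is a design choice that keeps the helper easy to test; real code would normally leave the default in place.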
US East Region
[Diagram: Availability Zones A, B, C, and D]
elastic load balancing


                           CloudWatch
auto scaling

                              SNS
           ...
cost effective
pay as you go
cost effective
on-demand instances
 reserved instances
   spot instances
AMAZON VPC ARCHITECTURE
                                                                                  Cust...
AWS + science = win
3.7 million classifications in just over three days
~15 million in less than a month
>2.6 million clicks in 100 hours
lots and lots and lots and lots
 and lots and lots of data and
  lots and lots of lots of data
scalability & availability
we are data geeks not data center geeks
data management
Shaq Image: Keith Allison under a CC-BY-SA license
Shaq Image: Keith Allison under a CC-BY-SA license
Shaq Image: Keith Allison under a CC-BY-SA license
Shaq Image: Keith Allison under a CC-BY-SA license
Shaq Image: Keith Allison under a CC-BY-SA license
Biomarker Warehouse
pre-clinical, clinical, 3rd party data and publications
[Chart: cost comparison including Cloud (AWS)]
data processing
http://cyclecomputing.com
http://web.mit.edu/stardev/cluster/
sudo gem install cloud-crowd

     http://cyclecomputing.com
http://wiki.github.com/documentcloud/cloud-crowd
http://www.rightscale.com
XAFS




http://leonardo.phys.washington.edu/feff/
http://bioteam.net
ASSEMBLING GENOMES
140 million 454 reads
Image: Matt Wood
BLAT @ U. PENN
Map 100 million, 100 base paired end reads
Quad core with 5 GB of RAM would take 16 days




30 high-memory...
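The figures on this slide (16 days on one quad-core machine versus 30 high-memory instances, 32 hours, $195, as given in full in the slide transcript) can be worked through directly. The sketch below only rearranges those numbers; the per-instance-hour cost is derived under the assumption that all 30 instances ran for the whole 32 hours, which the slide does not state.

```python
# Figures quoted on the slide.
SINGLE_MACHINE_DAYS = 16
CLUSTER_INSTANCES = 30
CLUSTER_HOURS = 32
CLUSTER_COST_USD = 195

single_machine_hours = SINGLE_MACHINE_DAYS * 24       # 384 hours
speedup = single_machine_hours / CLUSTER_HOURS        # 12.0x less wall-clock time

# Assumption: every instance was billed for the full run.
instance_hours = CLUSTER_INSTANCES * CLUSTER_HOURS    # 960 instance-hours
cost_per_instance_hour = CLUSTER_COST_USD / instance_hours

print(f"wall-clock speedup: {speedup:.0f}x")
print(f"instance-hours used: {instance_hours}")
print(f"implied cost per instance-hour: ${cost_per_instance_hour:.3f}")
```

The cluster uses more total machine time than the single box (960 versus 384 hours), but the result arrives in well under two days instead of more than two weeks.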
HEAVY-ION COLLISIONS

Problem: Quark matter physics conference
imminent but no compute resources handy

Solution: NIMBUS c...
BELLE MONTE CARLO




Credit: Tom Fifield
disk read/writes
slow & expensive
data processing
 fast & cheap
distribute the data
   parallel reads
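A toy illustration of "distribute the data, parallel reads": the data set is split into chunk files and the chunks are read concurrently. Everything here runs on one machine in a temporary directory, so it only sketches the idea; the chunk size, worker count, and helper names are assumptions, and a real system would spread the chunks over many disks or nodes.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_chunks(directory, records, chunk_size):
    """Split records into fixed-size chunk files and return their paths."""
    paths = []
    for start in range(0, len(records), chunk_size):
        path = directory / f"chunk_{start // chunk_size:04d}.txt"
        path.write_text("\n".join(records[start:start + chunk_size]))
        paths.append(path)
    return paths


def count_records(path):
    """Read one chunk and return how many records it holds."""
    return len(path.read_text().splitlines())


with tempfile.TemporaryDirectory() as tmp:
    records = [f"read_{n}" for n in range(1000)]
    chunk_paths = write_chunks(Path(tmp), records, chunk_size=100)
    # Each worker reads a different chunk, so the reads overlap in time.
    with ThreadPoolExecutor(max_workers=4) as pool:
        per_chunk = list(pool.map(count_records, chunk_paths))

total_records = sum(per_chunk)
print(f"{len(per_chunk)} chunks, {total_records} records")
```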
data processing for the cloud
distributed file system
        (HDFS)
map/reduce
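For readers new to the model, here is a minimal single-process sketch of the map, shuffle, and reduce phases. It counts k-mers because the CloudBurst slide later lists "Catalog k-mers" as a stage, but it is not CloudBurst's code and does not use Hadoop; the function names and the two toy reads are made up for illustration.

```python
from collections import defaultdict


def map_kmers(read, k):
    """Map phase: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce phase: collapse the values for one key into a total."""
    return key, sum(values)


reads = ["ACGTAC", "GTACGT"]  # toy input
mapped = [pair for read in reads for pair in map_kmers(read, 3)]
kmer_counts = dict(reduce_counts(key, values) for key, values in shuffle(mapped).items())
print(kmer_counts)
```

In a framework such as Hadoop the map and reduce calls run on many machines and the shuffle moves data between them; the three functions keep the same shape.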
Via Cloudera under a Creative Commons License
Via Cloudera under a Creative Commons License
http://www.cascading.org/
apache pig



http://hadoop.apache.org/pig/
apache hive



 http://hadoop.apache.org/hive/
work by @peteskomoroch
High Throughput Sequence Analysis
Mike Schatz, University of Maryland
CloudBurst




Catalog k-mers                          Collect seeds                          End-to-end alignment



    ...
Crossbow: Rapid whole
 genome SNP analysis

                                                           Ben Langmead




  ...
Crossbow: Rapid whole genome SNP analysis


                             Preprocessed reads


                            ...
Crossbow condenses over 1,000 hours of resequencing computation into a few hour...
Scalable Genome Assembly




                                      Assembly of Large Genomes with Cloud Computing.
http://...
Amazon Elastic MapReduce


                                      Amazon EC2 Instances
                                    ...
data storage & distribution
         public & private
http://aws.amazon.com/publicdatasets/
sharing and collaboration
software distribution
http://www.cloudbiolinux.com/
http://usegalaxy.org/cloud
application platforms
http://heroku.com
http://chempedia.com/
Image: O’Reilly Radar
business models
to conclude
built for scale
built for availability
shared dataspaces
global namespaces
task-based resources
new software architectures
new computing platforms
available today
http://aws.amazon.com/education
Thank you!
deesingh@amazon.com
Twitter: @mndoci
Presentation ideas from James Hamilton, @m...
Talk at Microsoft Cloud Futures 2010
My talk from Cloud Futures 2010, organized by MSR

Published in: Technology

  1. 1. Scientific Computing with Amazon Web Services, Deepak Singh, Cloud Futures 2010
  2. 2. Via Reavel under a CC-BY-NC-ND license
  3. 3. life science industry
  4. 4. Credit: Bosco Ho
  5. 5. By ~Prescott under a CC-BY-NC license
  6. 6. data
  7. 7. Image: Wikipedia
  8. 8. Image via image editor under a CC-BY License
  9. 9. Image: Matt Wood
  10. 10. Image: NOAA
  11. 11. gigabytes
  12. 12. terabytes
  13. 13. petabytes
  14. 14. petabytes
  15. 15. exabytes petabytes?
  16. 16. really fast
  17. 17. Image: http://www.broadinstitute.org/~apleite/photos.html
  18. 18. data management
  19. 19. data processing
  20. 20. data sharing
  21. 21. Image: Chris Dagdigian
  22. 22. compute & storage limited
  23. 23. amazon web services
  24. 24. the cloud
  25. 25. has_many :definitions
  26. 26. infrastructure as a service
  27. 27. Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling | Storage: Amazon Simple Storage Service (S3), AWS Import/Export | Database: Amazon RDS and SimpleDB
  28. 28. Messaging: Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS) | Payments: Amazon Flexible Payments Service (FPS) | On-Demand Workforce: Amazon Mechanical Turk | Parallel Processing: Amazon Elastic MapReduce | Content Delivery: Amazon CloudFront | Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling | Storage: Amazon Simple Storage Service (S3), AWS Import/Export | Database: Amazon RDS and SimpleDB
  29. 29. Tools: AWS Toolkit for Eclipse, AWS SDK for .NET | Isolated Networks: Amazon Virtual Private Cloud | Monitoring: Amazon CloudWatch | Management: AWS Management Console | Messaging: Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS) | Payments: Amazon Flexible Payments Service (FPS) | On-Demand Workforce: Amazon Mechanical Turk | Parallel Processing: Amazon Elastic MapReduce | Content Delivery: Amazon CloudFront | Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling | Storage: Amazon Simple Storage Service (S3), AWS Import/Export | Database: Amazon RDS and SimpleDB
  30. 30. Your Custom Applications and Services | Tools: AWS Toolkit for Eclipse, AWS SDK for .NET | Isolated Networks: Amazon Virtual Private Cloud | Monitoring: Amazon CloudWatch | Management: AWS Management Console | Messaging: Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS) | Payments: Amazon Flexible Payments Service (FPS) | On-Demand Workforce: Amazon Mechanical Turk | Parallel Processing: Amazon Elastic MapReduce | Content Delivery: Amazon CloudFront | Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling | Storage: Amazon Simple Storage Service (S3), AWS Import/Export | Database: Amazon RDS and SimpleDB
  31. 31. elasticity
  32. 32. 3000 CPUs for one firm’s risk management application [Chart: Number of EC2 Instances by day, Wednesday 4/22/2009 through Tuesday 4/28/2009; axis marks at 3000 and 300; annotation: 300 CPUs on weekends]
  33. 33. scalability
  34. 34. > 1PB of data in S3
  35. 35. highly available
  36. 36. Image: Chris Dagdigian
  37. 37. “Everything fails, all the time” -- Werner Vogels
  38. 38. “Things will crash. Deal with it” -- Jeff Dean
  39. 39. 2-4% of servers will die annually Source: Jeff Dean, LADIS 2009
  40. 40. 1-5% of disk drives will die every year Source: Jeff Dean, LADIS 2009
  41. 41. human errors
  42. 42. human errors ~20% admin issues have unintended consequences Source: James Hamilton
  43. 43. scalable & available
  44. 44. assume sw/hw failure
  45. 45. design apps to be resilient
  46. 46. automation & alarming
  47. 47. US East Region [diagram]: Availability Zone A, Availability Zone B, Availability Zone C, Availability Zone D
  48. 48. elastic load balancing CloudWatch auto scaling SNS SQS elastic IP elastic block store
  49. 49. cost effective
  50. 50. pay as you go; cost effective
  51. 51. on-demand instances reserved instances spot instances
  52. 52. AMAZON VPC ARCHITECTURE [diagram]: Customer’s isolated AWS resources in subnets 10.32.1.0/24, 10.32.2.0/24, and 10.32.3.0/24 inside the Amazon Web Services Cloud; VPN Gateway; Secure VPN Connection over the Internet; Your Network; External Customers
  53. 53. AWS + science = win
  54. 54. 3.7 million classifications in just over three days ~15 million in less than a month >2.6 million clicks in 100 hours
  55. 55. lots and lots and lots and lots and lots and lots of data and lots and lots of lots of data
  56. 56. scalability & availability
  57. 57. we are data geeks not data center geeks
  58. 58. data management
  59. 59. Shaq Image: Keith Allison under a CC-BY-SA license
  60. 60. Shaq Image: Keith Allison under a CC-BY-SA license
  61. 61. Shaq Image: Keith Allison under a CC-BY-SA license
  62. 62. Shaq Image: Keith Allison under a CC-BY-SA license
  63. 63. Shaq Image: Keith Allison under a CC-BY-SA license
  64. 64. Biomarker Warehouse: pre-clinical, clinical, 3rd party data and publications [Chart: Estimated cost: 10 TB warehouse over 3 years, comparing Cloud (AWS), Managed Service, and In House, broken down into Servers, Storage, and Overhead; the dollar amounts are not legible in this copy]
  65. 65. data processing
  66. 66. http://cyclecomputing.com
  67. 67. http://web.mit.edu/stardev/cluster/
  68. 68. sudo gem install cloud-crowd http://cyclecomputing.com http://wiki.github.com/documentcloud/cloud-crowd
  69. 69. http://www.rightscale.com
  70. 70. XAFS http://leonardo.phys.washington.edu/feff/
  71. 71. http://bioteam.net
  72. 72. ASSEMBLING GENOMES 140 million 454 reads Image: Matt Wood
  73. 73. BLAT @ U. PENN: Map 100 million 100-base paired-end reads. A quad core with 5 GB of RAM would take 16 days. 30 high-memory instances; 32 hours; $195
  74. 74. HEAVY-ION COLLISIONS Problem: Quark matter physics conference imminent but no compute resources handy Solution: NIMBUS context broker allowed researchers to provision 300 nodes and get the simulations done
  75. 75. BELLE MONTE CARLO Credit: Tom Fifield
  76. 76. disk read/writes slow & expensive
  77. 77. data processing fast & cheap
  78. 78. distribute the data parallel reads
  79. 79. data processing for the cloud
  80. 80. distributed file system (HDFS)
  81. 81. map/reduce
  82. 82. Via Cloudera under a Creative Commons License
  83. 83. Via Cloudera under a Creative Commons License
  84. 84. http://www.cascading.org/
  85. 85. apache pig http://hadoop.apache.org/pig/
  86. 86. apache hive http://hadoop.apache.org/hive/
  87. 87. work by @peteskomoroch
  88. 88. High Throughput Sequence Analysis Mike Schatz, University of Maryland
  89. 89. CloudBurst Catalog k-mers Collect seeds End-to-end alignment http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
  90. 90. Crossbow: Rapid whole genome SNP analysis Ben Langmead http://bowtie-bio.sourceforge.net/crossbow/index.shtml
  91. 91. Crossbow: Rapid whole genome SNP analysis Preprocessed reads Map: Bowtie Sort: Bin and partition Reduce: SoapSNP Langmead B, Schatz MC, Lin, J, Pop M, Salzberg SL. Genome Biol 10(11): R134.
  92. 92. Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster
  93. 93. Scalable Genome Assembly Assembly of Large Genomes with Cloud Computing. http://contrail-bio.sourceforge.net Schatz MC, Sommer D, Kelley D, Pop M, et al. In Preparation.
  94. 94. Amazon Elastic MapReduce [diagram labels]: Input Data; Input dataset in the Input S3 bucket (Amazon S3); Deploy Application via Web Console, Command line tools; Elastic MapReduce running Hadoop on Amazon EC2 Instances; Output results in the Output S3 bucket (Amazon S3); End; Notify; Get Results
  95. 95. data storage & distribution public & private
  96. 96. http://aws.amazon.com/publicdatasets/
  97. 97. sharing and collaboration
  98. 98. software distribution
  99. 99. http://www.cloudbiolinux.com/
  100. 100. http://usegalaxy.org/cloud
  101. 101. application platforms
  102. 102. http://heroku.com
  103. 103. http://chempedia.com/
  104. 104. Image: O’Reilly Radar
  105. 105. business models
  106. 106. to conclude
  107. 107. built for scale
  108. 108. built for availability
  109. 109. shared dataspaces global namespaces
  110. 110. task-based resources
  111. 111. new software architectures
  112. 112. new computing platforms
  113. 113. available today
  114. 114. http://aws.amazon.com/education
  115. 115. Thank you! deesingh@amazon.com Twitter: @mndoci Presentation ideas from James Hamilton, @mza and @lessig
