Talk at Microsoft Cloud Futures 2010

My talk from Cloud Futures 2010, organized by MSR

  1. 1. Scientific Computing with Amazon Web Services Deepak Singh Cloud Futures 2010
  2. 2. Via Reavel under a CC-BY-NC-ND license
  3. 3. life science industry
  4. 4. Credit: Bosco Ho
  5. 5. By ~Prescott under a CC-BY-NC license
  6. 6. data
  7. 7. Image: Wikipedia
  8. 8. Image via image editor under a CC-BY License
  9. 9. Image: Matt Wood
  10. 10. Image: NOAA
  11. 11. gigabytes
  12. 12. terabytes
  13. 13. petabytes
  14. 14. petabytes
  15. 15. petabytes, exabytes?
  16. 16. really fast
  17. 17. Image: http://www.broadinstitute.org/~apleite/photos.html
  18. 18. data management
  19. 19. data processing
  20. 20. data sharing
  21. 21. Image: Chris Dagdigian
  22. 22. compute & storage limited
  23. 23. amazon web services
  24. 24. the cloud
  25. 25. has_many :definitions
  26. 26. infrastructure as a service
  27. 27. Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling. Storage: Amazon Simple Storage Service (S3), AWS Import/Export. Database: Amazon RDS and SimpleDB.
  28. 28. Messaging: Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS). Payments: Amazon Flexible Payments Service (FPS). On-Demand Workforce: Amazon Mechanical Turk. Parallel Processing: Amazon Elastic MapReduce. Content Delivery: Amazon CloudFront. Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling. Storage: Amazon Simple Storage Service (S3), AWS Import/Export. Database: Amazon RDS and SimpleDB.
  29. 29. Tools: AWS Toolkit for Eclipse, AWS SDK for .NET. Isolated Networks: Amazon Virtual Private Cloud. Monitoring: Amazon CloudWatch. Management: AWS Management Console. Messaging: Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS). Payments: Amazon Flexible Payments Service (FPS). On-Demand Workforce: Amazon Mechanical Turk. Parallel Processing: Amazon Elastic MapReduce. Content Delivery: Amazon CloudFront. Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling. Storage: Amazon Simple Storage Service (S3), AWS Import/Export. Database: Amazon RDS and SimpleDB.
  30. 30. Your Custom Applications and Services. Tools: AWS Toolkit for Eclipse, AWS SDK for .NET. Isolated Networks: Amazon Virtual Private Cloud. Monitoring: Amazon CloudWatch. Management: AWS Management Console. Messaging: Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS). Payments: Amazon Flexible Payments Service (FPS). On-Demand Workforce: Amazon Mechanical Turk. Parallel Processing: Amazon Elastic MapReduce. Content Delivery: Amazon CloudFront. Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling. Storage: Amazon Simple Storage Service (S3), AWS Import/Export. Database: Amazon RDS and SimpleDB.
  31. 31. elasticity
  32. 32. 3000 CPUs for one firm's risk management application [Chart: number of EC2 instances by day, Wednesday 4/22/2009 through Tuesday 4/28/2009; y-axis marks at 3000 and 300; annotation: 300 CPUs on weekends]
  33. 33. scalability
  34. 34. > 1PB of data in S3
  35. 35. highly available
  36. 36. Image: Chris Dagdigian
  37. 37. “Everything fails, all the time” -- Werner Vogels
  38. 38. “Things will crash. Deal with it” -- Jeff Dean
  39. 39. 2-4% of servers will die annually Source: Jeff Dean, LADIS 2009
  40. 40. 1-5% of disk drives will die every year Source: Jeff Dean, LADIS 2009
  41. 41. human errors
  42. 42. human errors ~20% admin issues have unintended consequences Source: James Hamilton
  43. 43. scalable & available
  44. 44. assume sw/hw failure
  45. 45. design apps to be resilient
  46. 46. automation & alarming
  47. 47. US East Region: Availability Zone A, Availability Zone B, Availability Zone C, Availability Zone D
  48. 48. elastic load balancing CloudWatch auto scaling SNS SQS elastic IP elastic block store
  49. 49. cost effective
  50. 50. cost effective: pay as you go
  51. 51. on-demand instances reserved instances spot instances
  52. 52. AMAZON VPC ARCHITECTURE: customer's isolated AWS resources in subnets 10.32.1.0/24, 10.32.2.0/24 and 10.32.3.0/24 within the Amazon Web Services Cloud; VPN Gateway; secure VPN connection over the Internet; Your Network; External Customers
  53. 53. AWS + science = win
  54. 54. 3.7 million classifications in just over three days ~15 million in less than a month >2.6 million clicks in 100 hours
  55. 55. lots and lots and lots and lots and lots and lots of data and lots and lots of lots of data
  56. 56. scalability & availability
  57. 57. we are data geeks not data center geeks
  58. 58. data management
  59. 59. Shaq Image: Keith Allison under a CC-BY-SA license
  60. 60. Shaq Image: Keith Allison under a CC-BY-SA license
  61. 61. Shaq Image: Keith Allison under a CC-BY-SA license
  62. 62. Shaq Image: Keith Allison under a CC-BY-SA license
  63. 63. Shaq Image: Keith Allison under a CC-BY-SA license
  64. 64. Biomarker Warehouse: pre-clinical, clinical, 3rd party data and publications. Estimated cost: 10 TB warehouse over 3 years [Chart: cost comparison]
  65. 65. data processing
  66. 66. http://cyclecomputing.com
  67. 67. http://web.mit.edu/stardev/cluster/
  68. 68. sudo gem install cloud-crowd http://cyclecomputing.com http://wiki.github.com/documentcloud/cloud-crowd
  69. 69. http://www.rightscale.com
  70. 70. XAFS http://leonardo.phys.washington.edu/feff/
  71. 71. http://bioteam.net
  72. 72. ASSEMBLING GENOMES 140 million 454 reads Image: Matt Wood
  73. 73. BLAT @ U. PENN: Map 100 million 100-base paired-end reads. A quad core with 5 GB of RAM would take 16 days. 30 high-memory instances; 32 hours; $195
  74. 74. HEAVY-ION COLLISIONS Problem: Quark matter physics conference imminent but no compute resources handy Solution: NIMBUS context broker allowed researchers to provision 300 nodes and get the simulations done
  75. 75. BELLE MONTE CARLO Credit: Tom Fifield
  76. 76. disk read/writes slow & expensive
  77. 77. data processing fast & cheap
  78. 78. distribute the data parallel reads
  79. 79. data processing for the cloud
  80. 80. distributed file system (HDFS)
  81. 81. map/reduce
  82. 82. Via Cloudera under a Creative Commons License
  83. 83. Via Cloudera under a Creative Commons License
  84. 84. http://www.cascading.org/
  85. 85. apache pig http://hadoop.apache.org/pig/
  86. 86. apache hive http://hadoop.apache.org/hive/
  87. 87. work by @peteskomoroch
  88. 88. High Throughput Sequence Analysis Mike Schatz, University of Maryland
  89. 89. CloudBurst: catalog k-mers; collect seeds; end-to-end alignment. http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
  90. 90. Crossbow: Rapid whole genome SNP analysis Ben Langmead http://bowtie-bio.sourceforge.net/crossbow/index.shtml
  91. 91. Crossbow: Rapid whole genome SNP analysis. Preprocessed reads; Map: Bowtie; Sort: Bin and partition; Reduce: SoapSNP. Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL. Genome Biol 10(11): R134.
  92. 92. Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster
  93. 93. Scalable Genome Assembly Assembly of Large Genomes with Cloud Computing. http://contrail-bio.sourceforge.net Schatz MC, Sommer D, Kelley D, Pop M, et al. In Preparation.
  94. 94. Amazon Elastic MapReduce: input data goes to an input S3 bucket in Amazon S3; deploy application via web console or command line tools; Elastic MapReduce runs Hadoop on Amazon EC2 instances; results go to an output S3 bucket; notify at end; get results
  95. 95. data storage & distribution public & private
  96. 96. http://aws.amazon.com/publicdatasets/
  97. 97. sharing and collaboration
  98. 98. software distribution
  99. 99. http://www.cloudbiolinux.com/
  100. 100. http://usegalaxy.org/cloud
  101. 101. application platforms
  102. 102. http://heroku.com
  103. 103. http://chempedia.com/
  104. 104. Image: O’Reilly Radar
  105. 105. business models
  106. 106. to conclude
  107. 107. built for scale
  108. 108. built for availability
  109. 109. shared dataspaces global namespaces
  110. 110. task-based resources
  111. 111. new software architectures
  112. 112. new computing platforms
  113. 113. available today
  114. 114. http://aws.amazon.com/education
  115. 115. Thank you! deesingh@amazon.com Twitter: @mndoci Presentation ideas from James Hamilton, @mza and @lessig
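
Slides 78 to 81 and 91 describe distributing the data and then running map, sort/partition, and reduce steps. The pattern can be sketched in a single Python process as below. This is only an illustration under stated assumptions: it counts k-mers (a nod to slide 89), and the function names `map_read`, `shuffle`, `reduce_counts`, `count_kmers` and the sample reads are hypothetical. None of it is taken from CloudBurst, Crossbow, or Hadoop.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-length window of one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Sort/partition step: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_kmers(reads, k):
    """Run map, shuffle, and reduce in sequence over a list of reads."""
    mapped = chain.from_iterable(map_read(read, k) for read in reads)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical sample reads, for illustration only.
    counts = count_kmers(["ACGTAC", "GTACGT"], 3)
    for kmer in sorted(counts):
        print(kmer, counts[kmer])
```

In a real cluster the map calls would run on separate nodes, each close to its share of the data, and the shuffle would move pairs across the network. Here every step runs locally so that the data flow is easy to follow.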
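
The BLAT figures on slide 73 (30 instances, 32 hours, $195, versus 16 days on a single machine) can be turned into a few derived numbers. The short Python sketch below does only that arithmetic. The function name `blat_summary` is hypothetical, and the per-instance-hour cost assumes the $195 total is spread evenly over every instance-hour, which the slide does not state.

```python
def blat_summary(instances=30, hours=32, total_cost_usd=195.0, single_machine_days=16):
    """Derive instance-hours, average cost per instance-hour, and wall-clock speedup.

    Defaults are the figures quoted on slide 73; the derived values are
    plain arithmetic on them, not numbers from the talk.
    """
    instance_hours = instances * hours
    cost_per_instance_hour = total_cost_usd / instance_hours
    speedup = (single_machine_days * 24) / hours
    return instance_hours, cost_per_instance_hour, speedup


if __name__ == "__main__":
    instance_hours, cost_per_hour, speedup = blat_summary()
    print(f"{instance_hours} instance-hours")
    print(f"${cost_per_hour:.3f} per instance-hour on average")
    print(f"{speedup:.0f}x faster in wall-clock time than the single machine")
```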
