Talk at Microsoft Cloud Futures 2010


My talk from Cloud Futures 2010, organized by MSR


Transcript

  • 1. Scientific Computing with Amazon Web Services. Deepak Singh. Cloud Futures 2010
  • 2. Via Reavel under a CC-BY-NC-ND license
  • 3. life science industry
  • 4. Credit: Bosco Ho
  • 5. By ~Prescott under a CC-BY-NC license
  • 6. data
  • 7. Image: Wikipedia
  • 8. Image via image editor under a CC-BY License
  • 9. Image: Matt Wood
  • 10. Image: NOAA
  • 11. gigabytes
  • 12. terabytes
  • 13. petabytes
  • 14. petabytes
  • 15. exabytes petabytes?
  • 16. really fast
  • 17. Image: http://www.broadinstitute.org/~apleite/photos.html
  • 18. data management
  • 19. data processing
  • 20. data sharing
  • 21. Image: Chris Dagdigian
  • 22. compute & storage limited
  • 23. amazon web services
  • 24. the cloud
  • 25. has_many :definitions
  • 26. infrastructure as a service
  • 27. Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling. Storage: Amazon Simple Storage Service (S3), AWS Import/Export. Database: Amazon RDS and SimpleDB.
  • 28. Messaging: Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS). Payments: Amazon Flexible Payments Service (FPS). On-Demand Workforce: Amazon Mechanical Turk. Parallel Processing: Amazon Elastic MapReduce. Content Delivery: Amazon CloudFront. Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling. Storage: Amazon Simple Storage Service (S3), AWS Import/Export. Database: Amazon RDS and SimpleDB.
  • 29. Tools: AWS Toolkit for Eclipse, AWS SDK for .NET. Isolated Networks: Amazon Virtual Private Cloud. Monitoring: Amazon CloudWatch. Management: AWS Management Console. Messaging: Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS). Payments: Amazon Flexible Payments Service (FPS). On-Demand Workforce: Amazon Mechanical Turk. Parallel Processing: Amazon Elastic MapReduce. Content Delivery: Amazon CloudFront. Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling. Storage: Amazon Simple Storage Service (S3), AWS Import/Export. Database: Amazon RDS and SimpleDB.
  • 30. Your Custom Applications and Services. Tools: AWS Toolkit for Eclipse, AWS SDK for .NET. Isolated Networks: Amazon Virtual Private Cloud. Monitoring: Amazon CloudWatch. Management: AWS Management Console. Messaging: Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS). Payments: Amazon Flexible Payments Service (FPS). On-Demand Workforce: Amazon Mechanical Turk. Parallel Processing: Amazon Elastic MapReduce. Content Delivery: Amazon CloudFront. Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling. Storage: Amazon Simple Storage Service (S3), AWS Import/Export. Database: Amazon RDS and SimpleDB.
  • 31. elasticity
  • 32. 3000 CPUs for one firm's risk management application; 300 CPUs on weekends [Chart: number of EC2 instances, Wednesday 4/22/2009 to Tuesday 4/28/2009]
  • 33. scalability
  • 34. > 1PB of data in S3
  • 35. highly available
  • 36. Image: Chris Dagdigian
  • 37. “Everything fails, all the time” -- Werner Vogels
  • 38. “Things will crash. Deal with it” -- Jeff Dean
  • 39. 2-4% of servers will die annually. Source: Jeff Dean, LADIS 2009
  • 40. 1-5% of disk drives will die every year. Source: Jeff Dean, LADIS 2009
  • 41. human errors
  • 42. human errors: ~20% of admin issues have unintended consequences. Source: James Hamilton
  • 43. scalable & available
  • 44. assume sw/hw failure
  • 45. design apps to be resilient
  • 46. automation & alarming
  • 47. US East Region: Availability Zone A, Availability Zone B, Availability Zone C, Availability Zone D
  • 48. elastic load balancing, CloudWatch, auto scaling, SNS, SQS, elastic IP, elastic block store
  • 49. cost effective
  • 50. cost effective: pay as you go
  • 51. on-demand instances, reserved instances, spot instances
  • 52. Amazon VPC Architecture: customer's isolated AWS resources in subnets (10.32.1.0/24, 10.32.2.0/24, 10.32.3.0/24) in the Amazon Web Services cloud; VPN gateway; secure VPN connection over the Internet; your network; external customers
  • 53. AWS + science = win
  • 54. 3.7 million classifications in just over three days; ~15 million in less than a month; >2.6 million clicks in 100 hours
  • 55. lots and lots and lots and lots and lots and lots of data and lots and lots of lots of data
  • 56. scalability & availability
  • 57. we are data geeks not data center geeks
  • 58. data management
  • 59. Shaq Image: Keith Allison under a CC-BY-SA license
  • 60. Shaq Image: Keith Allison under a CC-BY-SA license
  • 61. Shaq Image: Keith Allison under a CC-BY-SA license
  • 62. Shaq Image: Keith Allison under a CC-BY-SA license
  • 63. Shaq Image: Keith Allison under a CC-BY-SA license
  • 64. Biomarker Warehouse: pre-clinical, clinical, 3rd party data and publications. Estimated cost: 10 TB warehouse over 3 years [Chart: estimated cost]
  • 65. data processing
  • 66. http://cyclecomputing.com
  • 67. http://web.mit.edu/stardev/cluster/
  • 68. sudo gem install cloud-crowd http://cyclecomputing.com http://wiki.github.com/documentcloud/cloud-crowd
  • 69. http://www.rightscale.com
  • 70. XAFS http://leonardo.phys.washington.edu/feff/
  • 71. http://bioteam.net
  • 72. ASSEMBLING GENOMES: 140 million 454 reads. Image: Matt Wood
  • 73. BLAT @ U. PENN: Map 100 million 100-base paired-end reads. Quad core with 5 GB of RAM would take 16 days. 30 high-memory instances; 32 hours; $195
  • 74. HEAVY-ION COLLISIONS. Problem: Quark matter physics conference imminent but no compute resources handy. Solution: NIMBUS context broker allowed researchers to provision 300 nodes and get the simulations done.
  • 75. BELLE MONTE CARLO Credit: Tom Fifield
  • 76. disk read/writes slow & expensive
  • 77. data processing fast & cheap
  • 78. distribute the data; parallel reads
  • 79. data processing for the cloud
  • 80. distributed file system (HDFS)
  • 81. map/reduce
  • 82. Via Cloudera under a Creative Commons License
  • 83. Via Cloudera under a Creative Commons License
  • 84. http://www.cascading.org/
  • 85. apache pig http://hadoop.apache.org/pig/
  • 86. apache hive http://hadoop.apache.org/hive/
  • 87. work by @peteskomoroch
  • 88. High Throughput Sequence Analysis Mike Schatz, University of Maryland
  • 89. CloudBurst: catalog k-mers, collect seeds, end-to-end alignment. http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
  • 90. Crossbow: Rapid whole genome SNP analysis Ben Langmead http://bowtie-bio.sourceforge.net/crossbow/index.shtml
  • 91. Crossbow: Rapid whole genome SNP analysis. Preprocessed reads; Map: Bowtie; Sort: Bin and partition; Reduce: SoapSNP. Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL. Genome Biol 10(11): R134.
  • 92. Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster
  • 93. Scalable Genome Assembly Assembly of Large Genomes with Cloud Computing. http://contrail-bio.sourceforge.net Schatz MC, Sommer D, Kelley D, Pop M, et al. In Preparation.
  • 94. Amazon Elastic MapReduce [Diagram: deploy application via web console or command line tools; Elastic MapReduce runs Hadoop on Amazon EC2 instances; input dataset read from an input S3 bucket; output results written to an output S3 bucket; notify at end; get results from Amazon S3]
  • 95. data storage & distribution: public & private
  • 96. http://aws.amazon.com/publicdatasets/
  • 97. sharing and collaboration
  • 98. software distribution
  • 99. http://www.cloudbiolinux.com/
  • 100. http://usegalaxy.org/cloud
  • 101. application platforms
  • 102. http://heroku.com
  • 103. http://chempedia.com/
  • 104. Image: O’Reilly Radar
  • 105. business models
  • 106. to conclude
  • 107. built for scale
  • 108. built for availability
  • 109. shared dataspaces, global namespaces
  • 110. task-based resources
  • 111. new software architectures
  • 112. new computing platforms
  • 113. available today
  • 114. http://aws.amazon.com/education
  • 115. Thank you! deesingh@amazon.com Twitter: @mndoci. Presentation ideas from James Hamilton, @mza and @lessig
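
Slides 78 to 89 mention distributing the data, map/reduce, and CloudBurst's "catalog k-mers" step without showing any code. As a rough illustration only, the sketch below counts k-mers across a set of reads in a map/reduce style using plain Python on one machine. It is not CloudBurst's or Hadoop's actual implementation; the function names (map_kmers, shuffle, reduce_counts, count_kmers) and the sample reads are hypothetical and chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(read, k):
    """Map step: emit a (k-mer, 1) pair for every length-k window of one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values collected for each k-mer."""
    return {kmer: sum(values) for kmer, values in groups.items()}


def count_kmers(reads, k):
    """Run map, shuffle, and reduce in sequence over all reads (hypothetical driver)."""
    mapped = chain.from_iterable(map_kmers(read, k) for read in reads)
    return reduce_counts(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical sample reads, not taken from the talk.
    sample_reads = ["ACGTAC", "GTACGT"]
    for kmer, count in sorted(count_kmers(sample_reads, 3).items()):
        print(kmer, count)
```

Each map call depends on a single read, so on a real cluster the reads could be split across many workers and only the grouped k-mer counts would need to be brought together in the reduce step.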