Talk at NCRR P41 Director's Meeting

Invited talk given at the NCRR P41 Director's meeting on October 12, 2010.


Transcript

  1. Amazon Web Services: a platform for life science research. Deepak Singh, Ph.D., Amazon Web Services. NCRR P41 PI meeting, October 2010
  2. the new reality
  3. lots and lots and lots and lots and lots of data
  4. lots and lots and lots and lots and lots of people
  5. lots and lots and lots and lots and lots of places
  6. constant change
  7. science in a new reality
  8. science in a new reality ^
  9. data science in a new reality ^
  10. Image: Drew Conway
  11. goal
  12. optimize the most valuable resource
  13. compute, storage, workflows, memory, transmission, algorithms, cost, …
  14. people (Credit: Pieter Musterd, under a CC-BY-NC-ND license)
  15. enter the cloud
  16. what is the cloud?
  17. infrastructure
  18. scalable
  19. 3000 CPUs for one firm's risk management application [chart: number of EC2 instances per day, 4/22/2008 through 4/28/2008; around 300 instances on weekends, scaling up to 3,000]
  20. highly available
  21. [diagram: US East Region, containing Availability Zones A, B, C, and D]
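
Slide 21's zones are directly addressable in the API. As a hedged illustration, not from the talk itself: a minimal boto sketch of spreading instances across Availability Zones, where the AMI ID is a placeholder:

      import boto

      # credentials come from the environment or ~/.boto
      conn = boto.connect_ec2()

      # one instance in each of two zones, so the loss of a single
      # zone does not take down the whole service
      for zone in ["us-east-1a", "us-east-1b"]:
          reservation = conn.run_instances(
              "ami-12345678",              # placeholder AMI ID
              instance_type="m1.large",
              placement=zone)              # pin this launch to the zone
          print "launched", reservation.instances[0].id, "in", zone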
  22. durable
  23. 99.999999999%
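
The eleven nines on slide 23 are S3's design durability for stored objects. A minimal boto sketch of putting data under that guarantee; the bucket and key names are invented for illustration:

      import boto

      conn = boto.connect_s3()

      # durability applies per object once the PUT has succeeded
      bucket = conn.create_bucket("my-lab-sequencing-data")   # hypothetical bucket
      key = bucket.new_key("runs/2010-10-12/reads.fastq")     # hypothetical key
      key.set_contents_from_filename("reads.fastq")
      print "stored", key.name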
  24. dynamic
  25. extensible
  26. secure
  27. a utility
  28. on-demand instances, reserved instances, spot instances
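
All three purchasing models on slide 28 are reachable from the same API. A hedged boto sketch, with the AMI ID and bid price as placeholders rather than figures from the talk:

      import boto

      conn = boto.connect_ec2()

      # on-demand: pay the list price, get capacity when you ask for it
      conn.run_instances("ami-12345678", instance_type="m1.large")

      # spot: bid a price; instances run while the market price stays
      # below the bid, a good fit for interruptible batch analyses
      conn.request_spot_instances(
          price="0.10",                # placeholder bid, USD per hour
          image_id="ami-12345678",
          count=10,
          instance_type="m1.large")

      # reserved instances are a billing construct: buy one up front and
      # later on-demand launches of that type draw down against it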
  29. infrastructure as code
  30. (Ruby example: a class wrapping an EC2 instance)

      class Instance
        attr_accessor :aws_hash, :elastic_ip

        def initialize(hash, elastic_ip = nil)
          @aws_hash = hash
          @elastic_ip = elastic_ip
        end

        def public_dns
          @aws_hash[:dns_name] || ""
        end

        def friendly_name
          public_dns.empty? ? status.capitalize : public_dns.split(".")[0]
        end

        def id
          @aws_hash[:aws_instance_id]
        end
      end
  31. (Chef recipe: installing Apache and Passenger)

      include_recipe "packages"
      include_recipe "ruby"
      include_recipe "apache2"

      if platform?("centos", "redhat")
        if dist_only?
          # just the gem, we'll install the apache module within apache2
          package "rubygem-passenger"
          return
        else
          package "httpd-devel"
        end
      else
        %w{ apache2-prefork-dev libapr1-dev }.each do |pkg|
          package pkg do
            action :upgrade
          end
        end
      end

      gem_package "passenger" do
        version node[:passenger][:version]
      end

      execute "passenger_module" do
        # answer the installer's interactive prompts with four newlines
        command 'echo -en "\n\n\n\n" | passenger-install-apache2-module'
        creates node[:passenger][:module_path]
      end
  32. (Python/boto example: a wordcount job on Elastic MapReduce)

      import boto
      import boto.emr
      from boto.emr.step import StreamingStep
      from boto.emr.bootstrap_action import BootstrapAction
      import time

      # set your aws keys and S3 bucket, e.g. from environment or .boto
      AWSKEY = ""
      SECRETKEY = ""
      S3_BUCKET = ""
      NUM_INSTANCES = 1

      # connect to Elastic MapReduce
      conn = boto.connect_emr(AWSKEY, SECRETKEY)

      # install packages
      bootstrap_step = BootstrapAction("download.tst",
          "s3://elasticmapreduce/bootstrap-actions/download.sh", None)

      # set up mappers & reducers
      step = StreamingStep(name='Wordcount',
          mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
          cache_files=["s3n://" + S3_BUCKET + "/boto.mod#boto.mod"],
          reducer='aggregate',
          input='s3n://elasticmapreduce/samples/wordcount/input',
          output='s3n://' + S3_BUCKET + '/output/wordcount_output')

      jobid = conn.run_jobflow(
          name="testbootstrap",
          log_uri="s3://" + S3_BUCKET + "/logs",
          steps=[step],
          bootstrap_actions=[bootstrap_step],
          num_instances=NUM_INSTANCES)

      print "finished spawning job (note: starting still takes time)"

      # poll the job state until the flow completes
      state = conn.describe_jobflow(jobid).state
      print "job state = ", state
      print "job id = ", jobid
      while state != u'COMPLETED':
          print time.localtime()
          time.sleep(30)
          state = conn.describe_jobflow(jobid).state
          print "job state = ", state
          print "job id = ", jobid

      # TIMESTAMP must be defined before this point
      print "final output can be found in s3://" + S3_BUCKET + "/output" + TIMESTAMP
      print "try: $ s3cmd sync s3://" + S3_BUCKET + "/output" + TIMESTAMP + " ."
  33. a data science platform
  34. dataspaces (Further reading: Jeff Hammerbacher, “Information Platforms and the Rise of the Data Scientist,” in Beautiful Data)
  35. accept all data formats
  36. evolve APIs
  37. beyond the database and the data warehouse
  38. move compute to the data
  39. data is a royal garden
  40. compute is a fungible commodity
  41. “I terminate the instance and relaunch it. That's my error handling.” (Source: @jtimberman on Twitter)
  42. the cloud is an architectural and cultural fit for data science
  43. amazon web services
  44. your data science platform
  45. s3://1000genomes
  46. http://aws.amazon.com/publicdatasets/
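
The public data sets, s3://1000genomes among them, are ordinary S3 buckets, so analyses can launch compute next to the data instead of copying it down first. A minimal sketch of browsing one anonymously with boto, assuming the bucket permits anonymous listing as the AWS public data sets do:

      import boto

      # no credentials needed for a public bucket
      conn = boto.connect_s3(anon=True)
      bucket = conn.get_bucket("1000genomes")

      # list the top-level prefixes ("directories") in the bucket
      for prefix in bucket.list("", "/"):
          print prefix.name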
  47. Credit: Angel Pizarro, U. Penn
  48. http://usegalaxy.org/cloud
  49. mapreduce for genomics: Crossbow (http://bowtie-bio.sourceforge.net/crossbow/index.shtml), Contrail (http://contrail-bio.sourceforge.net), Myrna (http://bowtie-bio.sourceforge.net/myrna/index.shtml)
  50. AWS knows scalable infrastructure
  51. you know the science
  52. we can make this work together
  53. http://aws.amazon.com/education | http://aws.amazon.com/publicdatasets
  54. deesingh@amazon.com | Twitter: @mndoci | http://slideshare.net/mndoci | http://mndoci.com. Inspiration and ideas from Matt Wood, James Hamilton & Larry Lessig. Credit: Oberazzi, under a CC-BY-NC-SA license.
