Talk at NCRR P41 Director's Meeting
Invited Talk given at the NCRR P41 Director's meeting on October 12, 2010

Usage Rights: CC Attribution License

    Presentation Transcript

    • Amazon Web Services A platform for life science research Deepak Singh, Ph.D. Amazon Web Services NCRR P41 PI meeting, October 2010
    • the new reality
    • lots and lots and lots and lots and lots of data
    • lots and lots and lots and lots and lots of people
    • lots and lots and lots and lots and lots of places
    • constant change
    • science in a new reality
    • science in a new reality ^
    • data science in a new reality ^
    • Image: Drew Conway
    • goal
    • optimize the most valuable resource
    • compute, storage, workflows, memory, transmission, algorithms, cost, …
    • people Credit: Pieter Musterd, under a CC-BY-NC-ND license
    • enter the cloud
    • what is the cloud?
    • infrastructure
    • scalable
    • 3000 CPUs for one firm’s risk management application [chart: number of EC2 instances over one week, day by day]
    • highly available
    • US East Region: Availability Zones A, B, C, D
    • durable
    • 99.999999999%
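      The eleven-nines figure is S3's stated design durability. A quick back-of-envelope sketch of what it implies; the object count below is purely illustrative, not from the talk:

      ```python
      # What 99.999999999% ("eleven nines") annual durability implies.
      durability = 0.99999999999
      annual_loss_rate = 1 - durability       # probability a given object is lost in a year

      objects = 10_000_000_000                # hypothetical: ten billion stored objects
      expected_losses_per_year = objects * annual_loss_rate

      print(annual_loss_rate)                 # on the order of 1e-11
      print(expected_losses_per_year)         # roughly 0.1 objects per year
      ```

      In other words, at that durability even a ten-billion-object archive would expect to lose about one object per decade.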
    • dynamic
    • extensible
    • secure
    • a utility
    • on-demand instances reserved instances spot instances
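      The three purchasing models trade upfront commitment for a lower hourly rate. A sketch of the break-even arithmetic between on-demand and reserved instances; all prices here are hypothetical placeholders, not actual AWS rates:

      ```python
      # Break-even sketch: on-demand vs. reserved instances.
      # All prices are hypothetical placeholders, not real AWS pricing.
      on_demand_hourly = 0.10      # $/hour, pay as you go
      reserved_upfront = 300.0     # one-time fee for a 1-year reservation
      reserved_hourly = 0.04       # discounted $/hour with the reservation

      def yearly_cost_on_demand(hours):
          return on_demand_hourly * hours

      def yearly_cost_reserved(hours):
          return reserved_upfront + reserved_hourly * hours

      # Utilization (hours/year) at which the reservation starts to pay off.
      break_even_hours = reserved_upfront / (on_demand_hourly - reserved_hourly)
      print(break_even_hours)  # 5000.0 hours, ~57% of an 8760-hour year
      ```

      Below that utilization on-demand is cheaper; above it the reservation wins, and spot instances undercut both when interruption is tolerable.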
    • infrastructure as code
    • class Instance
        attr_accessor :aws_hash, :elastic_ip

        def initialize(hash, elastic_ip = nil)
          @aws_hash = hash
          @elastic_ip = elastic_ip
        end

        def public_dns
          @aws_hash[:dns_name] || ""
        end

        def friendly_name
          public_dns.empty? ? status.capitalize : public_dns.split(".")[0]
        end

        def id
          @aws_hash[:aws_instance_id]
        end
      end
    • include_recipe "packages"
      include_recipe "ruby"
      include_recipe "apache2"

      if platform?("centos", "redhat")
        if dist_only?
          # just the gem, we'll install the apache module within apache2
          package "rubygem-passenger"
          return
        else
          package "httpd-devel"
        end
      else
        %w{ apache2-prefork-dev libapr1-dev }.each do |pkg|
          package pkg do
            action :upgrade
          end
        end
      end

      gem_package "passenger" do
        version node[:passenger][:version]
      end

      execute "passenger_module" do
        command 'echo -en "\n\n\n\n" | passenger-install-apache2-module'
        creates node[:passenger][:module_path]
      end
    • import boto
      import boto.emr
      from boto.emr.step import StreamingStep
      from boto.emr.bootstrap_action import BootstrapAction
      import time

      # set your aws keys and S3 bucket, e.g. from environment or .boto
      AWSKEY =
      SECRETKEY =
      S3_BUCKET =
      NUM_INSTANCES = 1

      # Connect to Elastic MapReduce
      conn = boto.connect_emr(AWSKEY, SECRETKEY)

      # Install packages
      bootstrap_step = BootstrapAction("download.tst",
          "s3://elasticmapreduce/bootstrap-actions/", None)

      # Set up mappers & reducers
      step = StreamingStep(name='Wordcount',
          mapper='s3n://elasticmapreduce/samples/wordcount/',
          cache_files=["s3n://" + S3_BUCKET + "/boto.mod#boto.mod"],
          reducer='aggregate',
          input='s3n://elasticmapreduce/samples/wordcount/input',
          output='s3n://' + S3_BUCKET + '/output/wordcount_output')

      jobid = conn.run_jobflow(
          name="testbootstrap",
          log_uri="s3://" + S3_BUCKET + "/logs",
          steps=[step],
          bootstrap_actions=[bootstrap_step],
          num_instances=NUM_INSTANCES)

      print "finished spawning job (note: starting still takes time)"

      # Poll job state until completion
      state = conn.describe_jobflow(jobid).state
      print "job state = ", state
      print "job id = ", jobid
      while state != u'COMPLETED':
          print time.localtime()
          time.sleep(30)
          state = conn.describe_jobflow(jobid).state
          print "job state = ", state
          print "job id = ", jobid

      print "final output can be found in s3://" + S3_BUCKET + "/output" + TIMESTAMP
      print "try: $ s3cmd sync s3://" + S3_BUCKET + "/output" + TIMESTAMP + " ."
    • a data science platform
    • dataspaces Further reading: Jeff Hammerbacher, Information Platforms and the rise of the data scientist, Beautiful Data
    • accept all data formats
    • evolve APIs
    • beyond the database and the data warehouse
    • move compute to the data
    • data is a royal garden
    • compute is a fungible commodity
    • “I terminate the instance and relaunch it. Thats my error handling” Source: @jtimberman on Twitter
    • the cloud is an architectural and cultural fit for data science
    • amazon web services
    • your data science platform
    • s3://1000genomes
    • Credit: Angel Pizarro, U. Penn
    • mapreduce for genomics
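      A minimal single-process sketch of the map/shuffle/reduce pattern applied to sequence data — counting k-mers across reads. Elastic MapReduce distributes this same pattern across many nodes; the reads and k here are toy values for illustration:

      ```python
      from collections import defaultdict

      def map_phase(read, k=3):
          """Emit (k-mer, 1) pairs from one read."""
          return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

      def shuffle(pairs):
          """Group emitted values by key, as the framework does between phases."""
          groups = defaultdict(list)
          for key, value in pairs:
              groups[key].append(value)
          return groups

      def reduce_phase(groups):
          """Sum the counts for each k-mer."""
          return {kmer: sum(counts) for kmer, counts in groups.items()}

      reads = ["GATTACA", "TACAGAT"]  # toy input reads
      pairs = [pair for read in reads for pair in map_phase(read)]
      counts = reduce_phase(shuffle(pairs))
      print(counts["ACA"])  # "ACA" occurs once in each read -> 2
      ```

      The mapper and reducer are independent per key, which is what lets the framework scale them out across machines.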
    • AWS knows scalable infrastructure
    • you know the science
    • we can make this work together
    • Twitter: @mndoci Inspiration and ideas from Matt Wood, James Hamilton & Larry Lessig Credit: Oberazzi under a CC-BY-NC-SA license