Systems Bioinformatics Workshop Keynote
 

Usage rights: CC Attribution License

Presentation Transcript

    • There is no magic. There is only awesome. A platform for data science. Deepak Singh
    • bioinformatics (image: Ethan Hein)
    • 3
    • collection
    • curation
    • analysis
    • what’s the big deal?
    • Source: http://www.nature.com/news/specials/bigdata/index.html
    • Image: Yael Fitzpatrick (AAAS)
    • lots of data
    • lots of people
    • lots of places
    • to make data effective
    • versioning
    • provenance
    • filter
    • aggregate
    • extend
    • mashup
    • human interfaces
    • hard problem
    • really hard problem
    • change how we think about compute
    • change how we think about data
    • change how we think about science
    • information platforms
    • Image: Drew Conway
    • dataspaces (further reading: Jeff Hammerbacher, "Information Platforms and the Rise of the Data Scientist," in Beautiful Data)
    • the unreasonable effectiveness of data (Halevy et al., IEEE Intelligent Systems 24, 8-12, 2009)
    • accept all data formats
    • evolve APIs
    • beyond the database and the data warehouse
    • data as a programmable resource
    • data as a royal garden
    • compute as a fungible commodity
    • which brings us to ...
    • amazon web services
    • common characteristics
    • on demand
    • pay as you go
    • secure
    • elastic
    • 3000 CPUs for one firm's risk management application [chart: EC2 instance count ramping up and back down over one week in April 2008]
    • programmable
    • “infrastructure as code”
    • include_recipe "packages"
      include_recipe "ruby"
      include_recipe "apache2"

      if platform?("centos", "redhat")
        if dist_only?
          # just the gem, we'll install the apache module within apache2
          package "rubygem-passenger"
          return
        else
          package "httpd-devel"
        end
      else
        %w{ apache2-prefork-dev libapr1-dev }.each do |pkg|
          package pkg do
            action :upgrade
          end
        end
      end

      gem_package "passenger" do
        version node[:passenger][:version]
      end

      execute "passenger_module" do
        command "echo -en '\\n\\n\\n\\n' | passenger-install-apache2-module"
        creates node[:passenger][:module_path]
      end
    • import time
      import boto
      import boto.emr
      from boto.emr.step import StreamingStep
      from boto.emr.bootstrap_action import BootstrapAction

      # set your aws keys and S3 bucket, e.g. from environment or .boto
      AWSKEY = ""
      SECRETKEY = ""
      S3_BUCKET = ""
      NUM_INSTANCES = 1
      TIMESTAMP = ""   # suffix for the output path; set as needed

      # connect to Elastic MapReduce
      conn = boto.connect_emr(AWSKEY, SECRETKEY)

      # install packages
      bootstrap_step = BootstrapAction(
          "download.tst",
          "s3://elasticmapreduce/bootstrap-actions/download.sh",
          None)

      # set up mappers & reducers
      step = StreamingStep(
          name="Wordcount",
          mapper="s3n://elasticmapreduce/samples/wordcount/wordSplitter.py",
          cache_files=["s3n://" + S3_BUCKET + "/boto.mod#boto.mod"],
          reducer="aggregate",
          input="s3n://elasticmapreduce/samples/wordcount/input",
          output="s3n://" + S3_BUCKET + "/output/wordcount_output")

      jobid = conn.run_jobflow(
          name="testbootstrap",
          log_uri="s3://" + S3_BUCKET + "/logs",
          steps=[step],
          bootstrap_actions=[bootstrap_step],
          num_instances=NUM_INSTANCES)

      print "finished spawning job (note: starting still takes time)"
      state = conn.describe_jobflow(jobid).state
      print "job state = ", state
      print "job id = ", jobid
      while state != u"COMPLETED":
          print time.localtime()
          time.sleep(30)
          state = conn.describe_jobflow(jobid).state
          print "job state = ", state
          print "job id = ", jobid
      print "final output can be found in s3://" + S3_BUCKET + "/output" + TIMESTAMP
      print "try: $ s3cmd sync s3://" + S3_BUCKET + "/output" + TIMESTAMP + " ."
    • "I terminate the instance and relaunch it. That's my error handling." Source: @jtimberman on Twitter
    • compute is a fungible commodity
    • emphasis onproductivity
    • you can get a lot of awesome
    • dive in
    • just a little
    • S3: Simple Storage Service
    • highly durable
    • 99.999999999%
    • highly scalable
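
    A sketch of what S3 looks like as a programmable resource, using boto, the same Python library the deck's Elastic MapReduce example uses; the credentials, bucket name, and file here are placeholders, not anything from the talk:

      import boto
      from boto.s3.key import Key

      AWSKEY = ""      # placeholder credentials
      SECRETKEY = ""

      conn = boto.connect_s3(AWSKEY, SECRETKEY)
      bucket = conn.create_bucket("my-genomics-data")   # bucket names are globally unique

      # upload a local file as an object; S3 stores it redundantly
      # (the 99.999999999% durability above)
      k = Key(bucket)
      k.key = "reads/sample001.fastq.gz"
      k.set_contents_from_filename("sample001.fastq.gz")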
    • EC2: Elastic Compute Cloud
    • dynamic
    • autoscaling
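
    As a sketch of "dynamic," launching and terminating an on-demand instance with boto takes a few calls; the AMI ID and key pair below are invented placeholders:

      import boto

      conn = boto.connect_ec2(AWSKEY, SECRETKEY)

      # launch one on-demand instance from a (placeholder) machine image
      reservation = conn.run_instances(
          "ami-12345678",             # hypothetical AMI ID
          instance_type="m1.large",
          key_name="my-keypair")      # hypothetical key pair
      instance = reservation.instances[0]
      print "launched", instance.id

      # pay as you go: terminate when the work is done
      conn.terminate_instances([instance.id])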
    • EC2 instance types
    • EC2 instance types: standard "m1", high-CPU "c1", high-memory "m2" (http://aws.amazon.com/ec2/instance-types/)
    • cluster compute instances
    • cluster GPU instances
    • EC2 instance types: cluster compute "cc1", cluster GPU "cg1" (http://aws.amazon.com/ec2/instance-types/)
    • 10gbps
    • Placement Group
    • full bisection bandwidth
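
    A sketch of how that network is requested through boto: cluster compute instances are launched into a placement group, which co-locates them on the 10 Gbps fabric; the group name and AMI are placeholders:

      import boto

      conn = boto.connect_ec2(AWSKEY, SECRETKEY)

      # create a cluster placement group, then launch CC1 instances into it
      conn.create_placement_group("hpc-demo", strategy="cluster")
      reservation = conn.run_instances(
          "ami-12345678",             # hypothetical cluster-compute AMI
          min_count=8, max_count=8,
          instance_type="cc1.4xlarge",
          placement_group="hpc-demo")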
    • Linpack benchmark on an 880-instance CC1 cluster. Performance: 41.82 TFlops (#231 in the Nov 2010 Top 500 rankings)
    • WIEN2k parallel performance (U. Washington; credit: K. Jorissen, F. D. Villa, and J. J. Rehr): KS calculation for a huge system at 1 k-point, H size 56,000 (25 GB); 1200-atom unit cell, SCALAPACK+MPI diagonalization, matrix size 50k-100k; very demanding network performance. Runtime (16x8 processors): local (InfiniBand) 3h:48, cloud (10 Gbps) 1h:30 ($40)
    • cost and use models
    • [chart: % utilization vs. time, layering reserved, on-demand, and spot capacity up toward the ideal effective utilization]
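
    The spot layer in that chart is also an API call; a minimal boto sketch, with an invented bid price and AMI:

      import boto

      conn = boto.connect_ec2(AWSKEY, SECRETKEY)

      # bid for spare capacity; instances run while the spot price
      # stays below the bid
      requests = conn.request_spot_instances(
          price="0.20",               # hypothetical bid, $/instance-hour
          image_id="ami-12345678",    # hypothetical AMI
          count=4,
          instance_type="c1.xlarge")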
    • making things easier
    • Elastic Beanstalk
    • Heroku
    • [chart, repeated: % utilization vs. time, layering reserved, on-demand, and spot capacity up toward the ideal effective utilization]
    • data at scale
    • some practical considerations
    • everything fails all the time
    • compute needs vary
    • new data/compute paradigms
    • Amazon Elastic MapReduce
    • doing stuff
    • Customer case study: cyclopic energy (OpenFOAM®) http://aws.amazon.com/solutions/case-studies/cyclopic-energy/
    • NASA JPL
    • Credit: Angel Pizarro, U. Penn
    • http://aws.amazon.com/solutions/case-studies/numerate/
    • Bioproximity http://aws.amazon.com/solutions/case-studies/bioproximity/
    • http://usegalaxy.org/cloud
    • mapreduce for genomics http://bowtie-bio.sourceforge.net/crossbow/index.shtml http://contrail-bio.sourceforge.net http://bowtie-bio.sourceforge.net/myrna/index.shtml
    • http://cloudbiolinux.org/
    • in summary
    • large-scale data requires a rethink
    • data architecture
    • compute architecture
    • in infrastructure
    • the cloud
    • distributed, programmable infrastructure
    • rapid, massive, scaling
    • architecture evolved with the internet
    • can we build data science platforms?
    • there is no magic, there is only awesome
    • two more things
    • 10 minutes
    • http://aws.amazon.com/about-aws/build-a-cluster-in-under-10/
    • http://aws.amazon.com/education
    • deesingh@amazon.com | Twitter: @mndoci | http://slideshare.net/mndoci | http://mndoci.com. Inspiration and ideas from Matt Wood & Larry Lessig. Credit: Oberazzi, under a CC-BY-NC-SA license