There is no magic
There is only awesome
    A platform for data science


Deepak Singh
bioinformatics


Image: Ethan Hein
collection
curation
analysis
what’s the big deal?
Source: http://www.nature.com/news/specials/bigdata/index.html
Image: Yael Fitzpatrick (AAAS)
lots of data
lots of people
lots of places
to make data effective
versioning
provenance
filter
aggregate
extend
mashup
human interfaces
hard problem
really hard problem
change how we think about compute
change how we think about data
change how we think about science
information platforms
Image: Drew Conway
dataspaces


Further reading: Jeff Hammerbacher, “Information Platforms and the Rise of the Data Scientist,” in Beautiful Data
the unreasonable effectiveness of data

Halevy, et al. IEEE Intelligent Systems, 24, 8-12 (2009)
accept all data formats
evolve APIs
beyond the database and the data warehouse
data as a programmable resource
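A minimal sketch of what “data as a programmable resource” looks like with boto against S3; the bucket and key names below are hypothetical placeholders:

import boto

# credentials picked up from the environment or ~/.boto
conn = boto.connect_s3()

# bucket and key names are hypothetical
bucket = conn.create_bucket("my-dataspace-bucket")
key = bucket.new_key("experiments/run-001/results.csv")

# write, read, and list data as ordinary program objects
key.set_contents_from_string("sample,value\nA,0.42\n")
print key.get_contents_as_string()
for k in bucket.list(prefix="experiments/"):
    print k.name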
data as a royal garden
compute as a fungible commodity
which brings us to ...
amazon web services
common characteristics
on demand
pay as you go
secure
elastic
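As a rough illustration of on-demand, pay-as-you-go capacity (the AMI id below is a placeholder, not a real image), boto can acquire and release instances in a few lines:

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# launch an instance on demand (AMI id is a placeholder)
reservation = conn.run_instances("ami-xxxxxxxx", instance_type="m1.small")
instance = reservation.instances[0]

# ... do work, pay only for the hours used ...

# give the capacity back when done
conn.terminate_instances(instance_ids=[instance.id])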
3,000 CPUs for one firm’s risk management application
[Chart: number of EC2 instances, 4/22/2009–4/28/2009 — roughly 3,000 instances on weekdays, scaling down to ~300 on weekends]
programmable
“infrastructure as code”
include_recipe "packages"
include_recipe "ruby"
include_recipe "apache2"

if platform?("centos", "redhat")
  if dist_only?
    # just the gem, we'll install the apache module within apache2
    package "rubygem-passenger"
    return
  else
    package "httpd-devel"
  end
else
  %w{ apache2-prefork-dev libapr1-dev }.each do |pkg|
    package pkg do
      action :upgrade
    end
  end
end

gem_package "passenger" do
  version node[:passenger][:version]
end

execute "passenger_module" do
  # answer the installer's interactive prompts with newlines
  command 'echo -en "\n\n\n\n" | passenger-install-apache2-module'
  creates node[:passenger][:module_path]
end
import time

import boto
import boto.emr
from boto.emr.step import StreamingStep
from boto.emr.bootstrap_action import BootstrapAction

# set your AWS keys and S3 bucket, e.g. from the environment or ~/.boto
AWSKEY = "<your AWS access key>"
SECRETKEY = "<your AWS secret key>"
S3_BUCKET = "<your S3 bucket>"
NUM_INSTANCES = 1

# connect to Elastic MapReduce
conn = boto.connect_emr(AWSKEY, SECRETKEY)

# bootstrap action: install packages on each node
bootstrap_step = BootstrapAction("download.tst",
    "s3://elasticmapreduce/bootstrap-actions/download.sh", None)

# set up the mapper and reducer
step = StreamingStep(name='Wordcount',
                     mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
                     cache_files=["s3n://" + S3_BUCKET + "/boto.mod#boto.mod"],
                     reducer='aggregate',
                     input='s3n://elasticmapreduce/samples/wordcount/input',
                     output='s3n://' + S3_BUCKET + '/output/wordcount_output')

jobid = conn.run_jobflow(
    name="testbootstrap",
    log_uri="s3://" + S3_BUCKET + "/logs",
    steps=[step],
    bootstrap_actions=[bootstrap_step],
    num_instances=NUM_INSTANCES)

print "finished spawning job (note: starting still takes time)"

# poll the job state until the flow completes
state = conn.describe_jobflow(jobid).state
print "job state = ", state
print "job id = ", jobid
while state != u'COMPLETED':
    print time.localtime()
    time.sleep(30)
    state = conn.describe_jobflow(jobid).state
    print "job state = ", state
    print "job id = ", jobid

print "final output can be found in s3://" + S3_BUCKET + "/output/wordcount_output"
print "try: $ s3cmd sync s3://" + S3_BUCKET + "/output/wordcount_output ."
“I terminate the instance and relaunch it. That’s my error handling.”
Source: @jtimberman on Twitter
compute is a fungible commodity
emphasis on
productivity
you can get a lot of awesome
dive in
just a little
S3
Simple Storage Service
highly durable
99.999999999%
highly scalable
EC2
Elastic Compute Cloud
dynamic
autoscaling
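A sketch of the autoscaling idea using boto's autoscale API; the group name, config name, and AMI id are placeholders:

import boto.ec2.autoscale
from boto.ec2.autoscale import LaunchConfiguration, AutoScalingGroup

conn = boto.connect_autoscale()

# what each new instance looks like (AMI id is a placeholder)
lc = LaunchConfiguration(name="worker-config",
                         image_id="ami-xxxxxxxx",
                         instance_type="m1.large")
conn.create_launch_configuration(lc)

# let the group grow and shrink between 1 and 20 instances
group = AutoScalingGroup(group_name="worker-group",
                         availability_zones=["us-east-1a"],
                         launch_config=lc,
                         min_size=1, max_size=20)
conn.create_auto_scaling_group(group)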
EC2 instance types
standard “m1”
high cpu “c1”
high memory “m2”
http://aws.amazon.com/ec2/instance-types/
cluster compute instances
cluster GPU instances
cluster compute “cc1”
cluster GPU “cg1”
http://aws.amazon.com/ec2/instance-types/
10 Gbps
placement group
full bisection bandwidth
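A hedged sketch of launching cluster compute instances into a placement group with boto; the group name and AMI id are placeholders:

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# instances in a cluster placement group share the full-bisection 10 Gbps fabric
conn.create_placement_group("hpc-cluster", strategy="cluster")

# AMI id is a placeholder for a cluster-compute (HVM) image
reservation = conn.run_instances("ami-xxxxxxxx",
                                 min_count=8, max_count=8,
                                 instance_type="cc1.4xlarge",
                                 placement_group="hpc-cluster")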
Linpack benchmark
880-instance CC1 cluster
Performance: 41.82 TFlops*
*#231 in Nov 2010 Top 500 rankings
WIEN2k Parallel Performance
Credit: K. Jorissen, F. D. Villa, and J. J. Rehr (U. Washington)
KS for a huge system at 1 k-point: H size 56,000 (25 GB)
Runtime (16x8 processors):
  Local (InfiniBand): 3h:48
  Cloud (10 Gbps): 1h:30 ($40)
very demanding network performance
1200-atom unit cell; SCALAPACK+MPI diagonalization, matrix size 50k–100k
cost and use models
[Chart: % utilization vs. time — reserved capacity at the base, on-demand above it, and spot on top, approaching the ideal effective utilization]
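One way to pick up the spot layer of that curve from code; the bid price and AMI id below are placeholders:

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# bid for spare capacity; instances run while the spot price stays under the bid
requests = conn.request_spot_instances(price="0.10",
                                       image_id="ami-xxxxxxxx",
                                       count=4,
                                       instance_type="m1.large")
for req in requests:
    print req.id, req.state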
making things easier
Elastic Beanstalk
Heroku
data at scale
some practical considerations
everything fails all the time
compute needs vary
new data/compute paradigms
Amazon Elastic MapReduce
doing stuff
Customer case study: cyclopic energy (OpenFOAM®)
http://aws.amazon.com/solutions/case-studies/cyclopic-energy/
NASA JPL
Credit: Angel Pizarro, U. Penn
http://aws.amazon.com/solutions/case-studies/numerate/
Bioproximity
http://aws.amazon.com/solutions/case-studies/bioproximity/
http://usegalaxy.org/cloud
mapreduce for genomics
http://bowtie-bio.sourceforge.net/crossbow/index.shtml
http://contrail-bio.sourceforge.net
http://bowtie-bio.sourceforge.net/myrna/index.shtml
http://cloudbiolinux.org/
in summary
large scale data requires a rethink
in data architecture, compute architecture, and infrastructure
the cloud
distributed, programmable infrastructure
rapid, massive scaling
architecture evolved with the internet
can we build data science platforms?
there is no magic
there is only awesome
two more things
10 minutes
http://aws.amazon.com/about-aws/build-a-cluster-in-under-10/
http://aws.amazon.com/education
deesingh@amazon.com
Twitter: @mndoci
http://slideshare.net/mndoci
http://mndoci.com




Inspiration and ideas from Matt Wood & Larry Lessig

Credit: Oberazzi under a CC-BY-NC-SA license

Systems Bioinformatics Workshop Keynote