There is no magic
There is only awesome
    A platform for data science


Deepak Singh
bioinformatics


Image: Ethan Hein
collection
curation
analysis
what’s the big deal?
Source: http://www.nature.com/news/specials/bigdata/index.html
Image: Yael Fitzpatrick (AAAS)
lots of data
lots of people
lots of places
to make data effective
versioning
provenance
filter
aggregate
extend
mashup
human interfaces
hard problem
really hard problem
change how we think about compute
change how we think about data
change how we think about science
information platforms
Image: Drew Conway
dataspaces


Further reading: Jeff Hammerbacher, “Information Platforms and the Rise of the Data Scientist,” in Beautiful Data
the unreasonable effectiveness of data

Halevy, et al. IEEE Intelligent Systems, 24, 8-12 (2009)
accept all data formats
evolve APIs
beyond the database and the data warehouse
data as a programmable resource
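A minimal sketch of what “data as a programmable resource” looks like with boto against S3; the bucket and key names below are hypothetical placeholders:

import boto

# credentials picked up from the environment or ~/.boto
conn = boto.connect_s3()

# bucket and key names are hypothetical
bucket = conn.create_bucket("my-dataspace-bucket")
key = bucket.new_key("experiments/run-001/results.csv")

# write, read, and list data as ordinary program objects
key.set_contents_from_string("sample,value\nA,0.42\n")
print key.get_contents_as_string()
for k in bucket.list(prefix="experiments/"):
    print k.name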
data as a royal garden
compute as a fungible commodity
which brings us to ...
amazon web services
common characteristics
on demand
pay as you go
secure
elastic
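As a rough illustration of on-demand, pay-as-you-go capacity (the AMI id below is a placeholder, not a real image), boto can acquire and release instances in a few lines:

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# launch an instance on demand (AMI id is a placeholder)
reservation = conn.run_instances("ami-xxxxxxxx", instance_type="m1.small")
instance = reservation.instances[0]

# ... do work, pay only for the hours used ...

# give the capacity back when done
conn.terminate_instances(instance_ids=[instance.id])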
3,000 CPUs for one firm’s risk management application
[Chart: number of EC2 instances, 4/22/2009–4/28/2009 — roughly 3,000 instances on weekdays, scaling down to ~300 on weekends]
programmable
“infrastructure as code”
include_recipe "packages"
include_recipe "ruby"
include_recipe "apache2"

if platform?("centos", "redhat")
  if dist_only?
    # just the gem, we'll install the apache module within apache2
    package "rubygem-passenger"
    return
  else
    package "httpd-devel"
  end
else
  %w{ apache2-prefork-dev libapr1-dev }.each do |pkg|
    package pkg do
      action :upgrade
    end
  end
end

gem_package "passenger" do
  version node[:passenger][:version]
end

execute "passenger_module" do
  # answer the installer's interactive prompts with newlines
  command 'echo -en "\n\n\n\n" | passenger-install-apache2-module'
  creates node[:passenger][:module_path]
end
import time

import boto
import boto.emr
from boto.emr.step import StreamingStep
from boto.emr.bootstrap_action import BootstrapAction

# set your AWS keys and S3 bucket, e.g. from the environment or ~/.boto
AWSKEY = "<your AWS access key>"
SECRETKEY = "<your AWS secret key>"
S3_BUCKET = "<your S3 bucket>"
NUM_INSTANCES = 1

# connect to Elastic MapReduce
conn = boto.connect_emr(AWSKEY, SECRETKEY)

# bootstrap action: install packages on each node
bootstrap_step = BootstrapAction("download.tst",
    "s3://elasticmapreduce/bootstrap-actions/download.sh", None)

# set up the mapper and reducer
step = StreamingStep(name='Wordcount',
                     mapper='s3n://elasticmapreduce/samples/wordcount/wordSplitter.py',
                     cache_files=["s3n://" + S3_BUCKET + "/boto.mod#boto.mod"],
                     reducer='aggregate',
                     input='s3n://elasticmapreduce/samples/wordcount/input',
                     output='s3n://' + S3_BUCKET + '/output/wordcount_output')

jobid = conn.run_jobflow(
    name="testbootstrap",
    log_uri="s3://" + S3_BUCKET + "/logs",
    steps=[step],
    bootstrap_actions=[bootstrap_step],
    num_instances=NUM_INSTANCES)

print "finished spawning job (note: starting still takes time)"

# poll the job state until the flow completes
state = conn.describe_jobflow(jobid).state
print "job state = ", state
print "job id = ", jobid
while state != u'COMPLETED':
    print time.localtime()
    time.sleep(30)
    state = conn.describe_jobflow(jobid).state
    print "job state = ", state
    print "job id = ", jobid

print "final output can be found in s3://" + S3_BUCKET + "/output/wordcount_output"
print "try: $ s3cmd sync s3://" + S3_BUCKET + "/output/wordcount_output ."
“I terminate the instance and relaunch it. That’s my error handling.”
Source: @jtimberman on Twitter
compute is a fungible commodity
emphasis on
productivity
you can get a lot of awesome
dive in
just a little
S3
Simple Storage Service
highly durable
99.999999999%
highly scalable
EC2
Elastic Compute Cloud
dynamic
autoscaling
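A sketch of the autoscaling idea using boto's autoscale API; the group name, config name, and AMI id are placeholders:

import boto.ec2.autoscale
from boto.ec2.autoscale import LaunchConfiguration, AutoScalingGroup

conn = boto.connect_autoscale()

# what each new instance looks like (AMI id is a placeholder)
lc = LaunchConfiguration(name="worker-config",
                         image_id="ami-xxxxxxxx",
                         instance_type="m1.large")
conn.create_launch_configuration(lc)

# let the group grow and shrink between 1 and 20 instances
group = AutoScalingGroup(group_name="worker-group",
                         availability_zones=["us-east-1a"],
                         launch_config=lc,
                         min_size=1, max_size=20)
conn.create_auto_scaling_group(group)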
EC2 instance types
standard “m1”
high cpu “c1”
high memory “m2”
http://aws.amazon.com/ec2/instance-types/
cluster compute instances
cluster GPU instances
cluster compute “cc1”
cluster GPU “cg1”
http://aws.amazon.com/ec2/instance-types/
10 Gbps
placement group
full bisection bandwidth
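A hedged sketch of launching cluster compute instances into a placement group with boto; the group name and AMI id are placeholders:

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# instances in a cluster placement group share the full-bisection 10 Gbps fabric
conn.create_placement_group("hpc-cluster", strategy="cluster")

# AMI id is a placeholder for a cluster-compute (HVM) image
reservation = conn.run_instances("ami-xxxxxxxx",
                                 min_count=8, max_count=8,
                                 instance_type="cc1.4xlarge",
                                 placement_group="hpc-cluster")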
Linpack benchmark
880-instance CC1 cluster
Performance: 41.82 TFlops*
*#231 in Nov 2010 Top 500 rankings
WIEN2k Parallel Performance
Credit: K. Jorissen, F. D. Villa, and J. J. Rehr (U. Washington)
KS for a huge system at 1 k-point: H size 56,000 (25 GB)
Runtime (16x8 processors):
  Local (InfiniBand): 3h:48
  Cloud (10 Gbps): 1h:30 ($40)
very demanding network performance
1200-atom unit cell; SCALAPACK+MPI diagonalization, matrix size 50k–100k
cost and use models
[Chart: % utilization vs. time — reserved capacity at the base, on-demand above it, and spot on top, approaching the ideal effective utilization]
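One way to pick up the spot layer of that curve from code; the bid price and AMI id below are placeholders:

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")

# bid for spare capacity; instances run while the spot price stays under the bid
requests = conn.request_spot_instances(price="0.10",
                                       image_id="ami-xxxxxxxx",
                                       count=4,
                                       instance_type="m1.large")
for req in requests:
    print req.id, req.state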
making things easier
Elastic Beanstalk
Heroku
data at scale
some practical considerations
everything fails all the time
compute needs vary
new data/compute paradigms
Amazon Elastic MapReduce
doing stuff
Customer case study: cyclopic energy (OpenFOAM®)
http://aws.amazon.com/solutions/case-studies/cyclopic-energy/
NASA JPL
Credit: Angel Pizarro, U. Penn
http://aws.amazon.com/solutions/case-studies/numerate/
Bioproximity
http://aws.amazon.com/solutions/case-studies/bioproximity/
http://usegalaxy.org/cloud
mapreduce for genomics
http://bowtie-bio.sourceforge.net/crossbow/index.shtml
http://contrail-bio.sourceforge.net
http://bowtie-bio.sourceforge.net/myrna/index.shtml
http://cloudbiolinux.org/
in summary
large scale data requires a rethink
in data architecture, compute architecture, and infrastructure
the cloud
distributed, programmable infrastructure
rapid, massive scaling
architecture evolved with the internet
can we build data science platforms?
there is no magic
there is only awesome
two more things
10 minutes
http://aws.amazon.com/about-aws/build-a-cluster-in-under-10/
http://aws.amazon.com/education
deesingh@amazon.com
Twitter: @mndoci
http://slideshare.net/mndoci
http://mndoci.com




Inspiration and ideas from Matt Wood & Larry Lessig

Credit: Oberazzi under a CC-BY-NC-SA license

Systems Bioinformatics Workshop Keynote