There is no magic
There is only awesome
    Platforms for data science


        Deepak Singh
bioinformatics


image: Ethan Hein
3
collection
curation
analysis
what’s the big deal?
Source: http://www.nature.com/news/specials/bigdata/index.html
Image: Yael Fitzpatrick (AAAS)
lots of data
lots of people
lots of places
constant change
we want to make our
data more effective
versioning
provenance
filter
aggregate
extend
mashup
human interfaces
image: Leo Reynolds
hard problem
really hard problem
so how do we
get there?
information
 platforms
Image: Drew Conway
dataspaces


Further reading: Jeff Hammerbacher, Information Platforms and the rise of the data scientist, Beautiful Data
the unreasonable
          effectiveness of data

Halevy, et al. IEEE Intelligent Systems, 24, 8-12 (2009)
accept all data
   formats
evolve APIs
beyond databases and
 the data warehouse
data as a
programmable
   resource
data is a
royal garden
compute is a
fungible commodity
optimizing the most
 valuable resource
compute, storage,
   workflows, memory,
transmission, algorithms,
         cost, …
people



Credit: Pieter Musterd under a CC-BY-NC-ND license
Image: Chris Dagdigian
my bias
cloud services
distributed systems
scale
global
consumption
  models
on-demand
what is the value of
   your data?
Credit: Angel Pizarro, U. Penn
mapreduce for
  genomics
 http://bowtie-bio.sourceforge.net/crossbow/index.shtml
            http://contrail-bio.sourceforge.net
   http://bowtie-bio.sourceforge.net/myrna/index.shtml
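A minimal sketch of the map/shuffle/reduce pattern applied to sequence reads, here counting k-mers. It is a toy illustration only, not how Crossbow, Contrail, or Myrna are implemented; the function names, the example reads, and the choice of k = 3 are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def map_read(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one k-mer."""
    return key, sum(values)


def count_kmers(reads, k=3):
    """Run map, shuffle, and reduce in-process over a list of reads."""
    mapped = chain.from_iterable(map_read(r, k) for r in reads)
    return dict(reduce_counts(key, vals) for key, vals in shuffle(mapped).items())


# Hypothetical reads, invented for illustration.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads)
print(counts)
```

Because each read is mapped independently and each k-mer is reduced independently, both steps can be spread across many machines, which is what makes the pattern a fit for large read sets.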
Bioproximity




          http://aws.amazon.com/solutions/case-studies/bioproximity/
30,472 cores
$1279/hr
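A quick worked calculation of what those two figures imply per core. Only the core count and the hourly price come from the slide; the helper name and the 3-hour run are hypothetical, added to show the arithmetic.

```python
CORES = 30_472          # from the slide
COST_PER_HOUR = 1279.0  # USD per hour for the whole cluster, from the slide

# Implied price of one core for one hour.
cost_per_core_hour = COST_PER_HOUR / CORES


def run_cost(hours, cost_per_hour=COST_PER_HOUR):
    """Total cost of keeping the whole cluster up for `hours` (hypothetical helper)."""
    return hours * cost_per_hour


print(f"cost per core-hour: ${cost_per_core_hour:.4f}")
print(f"hypothetical 3-hour run: ${run_cost(3):.2f}")
```

At roughly four cents per core-hour, the question shifts from whether the compute is affordable to what the data and the people working on it are worth.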
http://cloudbiolinux.org/
http://usegalaxy.org/cloud
in summary
large scale data
requires a rethink
data architecture
compute architecture
distributed,
programmable
 infrastructure
cloud services
remove constraints
can we build data
science platforms?
there is no magic
there is only awesome
deesingh@amazon.com
                                                             Twitter: @mndoci
                                               http://slideshare.net/mndoci
                                                          http://mndoci.com




         Inspiration and ideas from
          Matt Wood & Larry Lessig


Credit: Oberazzi under a CC-BY-NC-SA license

Platforms for Data Science - Computing on the Brink