Platforms for data science




Deepak Singh, Ph.D.
Amazon Web Services


                      Data transmission for international genomics projects 2010
the new reality
lots and lots and lots
and lots and lots of data
lots and lots and lots
 and lots and lots of
       people
lots and lots and lots
 and lots and lots of
       places
constant change
science in a new reality
data science in a new reality
data as a
programmable resource
versioning
provenance capture
filter
aggregate
integrate
extend
mashup
automate
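The operations listed above can be sketched in a few lines of Ruby. This is a minimal illustration, not a method from the talk: the `Dataset` class, its methods, and the sample reads are all hypothetical. Every derived dataset carries a content-hash version and a record of the steps that produced it.

```ruby
require "digest"

# Hypothetical illustration: an immutable dataset that records how it was derived.
class Dataset
  attr_reader :rows, :provenance

  def initialize(rows, provenance = [])
    @rows = rows.freeze
    @provenance = provenance.freeze
  end

  # A content hash acts as a simple version identifier.
  def version
    Digest::SHA256.hexdigest(Marshal.dump(@rows))[0, 12]
  end

  # Each operation returns a new Dataset and appends a provenance step.
  def filter(label, &block)
    derive("filter: #{label}", @rows.select(&block))
  end

  def aggregate(label, key)
    counts = @rows.group_by { |r| r[key] }.map { |k, v| { key => k, count: v.size } }
    derive("aggregate: #{label}", counts)
  end

  private

  def derive(step, new_rows)
    Dataset.new(new_rows, @provenance + [{ step: step, parent: version }])
  end
end

# Made-up sample data for the sketch.
reads = Dataset.new([
  { sample: "A", chrom: "chr1", quality: 37 },
  { sample: "A", chrom: "chr2", quality: 12 },
  { sample: "B", chrom: "chr1", quality: 40 }
])

good = reads.filter("quality >= 30") { |r| r[:quality] >= 30 }
by_chrom = good.aggregate("reads per chromosome", :chrom)

puts by_chrom.rows.inspect
puts by_chrom.provenance.map { |p| p[:step] }.inspect
```

The original `reads` object is never modified, so any earlier version can still be filtered or aggregated a different way.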
human interfaces
tough problem
really tough problem in
     the new reality
goal
optimize the most
valuable resource
compute, storage,
   workflows, memory,
transmission, algorithms,
         cost, …
people



Credit: Pieter Musterd under a CC-BY-NC-ND license
enter the cloud
what is the cloud?
infrastructure
scalable
highly available
dynamic
extensible
secure
a utility
programmable
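class Instance
  attr_accessor :aws_hash, :elastic_ip

  def initialize(hash, elastic_ip = nil)
    @aws_hash = hash
    @elastic_ip = elastic_ip
  end

  def public_dns
    @aws_hash[:dns_name] || ""
  end

  # Assumption: the slide calls status without defining it;
  # the :aws_state key used here is a guess, not from the slide.
  def status
    @aws_hash[:aws_state].to_s
  end

  # Short host name when a DNS name exists, otherwise the instance state.
  def friendly_name
    public_dns.empty? ? status.capitalize : public_dns.split(".")[0]
  end

  def id
    @aws_hash[:aws_instance_id]
  end
end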
include_recipe "packages"
include_recipe "ruby"
include_recipe "apache2"

if platform?("centos","redhat")
  if dist_only?
     # just the gem, we'll install the apache module within apache2
     package "rubygem-passenger"
     return
  else
     package "httpd-devel"
  end
else
  %w{ apache2-prefork-dev libapr1-dev }.each do |pkg|
     package pkg do
       action :upgrade
     end
  end
end

gem_package "passenger" do
  version node[:passenger][:version]
end

execute "passenger_module" do
  command 'echo -en "\n\n\n\n" | passenger-install-apache2-module'
  creates node[:passenger][:module_path]
end
a data science platform
dataspaces



Further reading: Jeff Hammerbacher, Information Platforms and the rise of the data scientist, Beautiful Data
accept all data formats
evolve APIs
beyond the database
and the data warehouse
move compute to the
       data
data is a royal garden
compute is a
fungible commodity
“I terminate the instance
   and relaunch it. That's my
   error handling”


Source: @jtimberman on Twitter
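A minimal sketch of that idea, assuming compute really is disposable: on failure, discard the instance and run the job on a fresh one. `FakeCloud` and `run_with_replacement` are hypothetical names, and the class is an in-memory stand-in; no real AWS API is called.

```ruby
# Hypothetical in-memory stand-in for a cloud API; no real AWS calls are made.
class FakeCloud
  attr_reader :launched, :terminated

  def initialize
    @launched = 0
    @terminated = []
  end

  def launch
    @launched += 1
    "i-#{format('%04d', @launched)}"
  end

  def terminate(instance_id)
    @terminated << instance_id
  end
end

# Run a job on a fresh instance; on failure, discard the instance and try another.
def run_with_replacement(cloud, max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    instance_id = cloud.launch
    yield instance_id
  rescue StandardError
    cloud.terminate(instance_id)
    retry if attempts < max_attempts
    raise
  end
end

cloud = FakeCloud.new
result = run_with_replacement(cloud) do |id|
  raise "disk failure on #{id}" if id == "i-0001" # simulate one bad instance
  "job finished on #{id}"
end

puts result
puts cloud.terminated.inspect
```

The job never tries to repair the failed instance; it only needs to be safe to run again from the start.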
the cloud is an
architectural and
  cultural fit for
  data science
amazon web services
your data science platform
s3://1000genomes
Credit: Angel Pizarro, U. Penn
http://usegalaxy.org/cloud
mapreduce for
    genomics

http://bowtie-bio.sourceforge.net/crossbow/index.shtml
           http://contrail-bio.sourceforge.net
  http://bowtie-bio.sourceforge.net/myrna/index.shtml
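To make the map, shuffle, and reduce phases concrete, here is a toy single-process sketch that counts k-mers across sequencing reads. It is not the algorithm used by any of the tools linked above; the function names and the two sample reads are made up for illustration.

```ruby
# Map step: emit a [k-mer, 1] pair for every k-mer in one read.
def map_read(read, k)
  (0..read.length - k).map { |i| [read[i, k], 1] }
end

# Reduce step: sum the counts emitted for one k-mer.
def reduce_counts(kmer, counts)
  [kmer, counts.sum]
end

def kmer_count(reads, k)
  pairs = reads.flat_map { |read| map_read(read, k) }  # map
  grouped = pairs.group_by(&:first)                    # shuffle: group by key
  grouped.map { |kmer, kv| reduce_counts(kmer, kv.map(&:last)) }.to_h # reduce
end

counts = kmer_count(%w[ACGT CGTA], 3)
puts counts.inspect
```

Because each read is mapped independently and each k-mer is reduced independently, both phases can be spread across as many machines as the data requires.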
AWS knows massively
scalable infrastructure
you know the needs of
     the science
we can make this work
      together
deesingh@amazon.com
                                                            Twitter:@mndoci
                                               http://slideshare.net/mndoci




        Inspiration and ideas from
        Matt Wood, James Hamilton
               & Larry Lessig

Credit: Oberazzi under a CC-BY-NC-SA license
