Platforms for data science

Life science research, data platforms and cloud computing

Published in: Technology

  1. Platforms for data science. Deepak Singh, Ph.D., Amazon Web Services. Data transmission for international genomics projects, 2010
  2. the new reality
  3. lots and lots and lots and lots and lots of data
  4. lots and lots and lots and lots and lots of people
  5. lots and lots and lots and lots and lots of places
  6. constant change
  7. science in a new reality
  8. science in a new reality
  9. data science in a new reality
  10. data as a programmable resource
  11. versioning
  12. provenance capture
  13. filter
  14. aggregate
  15. integrate
  16. extend
  17. mashup
  18. automate
  19. human interfaces
  20. tough problem
  21. really tough problem in the new reality
  22. goal
  23. optimize the most valuable resource
  24. compute, storage, workflows, memory, transmission, algorithms, cost, …
  25. people. Credit: Pieter Musterd, under a CC-BY-NC-ND license
  26. enter the cloud
  27. what is the cloud?
  28. infrastructure
  29. scalable
  30. highly available
  31. dynamic
  32. extensible
  33. secure
  34. a utility
  35. programmable
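  36. class Instance
        attr_accessor :aws_hash, :elastic_ip
        # Wrap the hash describing one EC2 instance, plus an optional Elastic IP.
        def initialize(hash, elastic_ip = nil)
          @aws_hash = hash
          @elastic_ip = elastic_ip
        end
        # Public DNS name, or an empty string if none has been assigned yet.
        def public_dns
          @aws_hash[:dns_name] || ""
        end
        # Not on the slide: friendly_name calls status, so this assumes the
        # hash carries the instance state under :aws_state.
        def status
          @aws_hash[:aws_state].to_s
        end
        # First label of the DNS name, falling back to the capitalized state.
        def friendly_name
          public_dns.empty? ? status.capitalize : public_dns.split(".")[0]
        end
        def id
          @aws_hash[:aws_instance_id]
        end
      end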
  37. include_recipe "packages"
      include_recipe "ruby"
      include_recipe "apache2"
      if platform?("centos","redhat")
        if dist_only?
          # just the gem, we'll install the apache module within apache2
          package "rubygem-passenger"
          return
        else
          package "httpd-devel"
        end
      else
        %w{ apache2-prefork-dev libapr1-dev }.each do |pkg|
          package pkg do
            action :upgrade
          end
        end
      end
      gem_package "passenger" do
        version node[:passenger][:version]
      end
      execute "passenger_module" do
        command 'echo -en "\n\n\n\n" | passenger-install-apache2-module'
        creates node[:passenger][:module_path]
      end
  38. a data science platform
  39. dataspaces. Further reading: Jeff Hammerbacher, "Information Platforms and the Rise of the Data Scientist", in Beautiful Data
  40. accept all data formats
  41. evolve APIs
  42. beyond the database and the data warehouse
  43. move compute to the data
  44. data is a royal garden
  45. compute is a fungible commodity
  46. “I terminate the instance and relaunch it. Thats my error handling” Source: @jtimberman on Twitter
  47. the cloud is an architectural and cultural fit for data science
  48. amazon web services
  49. your data science platform
  50. s3://1000genomes
  51. Credit: Angel Pizarro, U. Penn
  52. http://usegalaxy.org/cloud
  53. mapreduce for genomics: http://bowtie-bio.sourceforge.net/crossbow/index.shtml, http://contrail-bio.sourceforge.net, http://bowtie-bio.sourceforge.net/myrna/index.shtml
  54. AWS knows massively scalable infrastructure
  55. you know the needs of the science
  56. we can make this work together
  57. deesingh@amazon.com, Twitter: @mndoci, http://slideshare.net/mndoci. Inspiration and ideas from Matt Wood, James Hamilton & Larry Lessig. Credit: Oberazzi, under a CC-BY-NC-SA license
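
Slides 10 to 18 describe data as a programmable resource that can be versioned, filtered and aggregated while its provenance is captured. The deck shows no code for this, so the following is only a minimal Ruby sketch of the idea. The `Dataset` class, its methods, and the sample records are all hypothetical and do not come from the slides or from any AWS library.

```ruby
# Hypothetical sketch: none of these names appear in the slides.
# A Dataset is immutable; every operation returns a new Dataset with a
# higher version number and one more entry in its provenance trail.
class Dataset
  attr_reader :records, :version, :provenance

  def initialize(records, version = 1, provenance = [])
    @records = records.freeze
    @version = version
    @provenance = provenance.freeze
  end

  # Keep only the records for which the block returns true.
  def filter(label, &block)
    derive(records.select(&block), "filter: #{label}")
  end

  # Group records by a key and count the records in each group.
  def aggregate(key)
    counts = records.group_by { |r| r[key] }
                    .map { |value, rows| { key => value, :count => rows.size } }
    derive(counts, "aggregate: count by #{key}")
  end

  private

  # Build the next version and record the step that produced it.
  def derive(new_records, step)
    Dataset.new(new_records, version + 1, provenance + [step])
  end
end

# Made-up sample records, for illustration only.
reads = Dataset.new([
  { :sample => "A", :quality => 35 },
  { :sample => "A", :quality => 12 },
  { :sample => "B", :quality => 40 }
])

summary = reads.filter("quality >= 30") { |r| r[:quality] >= 30 }
               .aggregate(:sample)
```

Here `summary.provenance` lists the two steps in the order they were applied, and `reads` itself is left unchanged at version 1, so any earlier version can still be inspected or reprocessed.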
