Platforms for Data Science - Computing on the Brink

  1. There is no magic. There is only awesome. Platforms for data science. Deepak Singh
  2. bioinformatics (image: Ethan Hein)
  3. 3
  4. collection
  5. curation
  6. analysis
  7. what’s the big deal?
  8. Source: http://www.nature.com/news/specials/bigdata/index.html
  9. Image: Yael Fitzpatrick (AAAS)
  10. Image: Yael Fitzpatrick (AAAS)
  11. lots of data
  12. lots of people
  13. lots of places
  14. constant change
  15. we want to make our data more effective
  16. versioning
  17. provenance
  18. filter
  19. aggregate
  20. extend
  21. mashup
  22. human interfaces
  23. image: Leo Reynolds
  24. hard problem
  25. really hard problem
  26. so how do we get there?
  27. information platforms
  28. Image: Drew Conway
  29. dataspaces. Further reading: Jeff Hammerbacher, "Information Platforms and the Rise of the Data Scientist", in Beautiful Data
  30. the unreasonable effectiveness of data. Halevy et al., IEEE Intelligent Systems, 24, 8-12 (2009)
  31. accept all data formats
  32. evolve APIs
  33. beyond databases and the data warehouse
  34. data as a programmable resource
  35. data is a royal garden
  36. compute is a fungible commodity
  37. optimizing the most valuable resource
  38. compute, storage, workflows, memory, transmission, algorithms, cost, …
  39. people (Credit: Pieter Musterd, under a CC-BY-NC-ND license)
  40. Image: Chris Dagdigian
  41. my bias
  42. cloud services
  43. distributed systems
  44. scale
  45. global
  46. consumption models
  47. on-demand
  48. what is the value of your data?
  49. Credit: Angel Pizarro, U. Penn
  50. mapreduce for genomics: http://bowtie-bio.sourceforge.net/crossbow/index.shtml, http://contrail-bio.sourceforge.net, http://bowtie-bio.sourceforge.net/myrna/index.shtml
  51. Bioproximity: http://aws.amazon.com/solutions/case-studies/bioproximity/
  52. 30,472 cores
  53. $1279/hr
  54. http://cloudbiolinux.org/
  55. http://usegalaxy.org/cloud
  56. in summary
  57. large scale data requires a rethink
  58. data architecture
  59. compute architecture
  60. distributed, programmable infrastructure
  61. cloud services
  62. remove constraints
  63. can we build data science platforms?
  64. there is no magic, there is only awesome
  65. deesingh@amazon.com, Twitter: @mndoci, http://slideshare.net/mndoci, http://mndoci.com. Inspiration and ideas from Matt Wood & Larry Lessig. Credit: Oberazzi, under a CC-BY-NC-SA license
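
Slides 52 and 53 quote a cluster of 30,472 cores at $1279/hr. As a rough illustration of the "compute is a fungible commodity" point, the sketch below works out what that implies per core-hour. The two figures come from the slides; the function names, the 3-hour run length, and the assumption that cost scales linearly with time are hypothetical and not from the talk.

```python
# Back-of-the-envelope cost arithmetic for the cluster on slides 52-53.
# CORES and CLUSTER_USD_PER_HOUR are the slide figures; everything else
# (function names, the 3-hour run, linear pricing) is an assumption.

CORES = 30_472
CLUSTER_USD_PER_HOUR = 1279.0


def cost_per_core_hour(cluster_usd_per_hour: float, cores: int) -> float:
    """Hourly cluster price divided evenly across all cores."""
    return cluster_usd_per_hour / cores


def run_cost(cluster_usd_per_hour: float, hours: float) -> float:
    """Total cost of a run, assuming the price scales linearly with time."""
    return cluster_usd_per_hour * hours


if __name__ == "__main__":
    per_core = cost_per_core_hour(CLUSTER_USD_PER_HOUR, CORES)
    print(f"about {per_core * 100:.1f} cents per core-hour")

    hypothetical_hours = 3  # assumed run length, not a figure from the talk
    total = run_cost(CLUSTER_USD_PER_HOUR, hypothetical_hours)
    print(f"a {hypothetical_hours}-hour run would cost ${total:,.0f}")
```

At roughly 4.2 cents per core-hour, the question on slide 48 ("what is the value of your data?") becomes a concrete comparison between the price of a run and the value of its result.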