Platforms for Data Science
Deepak Singh, Principal Product Manager
bioinformatics
collection
curation
analysis
so what?
Image:Yael Fitzpatrick (AAAS)
lots of data
lots of people
lots of places
constant change
we want to make our data more effective
versioning
provenance
filter
Via asklar under a CC-BY license
aggregate
Image: Chris Heiler
extend
Image: Bethan
human interfaces
Image: Sebastian Anthony
share
communicate
Image: Leo Reynolds
hard problem
really hard problem
so how do we get there?
information platforms
Image: Drew Conway
dataspaces
Further reading: Jeff Hammerbacher, Information Platforms and the Rise of the Data Scientist, Beautiful Data
the unreasonable effectiveness of data
Halevy et al., IEEE Intelligent Systems 24, 8-12 (2009)
accept all data formats
evolve APIs
data as a programmable resource
data is a royal garden
compute is a fungible commodity
constraints everywhere
Hardware: CPU, storage, memory
Data management: collections, datasets, provenance
Software: parallelization, optimization
Availability: backup, redundant, replicated
Cost: small
remove constraints
Credit: Pieter Musterd, under a CC-BY-NC-ND license
amazon web services
your infrastructure
ec2-run-instances
on demand, global, secure
programmable
elastic
Netflix needed to transcode 17,000 titles (80TB of data) to support the launch of Sony PS3. They provisioned 1200 Amazon EC2 instances and completed the transcoding process in just days.
Source: Adrian Cockcroft (Netflix)
durable
99.999999999%
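As a back-of-envelope sketch of what eleven nines of annual durability implies (the 10-million-object archive size below is a made-up figure, not from the talk):

```python
# What 99.999999999% ("eleven nines") annual durability works out to.
annual_durability = 0.99999999999
p_loss = 1 - annual_durability                # per-object annual loss probability, ~1e-11
objects = 10_000_000                          # hypothetical archive size
expected_losses_per_year = objects * p_loss   # ~0.0001 objects lost per year
years_per_single_loss = 1 / expected_losses_per_year
print(f"expected losses per year: {expected_losses_per_year:.4g}")
print(f"roughly one object lost every {years_per_single_loss:,.0f} years")
```

In other words, even a ten-million-object archive would expect to lose a single object about once every ten thousand years.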
I did say data was a royal garden
performance
“Our 40-instance (m2.2xlarge) cluster can scan, filter, and aggregate 1 billion rows in 950 milliseconds.”
Mike Driscoll, Metamarkets
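Taking the quoted figures at face value (1 billion rows in 950 milliseconds across 40 instances), the per-instance scan rate is easy to back out:

```python
# Back-of-envelope throughput from the quoted figures:
# 1 billion rows scanned, filtered, and aggregated in 950 ms on 40 instances.
rows = 1_000_000_000
seconds = 0.95
instances = 40
cluster_rows_per_sec = rows / seconds                    # just over 1 billion rows/s
per_instance_rows_per_sec = cluster_rows_per_sec / instances
print(f"cluster: {cluster_rows_per_sec:,.0f} rows/s")
print(f"per instance: {per_instance_rows_per_sec:,.0f} rows/s")
```

That is roughly 26 million rows per second per instance, sustained across the cluster.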
WIEN2K Parallel Performance (H size 56,000, 25GB; runtime on 16x8 processors):
Local (Infiniband): 3h:48
Cloud (10Gbps): 1h:30 ($40)
1200-atom unit cell; SCALAPACK+MPI diagonalization, matrix size 50k-100k
Credit: K. Jorissen, F. D. Villa, and J. J. Rehr (U. Washington)
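The WIEN2K runtimes reported above (local Infiniband cluster 3h:48 versus cloud 1h:30) imply a simple speedup, sketched here as arithmetic:

```python
# Speedup implied by the reported WIEN2K runtimes.
local_min = 3 * 60 + 48    # 228 minutes on the local Infiniband cluster
cloud_min = 1 * 60 + 30    # 90 minutes on the 10Gbps cloud cluster ($40 run)
speedup = local_min / cloud_min
print(f"cloud run was {speedup:.2f}x faster")
```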
“Our tests have shown more than 90 percent scaling efficiency on clusters of up to 128 GPUs each”
consumption models
on-demand
Reserved Instances
what is the value of your data?
the cloud's biggest value
remove constraints
Image: Chris Dagdigian
Credit: Angel Pizarro, U. Penn
13k sequences - 10 min - 0.1s per sequence
mapreduce for genomics
http://bowtie-bio.sourceforge.net/crossbow/index.shtml
http://contrail-bio.sourceforge.net
http://bowtie-bio.sourceforge.net/myrna/index.shtml
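Tools like Crossbow and Myrna apply the MapReduce pattern to sequencing data. As a toy sketch of that pattern (not the tools' actual code; the reads and k-mer size are invented): each read is mapped to key-value pairs, then a reduce step aggregates counts per key.

```python
# Toy MapReduce over sequence reads: map reads to k-mers, reduce by counting.
from collections import Counter
from itertools import chain

def map_kmers(read, k=3):
    """Map step: emit every k-mer in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def reduce_counts(mapped):
    """Reduce step: sum occurrences of each k-mer across all reads."""
    return Counter(chain.from_iterable(mapped))

reads = ["GATTACA", "TTACAGA"]   # hypothetical reads
counts = reduce_counts(map_kmers(r) for r in reads)
print(counts["TAC"])             # "TAC" occurs once in each read
```

The real tools shard the map and reduce steps across a Hadoop cluster; the data flow is the same.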
30,472 cores
$1279/hr
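Dividing the two figures above (30,472 cores for $1279/hr) gives the effective price per core-hour:

```python
# Effective cost per core-hour for the 30,472-core cluster at $1,279/hr.
cores = 30_472
dollars_per_hour = 1_279
per_core_hour = dollars_per_hour / cores
print(f"${per_core_hour:.4f} per core-hour")   # about 4.2 cents
```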
http://cloudbiolinux.org/
http://usegalaxy.org/cloud
“The process of moving StarMolsim over to the cloud to support the “Introduction to Modeling and Simulation” course at MIT was a huge success. The cloud enabled the STAR group to move away from the responsibility of owning and maintaining dedicated hardware and instead focus on their core mission of developing software and services for faculty, students, and researchers at MIT.”
http://web.mit.edu/stardev/cluster/about.html
in summary
large scale data requires a rethink
data architecture
compute architecture
distributed, programmable infrastructure
amazon web services
remove constraints
can we build data science platforms?
there is no magic
there is only awesome
deesingh@amazon.com
Twitter: @mndoci
http://slideshare.net/mndoci
http://mndoci.com
Inspiration and ideas from Matt Wood & Larry Lessig
Credit: Oberazzi under a CC-BY-NC-SA license
Talk at West Coast Association of Shared Resource Directors