THE MISSING MANUAL FOR DATA SCIENCE: REMIX. REUSE. REPRODUCE from Structure:Data 2013

Presentation from Matt Wood, Amazon Web Services
#dataconf
More at http://event.gigaom.com/structuredata/

Published in: Technology

  1. THE MISSING MANUAL FOR DATA SCIENCE: REMIX. REUSE. REPRODUCE. Speaker: Matt Wood, Principal Data Scientist, Amazon Web Services. Monday, April 1, 13
  2. The Missing Manual: Reproduce, Reuse, Remix. Dr. Matt Wood, matthew@amazon.com, @mza
  4. Hello.
  6. Data.
  7. Generation; Collection & storage; Analytics & computation; Collaboration & sharing
  9. Generation challenge.
  10. Amazing data generators: cell phones tracking cholera in Haiti (Linus Bengtsson et al., PLoS Medicine, 2011)
  11. Amazing data generators: social networks tracking influenza (You Are What You Tweet: Analyzing Twitter for Public Health. M. J. Paul and M. Dredze, 2011)
  12. Amazing data generators: web app logs targeting advertising. 500% return on ad spend
  15. Chromosome 11 : ACTN3 : rs1815739
  16. Chromosome X : rs6625163
  17. Chromosome 19 : FUT2 : rs601338
  18. Chromosome 2 : rs10427255
  19. Chromosome 10 : rs7903146 TYPE II
  20. Chromosome 15 : rs2472297 +0.25
  22. Generation challenge.
  23. Generation challenge. X
  24. Generation; Collection & storage; Analytics & computation; Collaboration & sharing
  27. Utility computing.
  31. Remove constraints.
  33. Analytics challenge.
  34. Analytics challenge. X
  35. Generation; Collection & storage; Analytics & computation; Collaboration & sharing
  37. Beautiful, unique.
  39. Impossible to recreate.
  41. Snowflake Data Science
  43. Reproducibility.
  45. Reproducibility scales data science.
  47. Reproduce. Reuse. Remix.
  49. Value++
  52. How do we get from here to there? 5 PRINCIPLES OF REPRODUCIBILITY
  53. 5 PRINCIPLES OF REPRODUCIBILITY
  54. 5 PRINCIPLES OF REPRODUCIBILITY: 1. Data has gravity
  56. Increasingly large data collections.
  58. Challenging to obtain and manage.
  60. Expensive to experiment.
  62. Large barrier to reproducibility.
  64. Move data to the users.
  65. Move data to the users. X
  67. Move tools to the data.
  69. Place data where it can be consumed by tools.
  71. Place tools where they can access data.
  77. More data, more users, more uses, more locations
  79. Cost
  81. Force multiplier.
  83. Cost and complexity kill reproducibility.
  84. 5 PRINCIPLES OF REPRODUCIBILITY
  85. 5 PRINCIPLES OF REPRODUCIBILITY: 2. Ease of use is a prerequisite
  86. http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html
  88. Help overcome the suck threshold.
  90. Easy to embrace and extend.
  92. Choose the right abstraction for the user.
  94. $ ec2-run-instances
  96. $ starcluster start
  99. Package and automate.
  101. Expert-as-a-service.
  104. 1000 Genomes Project; Cloud BioLinux
  106. 1000 Genomes Project + your genomic data; Illumina BaseSpace
  107. Cassandra; Aegisthus; Hadoop, Hive, Pig; Amazon S3; Legacy data warehousing. http://www.youtube.com/watch?v=oGcZ7WVx6EI
  108. Sting; MicroStrategy; R; Cassandra; Aegisthus; Hadoop, Hive, Pig; Amazon S3; Legacy data warehousing. http://www.youtube.com/watch?v=oGcZ7WVx6EI
  110. 5 PRINCIPLES OF REPRODUCIBILITY
  111. 5 PRINCIPLES OF REPRODUCIBILITY: 3. Reuse is as important as reproduction
  112. Seven Deadly Sins of Bioinformatics: http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics
  115. Data scientists are hackers.
  117. They have their own way of working.
  119. Beware the Big Red Button.
  121. Fire-and-forget reproduction is a good first step, but it limits longer-term value.
  123. Monolithic, one-stop-shop.
  125. Work well for intended purpose.
  127. Challenging to install, dependency heavy.
  129. Difficult to grok.
  131. Data scientists are hackers: embrace it.
  133. Small things. Loosely coupled.
  135. Easier to grok, reuse and integrate.
  137. Lower barrier to entry.
  138. 5 PRINCIPLES OF REPRODUCIBILITY
  139. 5 PRINCIPLES OF REPRODUCIBILITY: 4. Build for collaboration
  141. Workflows are memes.
  143. Reproduction is just the first step.
  145. Bill of materials: code, data, configuration, infrastructure.
  147. Full definition for reproduction.
  149. Utility computing provides a playground for data science.
  150. Code + AMI + custom datasets + public datasets + databases + compute + result data
  154. 5 PRINCIPLES OF REPRODUCIBILITY
  155. 5 PRINCIPLES OF REPRODUCIBILITY: 5. Provenance is a first class object
  157. Versioning becomes really important.
  159. Especially in an active community.
  161. Doubly so with loosely coupled tools.
  163. Provenance metadata is a first class entity.
  165. Distributed provenance.
  166. 5 PRINCIPLES OF REPRODUCIBILITY
  167. 5 PRINCIPLES OF REPRODUCIBILITY: 1. Data has gravity 2. Ease of use is a prerequisite 3. Reuse is as important as reproduction 4. Build for collaboration 5. Provenance is a first class object
  169. Thank you. matthew@amazon.com, aws.amazon.com, @mza
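
The "bill of materials" on slide 145 (code, data, configuration, infrastructure) and the provenance points on slides 157 to 165 could be sketched roughly as follows. This is an illustrative assumption, not something shown in the talk: the function names, the manifest fields, and the placeholder values such as `ami-EXAMPLE` are all hypothetical. Each component is recorded as a content hash, and a `parent` field links a remixed run back to the run it was derived from.

```python
import hashlib
import json
from typing import Optional


def digest(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


def canonical(obj: dict) -> bytes:
    """Serialize a dict deterministically so equal dicts hash equally."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_manifest(code: bytes, data: bytes, config: dict,
                   infrastructure: dict, parent: Optional[str] = None) -> dict:
    """Describe one analysis run as a bill of materials with a provenance link."""
    body = {
        "code": digest(code),
        "data": digest(data),
        "configuration": digest(canonical(config)),
        "infrastructure": digest(canonical(infrastructure)),
        "parent": parent,  # id of the manifest this run was derived from, if any
    }
    # The id is derived from the content, so any change yields a new version.
    return {"id": digest(canonical(body)), **body}


# Hypothetical inputs, for illustration only.
original = build_manifest(
    code=b"print('align reads')",
    data=b"sample-1,ACGT\n",
    config={"threads": 4},
    infrastructure={"image": "ami-EXAMPLE", "instance_type": "m1.large"},
)

# A "remix": same code and setup, different data, linked to its origin.
remix = build_manifest(
    code=b"print('align reads')",
    data=b"sample-2,TTGA\n",
    config={"threads": 4},
    infrastructure={"image": "ami-EXAMPLE", "instance_type": "m1.large"},
    parent=original["id"],
)

print(remix["parent"] == original["id"])
```

Because the identifier depends only on content, two people who build a manifest from the same four components get the same id without coordinating, which is one possible reading of "distributed provenance".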
