THE MISSING MANUAL FOR DATA SCIENCE: REMIX. REUSE. REPRODUCE from Structure:Data 2013

Presentation from Matt Wood, Amazon Web Services
#dataconf
More at http://event.gigaom.com/structuredata/

Transcript of "THE MISSING MANUAL FOR DATA SCIENCE: REMIX. REUSE. REPRODUCE from Structure:Data 2013"

  1. THE MISSING MANUAL FOR DATA SCIENCE: REMIX. REUSE. REPRODUCE. Speaker: Matt Wood, Principal Data Scientist, Amazon Web Services. Monday, April 1, 13
  2. The Missing Manual: Reproduce, Reuse, Remix. Dr. Matt Wood, matthew@amazon.com, @mza
  4. Hello.
  6. Data.
  7. Generation / Collection & storage / Analytics & computation / Collaboration & sharing
  9. Generation challenge.
  10. Amazing data generators: cell phones tracking cholera in Haiti (Linus Bengtsson et al., PLoS Medicine, 2011)
  11. Amazing data generators: social networks tracking influenza ("You Are What You Tweet: Analyzing Twitter for Public Health", M. J. Paul and M. Dredze, 2011)
  12. Amazing data generators: web app logs targeting advertising (500% return on ad spend)
  15. Chromosome 11 : ACTN3 : rs1815739
  16. Chromosome X : rs6625163
  17. Chromosome 19 : FUT2 : rs601338
  18. Chromosome 2 : rs10427255
  19. Chromosome 10 : rs7903146 (TYPE II)
  20. Chromosome 15 : rs2472297 (+0.25)
  22. Generation challenge.
  23. Generation challenge. [X]
  24-25. Generation / Collection & storage / Analytics & computation / Collaboration & sharing
  27. Utility computing.
  31. Remove constraints.
  33. Analytics challenge.
  34. Analytics challenge. [X]
  35. Generation / Collection & storage / Analytics & computation / Collaboration & sharing
  37. Beautiful, unique.
  39. Impossible to recreate.
  41. Snowflake Data Science
  43. Reproducibility.
  45. Reproducibility scales data science.
  47. Reproduce. Reuse. Remix.
  49. Value++
  52. How do we get from here to there? 5 Principles of Reproducibility
  53. 5 Principles of Reproducibility
  54. 5 Principles of Reproducibility: 1. Data has gravity
  56. Increasingly large data collections.
  58. Challenging to obtain and manage.
  60. Expensive to experiment.
  62. Large barrier to reproducibility.
  64. Move data to the users.
  65. Move data to the users. [X]
  67. Move tools to the data.
  69. Place data where it can be consumed by tools.
  71. Place tools where they can access data.
  77. More data, more users, more uses, more locations.
  79. Cost.
  81. Force multiplier.
  83. Cost and complexity kill reproducibility.
  84. 5 Principles of Reproducibility
  85. 5 Principles of Reproducibility: 2. Ease of use is a prerequisite
  86. http://headrush.typepad.com/creating_passionate_users/2005/10/getting_users_p.html
  88. Help overcome the suck threshold.
  90. Easy to embrace and extend.
  92. Choose the right abstraction for the user.
  94. $ ec2-run-instances
  96. $ starcluster start
  99. Package and automate.
  101. Expert-as-a-service.
  104. 1000 Genomes Project / Cloud BioLinux
  106. 1000 Genomes Project + your genomic data / Illumina BaseSpace
  107. Cassandra / Aegisthus / Hadoop, Hive, Pig / Amazon S3 / Legacy data warehousing (http://www.youtube.com/watch?v=oGcZ7WVx6EI)
  108. Sting / MicroStrategy / R / Cassandra / Aegisthus / Hadoop, Hive, Pig / Amazon S3 / Legacy data warehousing (http://www.youtube.com/watch?v=oGcZ7WVx6EI)
  110. 5 Principles of Reproducibility
  111. 5 Principles of Reproducibility: 3. Reuse is as important as reproduction
  112-113. Seven Deadly Sins of Bioinformatics: http://www.slideshare.net/dullhunk/the-seven-deadly-sins-of-bioinformatics
  115. Data scientists are hackers.
  117. They have their own way of working.
  119. Beware the Big Red Button.
  121. Fire and forget reproduction is a good first step, but limits longer term value.
  123. Monolithic, one-stop-shop.
  125. Work well for intended purpose.
  127. Challenging to install, dependency heavy.
  129. Difficult to grok.
  131. Data scientists are hackers: embrace it.
  133. Small things. Loosely coupled.
  135. Easier to grok, reuse and integrate.
  137. Lower barrier to entry.
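
One way to picture "Small things. Loosely coupled." (slides 133 to 137) is a pipeline built from single-purpose functions that can be reused on their own or recombined. The sketch below is illustrative only: the step names (drop_missing, clip_negative, normalise) and the compose helper are hypothetical and do not come from the talk.

```python
from functools import reduce
from typing import Callable, List, Optional


# Hypothetical single-purpose steps. Each takes a list and returns a new list,
# so any step can be tested, reused, or swapped out independently.
def drop_missing(values: List[Optional[float]]) -> List[float]:
    return [v for v in values if v is not None]


def clip_negative(values: List[float]) -> List[float]:
    return [max(v, 0.0) for v in values]


def normalise(values: List[float]) -> List[float]:
    total = sum(values)
    return [v / total for v in values] if total else list(values)


def compose(*steps: Callable) -> Callable:
    """Chain steps left to right into a single callable."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


# Reproduce: run the original pipeline.
pipeline = compose(drop_missing, clip_negative, normalise)
result = pipeline([3.0, None, -1.0, 1.0])

# Remix: reuse two of the steps in a different workflow.
remixed = compose(drop_missing, normalise)
```

Because each step only agrees on a simple input and output shape, someone else can pick up one piece without installing or understanding the whole.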
  138. 5 Principles of Reproducibility
  139. 5 Principles of Reproducibility: 4. Build for collaboration
  141. Workflows are memes.
  143. Reproduction is just the first step.
  145. Bill of materials: code, data, configuration, infrastructure.
  147. Full definition for reproduction.
  149. Utility computing provides a playground for data science.
  150-153. Code + AMI + custom datasets + public datasets + databases + compute + result data
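
The bill of materials on slides 145 to 153 could be captured as a small, serialisable manifest. This is a minimal sketch, not a format proposed in the talk: the field names mirror the items listed on slide 150, while the class name, the JSON layout, and every example value (repository URL, image ID, bucket paths, cluster description) are placeholders invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class BillOfMaterials:
    """Hypothetical manifest listing everything needed to rerun an analysis."""
    code: str                                   # where the code lives, pinned to a revision
    ami: str                                    # machine image the code ran on
    custom_datasets: List[str] = field(default_factory=list)
    public_datasets: List[str] = field(default_factory=list)
    databases: List[str] = field(default_factory=list)
    compute: str = ""                           # free-text description of the compute used
    result_data: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "BillOfMaterials":
        return cls(**json.loads(text))


# All values below are placeholders, not real resources.
bom = BillOfMaterials(
    code="https://example.org/analysis.git@REVISION",
    ami="ami-PLACEHOLDER",
    custom_datasets=["s3://example-bucket/my-samples/"],
    public_datasets=["s3://example-public-data/"],
    databases=["example-annotations-db"],
    compute="4-node cluster (placeholder)",
    result_data=["s3://example-bucket/results/"],
)
restored = BillOfMaterials.from_json(bom.to_json())
```

Sharing the manifest alongside the results gives a collaborator a full definition to reproduce from, and a starting point to edit when they want to reuse or remix the workflow.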
  154. 5 Principles of Reproducibility
  155. 5 Principles of Reproducibility: 5. Provenance is a first class object
  157. Versioning becomes really important.
  159. Especially in an active community.
  161. Doubly so with loosely coupled tools.
  163. Provenance metadata is a first class entity.
  165. Distributed provenance.
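
Slides 157 to 165 do not say how provenance should be recorded. One possible reading, sketched below under that caveat, is to give every workflow step a record whose ID is a hash of its own contents (tool versions, input digests, output digest, parent records). The function names and record fields are assumptions made for this example; only Python's standard hashlib and json modules are used.

```python
import hashlib
import json
from typing import Dict, List


def digest(data: bytes) -> str:
    """SHA-256 hex digest, used for both data and record IDs."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(step: str,
                      tool_versions: Dict[str, str],
                      input_digests: List[str],
                      output: bytes,
                      parents: List[str]) -> dict:
    """Build a hypothetical provenance record with a content-derived ID."""
    body = {
        "step": step,
        "tool_versions": tool_versions,
        "inputs": sorted(input_digests),
        "output": digest(output),
        "parents": sorted(parents),
    }
    # Canonical JSON (sorted keys, fixed separators) so the same facts
    # always hash to the same ID, whoever creates the record.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return {"id": digest(canonical.encode("utf-8")), **body}


raw = b"sample,value\na,1\nb,2\n"
cleaned = b"sample,value\na,1\n"

first = provenance_record("clean", {"cleaner": "1.0"}, [digest(raw)], cleaned, [])
second = provenance_record("summarise", {"summariser": "0.3"},
                           [first["output"]], b"n=1\n", [first["id"]])
```

Because IDs depend only on record contents, records can be created independently by different people and still link together, which is one way to interpret "distributed provenance". Changing a tool version changes the ID, so versioning is visible in the chain.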
  166. 5 Principles of Reproducibility
  167. 5 Principles of Reproducibility:
       1. Data has gravity
       2. Ease of use is a prerequisite
       3. Reuse is as important as reproduction
       4. Build for collaboration
       5. Provenance is a first class object
  169. Thank you. matthew@amazon.com, aws.amazon.com, @mza