The Big Data Exploratorium

  1. The Big Data Exploratorium: A guided tour of open source data analysis tools
     Noah Pepper (@noahmp), Devin Chalmers (@qwzybug)
     #exploratorium @osb11
     Thursday, June 23, 2011
  2. Hi,
     • We’re here because...
     • We are...
     • Data Exploration Is...
     • Example 1: Patents
       • (Chalmers et al. 2010; Buchanan et al. 2010; Pepper et al. 2008)
     • Example 2: Health Care
       • (Pepper et al. Visweek 2010)
  3. Hi,
     • Exploratorium #1
       • Patent citation networks
         • Graphviz
         • NetworkX
     • Exploratorium #2
       • Reddit comment word usages
  4. Hi,
     • Get the code & data samples:
       • git clone git@github.com:peppern/exploratorium.git
  5. We’re here because...
     • There is a really amazing OSS community in the data space.
     • This is fantastic news for academics, hobbyists, and professionals alike.
     • We want to show what you can do with open source tools, and show you the ones we like.
     • We’d love to hear what YOUR favorites are; use #exploratorium to tell us.
     • Data exploration is fun...
  6. We are... Noah Pepper - @noahmp, Devin Chalmers - @qwzybug
     • Academic Data Junkies: our academic home. Research focuses on exploring the nature of evolutionary activity through data mining.
     • We’re Sorta Lucky: our startup, where we build data exploration platforms.
  7. We Build Data Exploration Tools!
     map.clearhealthcosts.com
  8. What is data exploration and what is an exploratorium
     • Narrow Definition
       • Data exploration is having an iterative relationship with your data, analysis, and visualization stack where you build an intuitive cognitive model of the information visualized.
     • Why do I say visualization instead of the more general ‘representation’?
     • exploratorium, noun [usu. in names]: a scientific museum or similar center at which visitors have the opportunity of performing prearranged experiments or demonstrations.
     • Yes! That means there’s code and data
  9. Data Exploration Example
     • study evolution of technology in patent records
       – technology is a window on culture
       – patents are a window on technology
  10. Patent Networks
  11. Citation Analysis of Patents
  12. Time Series Text Analysis
  13. Some explorations are more open ended
  14. Pointwise Mutual Information (PMI): # patents that contain words x and y
  15. PMI distributions
      - see clusters
      - different kinds of clusters
  16. PMI Comparison: Plotting a different way
      [Figure: axes labeled PMI integral and halfway rank; example words “the”, “optical”, “cultivar”; annotation: generality of content?]
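
The slides name PMI and the count it is built on (the number of patents containing both words) but the formula itself did not survive in the transcript. A minimal sketch of the standard definition, PMI(x, y) = log [ p(x, y) / (p(x) p(y)) ], estimated from document counts, might look like this. The function name, the argument names, and the choice of log base 2 are assumptions for illustration, not taken from the talk.

```python
import math


def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information of words x and y from document counts.

    n_xy:    number of patents containing both x and y
    n_x:     number of patents containing x
    n_y:     number of patents containing y
    n_total: total number of patents

    Probabilities are estimated as count / n_total, so
    p(x, y) / (p(x) * p(y)) simplifies to n_xy * n_total / (n_x * n_y).
    Log base 2 is an assumption; any base only rescales the result.
    """
    if min(n_xy, n_x, n_y, n_total) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2((n_xy * n_total) / (n_x * n_y))


# Hypothetical counts: the pair co-occurs four times more often than chance.
associated = pmi(n_xy=20, n_x=100, n_y=50, n_total=1000)

# Hypothetical counts: the pair co-occurs exactly as often as chance predicts.
independent = pmi(n_xy=5, n_x=100, n_y=50, n_total=1000)
```

A positive value means the two words appear together more often than independence would predict, zero means they look independent, and a negative value means they avoid each other.
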
  17. btw, these are older graphs, now we use ggplot2
  18. Previous Work in Health Care...
      [Figure: bill volume (0 to 500,000) by adjudication type (AMB, ASC, DME, ER, IPH, OPH, PRO); legend: placement in distribution of billed, upper 5% and bottom 5%]
      ... with @homerstrong at Qmedtrix Systems Inc.
  19. Previous Work in Health Care...
      [Figure: two panels over Amount ($) on a log scale from 10^1 to 10^7; top: bill volume (0 to 120,000); bottom: dollar density (0 to 1.4e+09) for Billed, First Audit, and Second Audit]
      ... @hadleywickham is a #ballR http://had.co.nz
  20. Health Care Data & Code Samples...
      ...Hahaha Just Kidding
  21. But actually:
      • Qmedtrix R&D team members made source contributions, see:
        • Homer Strong https://github.com/strongh @homerstrong (Lucky Sort)
        • Kevin Lynagh https://github.com/lynaghk (Keming Labs)
  22. Exploratorium #1: Patent Networks
      citations amongst top 10k most cited patents
  23. Grab the graph data:
      ~/exploratorium/patents/toplinks.dot
      Graphviz Art is Pretty!
  24. GraphViz can graph really big graphs... but they get hard to use
      (image caption: Psychedelic Patents)
  25. Graphviz - Play with Graphs (http://www.graphviz.org)
      • sudo port install graphviz or sudo apt-get install graphviz
      • graphing commands: dot, neato, twopi, circo, fdp
      • dot -Tpdf -o file.pdf file.dot
      • More options here:
        • http://www.graphviz.org/content/command-line-invocation
      • Fun options are in the .dot file:
        • http://www.graphviz.org/content/dot-language
  26. Styling dots
      • node [shape=point, width="0.15", color="#0000001c"];
      • edge [arrowsize="0.50", color="#0000001c"];
      • There are tons, read the docs and have fun
      • You can also try more complex things
        • Like constraints, time for example
      • Sometimes too many constraints make GraphViz unhappy...
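
To make the styling slide concrete, here is a small sketch that writes a DOT graph carrying the `node` and `edge` attributes quoted above. It only builds a string, so it does not need Graphviz installed. The function name `edges_to_dot`, the graph name, and the three toy edges are assumptions for illustration; the real data lives in the repository's toplinks.dot.

```python
def edges_to_dot(edges, name="citations"):
    """Return DOT source for a directed graph with the slide's styling."""
    lines = [f"digraph {name} {{"]
    # Attribute lines copied from the "Styling dots" slide: tiny point nodes
    # and nearly transparent black ink, so dense regions show up as darker.
    lines.append('  node [shape=point, width="0.15", color="#0000001c"];')
    lines.append('  edge [arrowsize="0.50", color="#0000001c"];')
    for citer, citee in edges:
        lines.append(f'  "{citer}" -> "{citee}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical toy citation edges (citer, citee).
dot_source = edges_to_dot([("a", "b"), ("c", "b"), ("b", "d")])
print(dot_source)
```

Saving the printed text to a file such as toy.dot (a hypothetical name) and running `dot -Tpdf -o toy.pdf toy.dot` should render it; swapping `dot` for `neato` or `fdp` tries a different layout on the same file.
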
  28. UbiGraph
      • We loved UbiGraph, but don’t know an OSS alternative
      • Renders many nodes (50k+) in 3D with a realtime force-directed (FD) layout.
      • On a Mac Pro with 16 GB of RAM
      • Shout out to Apple: thank you for supporting our research!
      • It’s ‘free’, but development has stalled, and since it’s closed source we can’t build on it!
      • Alternatives?
  29. Exploratorium #2
      • Making graphs of language using python, redis, R and a bunch of awesome libraries
      • Thanks
        • @hadleywickham
        • @homerstrong
        • @antirez
        • Bryan Lewis (http://illposed.net/)
  30. ...how?
      Mine, Munge, Visualize
  31. ...how?
      github.com/peppern/exploratorium
      [ brew | apt-get | port ] install redis
      www.r-project.org
      github.com/qwzybug/rredis
      redis TTR package
  32. Best show on TV
  37. Mine the data
      • gutenberg.org
      • google.com/ngrams
      • APIs: Twitter, etc.
        • http://code.google.com/apis/socialgraph/
      • Scrape
  39. Store the data
      Postgres is not too shabby
  40. Store the data
      SELECT cite AS patent_num, count
      FROM (SELECT cite, count(*) AS count FROM citations GROUP BY cite) AS t1
      ORDER BY t1.count DESC LIMIT 10
  41. Store the data
      SELECT cite, count(*), year
      FROM citations
      INNER JOIN (SELECT date_part('year', grantdate) AS year, patent_num FROM patents) AS t1
        USING (patent_num)
      WHERE cite IN (12345)
      GROUP BY year, cite
  42. Store the data
      SELECT term, count
      FROM (SELECT term, count(*)
            FROM (SELECT patent_num, term FROM tfidfs WHERE tfidf > 0.05) AS "t1"
            INNER JOIN (SELECT *
                        FROM (SELECT patent_num FROM patent_lengths WHERE wordcount > 10) AS "t1"
                        INNER JOIN (SELECT * FROM patents
                                    WHERE grantdate > '1990-01-01' AND grantdate < '2000-01-01') AS "t2"
                          USING ("patent_num")) AS "t2"
              USING ("patent_num")
            GROUP BY "term") AS "t3"
      ORDER BY count DESC LIMIT 50;
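
For anyone who wants to try the first query without setting up Postgres, the following sketch runs the same "most cited patents" idea against an in-memory SQLite database from Python's standard library. SQLite is swapped in here purely for convenience; the talk used Postgres. The two-column `citations` table follows the column names in the slides, and the three rows are hypothetical toy data.

```python
import sqlite3

# In-memory database so the example leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citations (patent_num TEXT, cite TEXT)")

# Hypothetical toy data: (citing patent, cited patent).
conn.executemany(
    "INSERT INTO citations VALUES (?, ?)",
    [("a", "b"), ("c", "b"), ("b", "d")],
)

# Count how often each patent is cited and keep the ten most cited.
# The count column is aliased to n to avoid shadowing the count() function.
top_cited = conn.execute(
    """
    SELECT cite AS patent_num, count(*) AS n
    FROM citations
    GROUP BY cite
    ORDER BY n DESC
    LIMIT 10
    """
).fetchall()
conn.close()

print(top_cited)
```

With the toy rows above, patent b is cited twice and patent d once, so b comes first in the result.
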
  44. Store the data
      NoSQL is a good fit for web data
  48. Reshape the data
      citer  citee
      a      b
      c      b
      b      d
      { a : [b], c : [b], b : [d] }
      { b : [a, c], d : [b] }
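
The reshaping on this slide, from a two-column edge list to a "who does each patent cite" map and then to the inverted "who cites each patent" map, can be sketched in a few lines of Python. The function name `group_edges` and its `reverse` flag are assumptions for illustration; the edge list is the one shown on the slide.

```python
from collections import defaultdict

# The (citer, citee) rows from the slide.
edges = [("a", "b"), ("c", "b"), ("b", "d")]


def group_edges(edges, reverse=False):
    """Group an edge list into a dict of lists.

    With reverse=False the keys are citers and the values are the patents
    they cite. With reverse=True the keys are citees and the values are
    the patents that cite them.
    """
    grouped = defaultdict(list)
    for citer, citee in edges:
        key, value = (citee, citer) if reverse else (citer, citee)
        grouped[key].append(value)
    return dict(grouped)


cites = group_edges(edges)
cited_by = group_edges(edges, reverse=True)

print(cites)     # {'a': ['b'], 'c': ['b'], 'b': ['d']}
print(cited_by)  # {'b': ['a', 'c'], 'd': ['b']}
```

Both maps come from a single pass over the rows, which is the kind of reshaping that gets awkward in SQL and is trivial once the data sits in key/value structures.
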
  49. Redis
      In-Memory Data Structure Server
  51. Redis
      • HSET key name value
      • SADD key value
      • ZUNIONSTORE
      • HSETNX
      • BRPOPLPUSH
      • …
  58. Redis
      Global variable for all your programs
      Memcached with structure
      Really fast
      Really really fast
      Really, really, astonishingly fast
      No, faster than that
  72. Reddit
      • Count words by hour
        ZSET subreddit:2011-06-21:12
          word [count]
      • Comment network
        SET thread_id:comments
          “parent_id:child_id”
      • User network
        SET thread_id:users
          “parent_id:child_id”
      SET subreddit:threads
        thread_id
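
The key layout on this slide can be turned into a sketch of which Redis commands a scraper would issue. To stay runnable without a Redis server, the functions below only build command tuples; `ZINCRBY key increment member` and `SADD key member` are real Redis commands, but the function names, the tokenizer, and the reading of `subreddit` and `thread_id` in the key names as placeholders for actual values are assumptions, not the code from the talk's repository.

```python
import re
from datetime import datetime, timezone


def hour_key(subreddit, created_utc):
    """Build a key shaped like 'subreddit:2011-06-21:12' (date, then hour)."""
    ts = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return f"{subreddit}:{ts.strftime('%Y-%m-%d:%H')}"


def word_count_commands(subreddit, created_utc, body):
    """One ZINCRBY per word: bump that word's score in the hourly sorted set."""
    key = hour_key(subreddit, created_utc)
    # Naive tokenizer (an assumption): lowercase runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", body.lower())
    return [("ZINCRBY", key, 1, word) for word in words]


def comment_edge_command(thread_id, parent_id, child_id):
    """Record a reply edge as a 'parent_id:child_id' member of the thread's set."""
    return ("SADD", f"{thread_id}:comments", f"{parent_id}:{child_id}")


# Hypothetical comment posted at 12:30 UTC on 2011-06-21.
when = datetime(2011, 6, 21, 12, 30, tzinfo=timezone.utc).timestamp()
commands = word_count_commands("programming", when, "Go forth and graph")
edge = comment_edge_command("t3_abc", "c1", "c2")
```

Each tuple maps directly onto one call in a Redis client, and because the sorted set keeps words ordered by count, the top words for any hour can be read back with a single range query.
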
  73. Reddit
      github.com/peppern/exploratorium
      [ brew | apt-get | port ] install redis
      www.r-project.org
      github.com/qwzybug/rredis
      redis TTR package
  74. Reddit
      (demo)
  77. Reddit
      Go forth and graph!
      #exploratorium #osb11
      We will hire you.
      For reals.
  78. You Are Now Leaving the Big Data Exploratorium
      Please ensure you have your valuables.
      Noah Pepper @noahmp, Devin Chalmers @qwzybug
      #exploratorium #osb11