The Big Data Exploratorium OSB 2011
Slides from the 2011 Open Source Bridge talk from @noahmp and @qwzybug


Published in: Technology
Transcript

  • 1. The Big Data Exploratorium: A guided tour of open source data analysis tools. Noah Pepper (@noahmp), Devin Chalmers (@qwzybug). #exploratorium @osb11. Thursday, June 23, 2011
  • 2. Hi,
      • We’re here because...
      • We are...
      • Data Exploration Is...
      • Example 1: Patents (Chalmers et al. 2010; Buchanan et al. 2010; Pepper et al. 2008)
      • Example 2: Health Care (Pepper et al. Visweek 2010)
  • 3. Hi,
      • Exploratorium #1: Patent citation networks (Graphviz, NetworkX)
      • Exploratorium #2: Reddit comment word usages
  • 4. Hi,
      • Get the code & data samples:
      • git clone git@github.com:peppern/exploratorium.git
  • 5. We’re here because...
      • There is a really amazing OSS community in the data space.
      • This is fantastic news for academics, hobbyists, and professionals alike.
      • We want to show what you can do with open source tools, and show you the ones we like.
      • We’d love to hear what YOUR favorites are; use #exploratorium to tell us.
      • Data exploration is fun...
  • 6. We are... Noah Pepper - @noahmp, Devin Chalmers - @qwzybug
      • Academic Data Junkies: Our academic home. Research focuses on exploring the nature of evolutionary activity through data mining
      • We’re Sorta Lucky: Our startup where we build data exploration platforms
  • 7. We Build Data Exploration Tools! map.clearhealthcosts.com
  • 8. What is data exploration and what is an exploratorium
      • Narrow Definition
      • Data exploration is having an iterative relationship with your data, analysis, and visualization stack where you build an intuitive cognitive model of the information visualized.
      • Why do I say visualization instead of the more general ‘representation’?
      • exploratorium, noun [usu. in names]: a scientific museum or similar center at which visitors have the opportunity of performing prearranged experiments or demonstrations.
      • Yes! That means there’s code and data
  • 9. Data Exploration Example
      • study evolution of technology in patent records
          – technology is a window on culture
          – patents are a window on technology
  • 10. Patent Networks
  • 11. Citation Analysis of Patents
  • 12. Time Series Text Analysis
  • 13. Some explorations are more open ended
  • 14. Pointwise Mutual Information (PMI): # patents that contain words x and y
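The formula on this slide did not survive transcription. As a rough sketch, the standard definition PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ) can be estimated from document counts such as "# patents that contain words x and y". The function name, argument names, and the choice of base-2 logarithm below are assumptions for illustration, not taken from the talk's code.

```python
import math


def pmi(n_xy, n_x, n_y, n_total):
    """Estimate pointwise mutual information from document counts.

    n_xy    -- number of patents containing both words x and y
    n_x     -- number of patents containing word x
    n_y     -- number of patents containing word y
    n_total -- total number of patents in the corpus

    All names are hypothetical; only the PMI definition itself is standard.
    """
    if min(n_xy, n_x, n_y, n_total) <= 0:
        raise ValueError("all counts must be positive")
    # p(x,y) / (p(x) * p(y)) simplifies to n_xy * n_total / (n_x * n_y).
    return math.log2(n_xy * n_total / (n_x * n_y))


if __name__ == "__main__":
    # Words that co-occur exactly as often as chance predicts score 0.
    print(pmi(10, 100, 100, 1000))
    # Words that co-occur five times more often than chance score log2(5).
    print(pmi(50, 100, 100, 1000))
```

A positive score means the two words appear together more often than independence would predict; a negative score means less often.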
  • 15. PMI distributions - see clusters - different kinds of clusters
  • 16. PMI Comparison: Plotting a different way [plot: PMI integral halfway rank, with example words “the”, “optical”, and “cultivar”] - generality of content?
  • 17. btw, these are older graphs, now we use ggplot2
  • 18. Previous Work in Health Care... [chart: bill volume by adjudication type (AMB, ASC, DME, ER, IPH, OPH, PRO); placement in distribution of billed: upper 5% vs. bottom 5%] ... with @homerstrong at Qmedtrix Systems Inc.
  • 19. Previous Work in Health Care... [charts: bill volume and dollar density vs. amount ($), axis from 10^1 to 10^7; series: Billed, First Audit, Second Audit] ... @hadleywickham is a #ballR http://had.co.nz
  • 20. Health Care Data & Code Samples... ...Hahaha Just Kidding
  • 21. But actually:
      • Qmedtrix R&D team members made source contributions, see:
      • Homer Strong https://github.com/strongh @homerstrong (Lucky Sort)
      • Kevin Lynagh https://github.com/lynaghk (Keming Labs)
  • 22. Exploratorium #1: Patent Networks - citations amongst top 10k most cited patents
  • 23. Grab the graph data: ~/exploratorium/patents/toplinks.dot - Graphviz Art is Pretty!
  • 24. GraphViz can graph really big graphs... but they get hard to use. Psychedelic Patents
  • 25. Graphviz - Play with Graphs (http://www.graphviz.org)
      • sudo port install graphviz or sudo apt-get install graphviz
      • graphing commands: dot, neato, twopi, circo, fdp
      • dot -Tpdf file.dot -o file.pdf
      • More options here: http://www.graphviz.org/content/command-line-invocation
      • Fun options are in the .dot file: http://www.graphviz.org/content/dot-language
  • 26. Styling dots
      • node [shape=point, width="0.15",color="#0000001c"];
      • edge [arrowsize="0.50", color="#0000001c"];
      • There are tons, read the docs and have fun
      • You can also try more complex things
      • Like constraints, time for example
      • Sometimes too many constraints make GraphViz unhappy...
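As a minimal sketch of how the styling lines above fit into a complete .dot file, the following Python writes a small citation graph in the DOT language using the same node and edge attributes. The function names, graph name, and sample edges are hypothetical; the repository's own toplinks.dot may be laid out differently.

```python
import subprocess


def to_dot(edges, name="citations"):
    """Return DOT source for a directed graph of (citer, citee) pairs.

    The node/edge attribute lines are copied from the slide; the rest of
    the layout is an illustrative assumption.
    """
    lines = [
        f"digraph {name} {{",
        '  node [shape=point, width="0.15", color="#0000001c"];',
        '  edge [arrowsize="0.50", color="#0000001c"];',
    ]
    for citer, citee in edges:
        lines.append(f'  "{citer}" -> "{citee}";')
    lines.append("}")
    return "\n".join(lines) + "\n"


def render_pdf(dot_path, pdf_path, layout="dot"):
    """Render a .dot file to PDF; requires Graphviz on the PATH.

    layout may be any of the commands listed on the slide:
    dot, neato, twopi, circo, fdp.
    """
    subprocess.run([layout, "-Tpdf", dot_path, "-o", pdf_path], check=True)


if __name__ == "__main__":
    # Hypothetical sample edges: a cites b, c cites b, b cites d.
    print(to_dot([("a", "b"), ("c", "b"), ("b", "d")]))
```

Saving the printed text as a .dot file and calling `render_pdf` with a different `layout` value is a quick way to compare the layout engines on the same graph.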
  • 28. UbiGraph
      • We loved UbiGraph, but don’t know an OSS alternative
      • Renders many nodes (50k+) in 3D with a realtime force-directed layout
      • 16 GB of RAM Mac Pro
      • Shout out to Apple: thank you for supporting our research!
      • It’s ‘free’ but development has stalled and since it’s closed source we can’t build on it!
      • Alternatives?
  • 29. Exploratorium #2
      • Making graphs of language using Python, Redis, R and a bunch of awesome libraries
      • Thanks: @hadleywickham, @homerstrong, @antirez, Bryan Lewis (http://illposed.net/)
  • 30. ...how? Mine - Munge - Visualize
  • 31. ...how?
      • github.com/peppern/exploratorium
      • [ brew | apt-get | port ] install redis
      • www.r-project.org
      • github.com/qwzybug/rredis
      • redis
      • TTR package
  • 32. Best show on TV
  • 37. Mine the data
      • gutenberg.org
      • google.com/ngrams
      • APIs - Twitter, etc.
      • http://code.google.com/apis/socialgraph/
      • Scrape
  • 39. Store the data: Postgres is not too shabby
  • 40. Store the data
      -- Ten most-cited patents
      SELECT cite AS patent_num, count
      FROM (SELECT cite, count(*) AS count FROM citations GROUP BY cite) AS t1
      ORDER BY t1.count DESC LIMIT 10
  • 41. Store the data
      -- Citations of patent 12345, grouped by year
      SELECT "cite", count(*), "year"
      FROM "citations"
      INNER JOIN (SELECT date_part('year', "grantdate") AS "year", "patent_num" AS "patent_num" FROM "patents") AS "t1"
        USING ("patent_num")
      WHERE (cite IN (12345))
      GROUP BY "year", "cite"
  • 42. Store the data
      -- 50 most common terms (tfidf > 0.05) in 1990s patents longer than 10 words
      SELECT term, count
      FROM (SELECT term, count(*)
            FROM (SELECT patent_num, term FROM tfidfs WHERE (tfidf > 0.05)) AS "t1"
            INNER JOIN (SELECT *
                        FROM (SELECT patent_num FROM patent_lengths WHERE (wordcount > 10)) AS "t1"
                        INNER JOIN (SELECT * FROM patents
                                    WHERE (grantdate > '1990-01-01' AND grantdate < '2000-01-01')) AS "t2"
                          USING ("patent_num")) AS "t2"
              USING ("patent_num")
            GROUP BY "term") AS "t3"
      ORDER BY count DESC LIMIT 50;
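To try the first of these queries without a Postgres server, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in. The two-column `citations` schema (`patent_num` cites `cite`) is inferred from the queries above, and the sample rows and the `top_cited` helper are made up for illustration.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory SQLite database with a tiny citations table.

    Assumed schema: patent_num is the citing patent, cite the cited one.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE citations (patent_num TEXT, cite TEXT)")
    conn.executemany(
        "INSERT INTO citations VALUES (?, ?)",
        [("a", "b"), ("c", "b"), ("b", "d")],  # hypothetical sample data
    )
    return conn


def top_cited(conn, limit=10):
    """Return (patent, citation_count) rows, most cited first."""
    query = """
        SELECT cite AS patent_num, n
        FROM (SELECT cite, count(*) AS n FROM citations GROUP BY cite) AS t1
        ORDER BY n DESC, patent_num
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


if __name__ == "__main__":
    print(top_cited(make_demo_db()))
```

The secondary sort on `patent_num` is an addition that keeps ties in a stable order; the slide's query sorts by count only.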
  • 44. Store the data: NoSQL is a good fit for web data
  • 48. Reshape the data
      citer  citee
      a      b
      c      b
      b      d
      { a : [b], c : [b], b : [d] }
      { b : [a, c], d : [b] }
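The reshaping on this slide, from a citer/citee edge list to a "cites" map and then to its inverse "cited by" map, can be sketched in a few lines of Python. The function names are hypothetical; the sample data is the slide's own.

```python
from collections import defaultdict


def cites(edges):
    """Map each citer to the list of patents it cites."""
    out = defaultdict(list)
    for citer, citee in edges:
        out[citer].append(citee)
    return dict(out)


def cited_by(edges):
    """Map each citee to the list of patents that cite it."""
    out = defaultdict(list)
    for citer, citee in edges:
        out[citee].append(citer)
    return dict(out)


if __name__ == "__main__":
    edges = [("a", "b"), ("c", "b"), ("b", "d")]
    print(cites(edges))     # {'a': ['b'], 'c': ['b'], 'b': ['d']}
    print(cited_by(edges))  # {'b': ['a', 'c'], 'd': ['b']}
```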
  • 49. Redis: In-Memory Data Structure Server
  • 51. Redis
      • HSET key name value
      • SADD key value
      • ZUNIONSTORE
      • HSETNX
      • BRPOPLPUSH
      • …
  • 55. Redis: Global variable for all your programs. Memcached with structure. Really fast
  • 56. Redis: Global variable for all your programs. Memcached with structure. Really really fast
  • 57. Redis: Global variable for all your programs. Memcached with structure. Really, really, astonishingly fast
  • 58. Redis: Global variable for all your programs. Memcached with structure. No, faster than that
  • 72. Reddit
      • Count words by hour: ZSET subreddit:2011-06-21:12 word [count]
      • Comment network: SET thread_id:comments “parent_id:child_id”
      • User network: SET thread_id:users “parent_id:child_id”
      • SET subreddit:threads thread_id
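One way to read this key layout is sketched below: per-hour word counts go into a sorted set keyed by subreddit and hour, and comment edges and thread ids go into plain sets. The helper names and the tokenizer are assumptions, not the talk's code. The sketch only needs a client with `zincrby(name, amount, value)` and `sadd(name, *values)`, which is the redis-py 3.x calling convention, so an in-memory stand-in is included to let it run without a server.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime


class FakeRedis:
    """In-memory stand-in exposing only the two calls this sketch uses."""

    def __init__(self):
        self.zsets = defaultdict(Counter)
        self.sets = defaultdict(set)

    def zincrby(self, name, amount, value):
        self.zsets[name][value] += amount
        return self.zsets[name][value]

    def sadd(self, name, *values):
        before = len(self.sets[name])
        self.sets[name].update(values)
        return len(self.sets[name]) - before


def hour_key(subreddit, when):
    """Build a key like 'subreddit:2011-06-21:12' for one hour bucket."""
    return f"{subreddit}:{when:%Y-%m-%d:%H}"


def record_comment(client, subreddit, thread_id, parent_id, comment_id, body, when):
    """Store one comment's word counts and its place in the thread."""
    # Naive tokenizer (an assumption): lowercase runs of letters/apostrophes.
    words = Counter(re.findall(r"[a-z']+", body.lower()))
    key = hour_key(subreddit, when)
    for word, n in words.items():
        client.zincrby(key, n, word)  # ZSET: word -> count for that hour
    client.sadd(f"{thread_id}:comments", f"{parent_id}:{comment_id}")
    client.sadd(f"{subreddit}:threads", thread_id)


if __name__ == "__main__":
    client = FakeRedis()
    record_comment(client, "programming", "t1", "c0", "c1",
                   "Redis is fast, really fast", datetime(2011, 6, 21, 12, 30))
    print(dict(client.zsets["programming:2011-06-21:12"]))
```

With a real server, passing a redis-py `redis.Redis()` client in place of `FakeRedis()` should work the same way; the user network set would follow the same pattern as the comment set.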
  • 73. Reddit
      • github.com/peppern/exploratorium
      • [ brew | apt-get | port ] install redis
      • www.r-project.org
      • github.com/qwzybug/rredis
      • redis
      • TTR package
  • 74. Reddit (demo)
  • 77. Reddit: Go forth and graph! #exploratorium #osb11 We will hire you. For reals.
  • 78. You Are Now Leaving the Big Data Exploratorium. Please ensure you have your valuables. Noah Pepper @noahmp, Devin Chalmers @qwzybug. #exploratorium #osb11
