The Big Data
       Exploratorium
       A guided tour of open source
       data analysis tools

       Noah Pepper (@noahmp)
       Devin Chalmers (@qwzybug)

       #exploratorium @osb11




Thursday, June 23, 2011               1
Hi,

       • We’re here because...


       • We are...


       • Data Exploration Is...


             • Example 1: Patents


                   • (Chalmers et al. 2010; Buchanan et al. 2010; Pepper et al. 2008)


             • Example 2: Health Care


                   • (Pepper et al. Visweek 2010)


Hi,

                   • Exploratorium #1


                          • Patent citation networks


                             • Graphviz


                             • NetworkX


                   • Exploratorium #2


                          • Reddit comment word usages




Hi,




     • Get the code & data samples:


     • git clone git@github.com:peppern/exploratorium.git




We’re here because...

       • There is a really amazing OSS community in the data space.


       • This is fantastic news for academics, hobbyists, and professionals alike.


       • We want to show what you can do with open source tools, and show you the
         ones we like.


       • We’d love to hear what YOUR favorites are; use #exploratorium to tell us.


       • Data exploration is fun...




We are...

       Noah Pepper - @noahmp
       Devin Chalmers - @qwzybug

       • Academic Data Junkies

             • Our academic home. Research focuses on exploring the nature of
               evolutionary activity through data mining.

       • We’re Sorta Lucky

             • Our startup, where we build data exploration platforms.
We Build Data Exploration Tools!

                                     map.clearhealthcosts.com




What is data exploration and what is an exploratorium

       • Narrow Definition

             • Data exploration is having an iterative relationship with your
               data, analysis, and visualization stack where you build an
               intuitive cognitive model of the information visualized.

       • Why do I say visualization instead of the more general
         ‘representation’?

       • exploratorium, noun [usu. in names]: a scientific museum or similar
         center at which visitors have the opportunity of performing
         prearranged experiments or demonstrations.

             • Yes! That means there’s code and data
Data Exploration Example


             • study evolution of technology in patent records
                   – technology is a window on culture
                   – patents are a window on technology




Patent Networks




Citation Analysis of Patents




Time Series Text Analysis




Some explorations are more open ended




Pointwise Mutual Information (PMI)




          # patents that contain words x and y
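
A minimal sketch of how PMI can be computed from patent counts, assuming the standard definition PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ) with each probability estimated as a fraction of all patents. The function name and the toy counts are illustrative assumptions, not taken from the talk's code.

```python
import math


def pmi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information of words x and y, in bits.

    n_xy: number of patents that contain both x and y
    n_x, n_y: number of patents that contain x (resp. y)
    n_total: total number of patents
    """
    if n_xy == 0:
        return float("-inf")  # the words never co-occur
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))


# Toy counts, made up for illustration: 1,000 patents in total.
print(pmi(n_xy=50, n_x=100, n_y=200, n_total=1000))  # positive: co-occur more than chance
print(pmi(n_xy=20, n_x=100, n_y=200, n_total=1000))  # about 0: co-occur at chance level
```

Positive values mean the two words appear together more often than independence would predict; negative values mean less often.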




PMI distributions


         - see clusters
         - different kinds of clusters
PMI Comparison: Plotting a different way


       [Figure: PMI plots for the words “the”, “optical”, and “cultivar”]

       • PMI integral halfway rank - generality of content?
btw, these are older graphs; now we use ggplot2
Previous Work in Health Care...


       [Figure: bill volume (0 to 500,000) by adjudication type
       (AMB, ASC, DME, ER, IPH, OPH, PRO). Legend: placement in
       distribution of billed; Upper 5%, Bottom 5%.]

       .... with @homerstrong
       at Qmedtrix Systems Inc.
Previous Work in Health Care...
       [Figure: two panels over Amount ($), 10^1 to 10^7: bill volume
       (0 to 120,000) and dollar density (0.0e+00 to 1.4e+09).
       Legend: Billed, First Audit, Second Audit.]

       ... @hadleywickham is a #ballR
       http://had.co.nz
Health Care Data & Code Samples...




                              ...Hahaha Just Kidding

But actually:

       • Qmedtrix R&D team members made source contributions, see:


             • Homer Strong https://github.com/strongh @homerstrong (Lucky Sort)


             • Kevin Lynagh https://github.com/lynaghk (Keming Labs)




Exploratorium #1 Patent Networks




       • Citations amongst the top 10k most cited patents
Grab the graph data:
                          ~/exploratorium/patents/toplinks.dot




                                               Graphviz Art is Pretty!
GraphViz can graph really big graphs... but they get hard to use ->

       <- Psychedelic Patents
Graphviz - Play with Graphs
       (http://www.graphviz.org)

       • sudo port install graphviz or sudo apt-get install graphviz


       • graphing commands: dot, neato, twopi, circo, fdp


       • dot -Tpdf file.dot -o file.pdf


       • More options here:


             • http://www.graphviz.org/content/command-line-invocation


       • Fun options are in the .dot file:


             • http://www.graphviz.org/content/dot-language


Styling dots

       • node [shape=point, width="0.15", color="#0000001c"];


       • edge [arrowsize="0.50", color="#0000001c"];


       • There are tons, read the docs and have fun


       • You can also try more complex things


             • Like constraints, time for example


             • Sometimes too many constraints make GraphViz unhappy...
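
A minimal sketch of generating a styled .dot file from a list of citation pairs, reusing the node and edge attributes shown on this slide. The function name, graph name, and sample patent IDs are hypothetical, and node IDs are assumed not to contain double quotes.

```python
def edges_to_dot(edges, name="citations"):
    """Render (citer, citee) pairs as a Graphviz DOT digraph string."""
    lines = [
        f"digraph {name} {{",
        # Translucent points and arrows, so dense regions show up darker.
        '    node [shape=point, width="0.15", color="#0000001c"];',
        '    edge [arrowsize="0.50", color="#0000001c"];',
    ]
    for citer, citee in edges:
        lines.append(f'    "{citer}" -> "{citee}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical patent IDs, for illustration only.
dot_source = edges_to_dot([("a", "b"), ("c", "b"), ("b", "d")])
print(dot_source)
```

Saving the printed text as sample.dot and running `dot -Tpdf sample.dot -o sample.pdf` should render it; swapping dot for neato or fdp tries a different layout.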




UbiGraph

       • We loved UbiGraph, but don’t know an OSS alternative


       • Renders many nodes (50k+) in 3D with a realtime force-directed layout.


             • On a Mac Pro with 16 GB of RAM


                          • Shout out to Apple: thank you for supporting our research!


       • It’s ‘free’ but development has stalled and since it’s closed source we can’t
         build on it!


       • Alternatives?




Exploratorium #2

       • Making graphs of language using Python, Redis, R, and a bunch of awesome
         libraries


       • Thanks


             • @hadleywickham


             • @homerstrong


             • @antirez


             • Bryan Lewis (http://illposed.net/)




...how?
       Mine — Munge — Visualize




...how?
       github.com/peppern/exploratorium

       [ brew | apt-get | port ] install redis

       www.r-project.org
       github.com/qwzybug/rredis
       redis TTR package




Best show on TV




Mine the data

       • gutenberg.org


       • google.com/ngrams


       • APIs — Twitter, etc.


       • http://code.google.com/apis/socialgraph/


       • Scrape




Store the data




                          Postgres is not too shabby




Store the data




            SELECT cite AS patent_num, count FROM (SELECT cite,
            count(*) AS count FROM citations GROUP BY cite) AS t1
            ORDER BY t1.count DESC LIMIT 10




Store the data




            SELECT "cite", count(*), "year" FROM "citations"
            INNER JOIN (SELECT date_part('year', "grantdate") AS
            "year", "patent_num" AS "patent_num" FROM "patents")
            AS "t1" USING ("patent_num") WHERE (cite IN (12345))
            GROUP BY "year", "cite"




Store the data



            SELECT term, count FROM (SELECT term, count(*) FROM
            (SELECT patent_num, term FROM tfidfs WHERE (tfidf >
            0.05)) AS "t1" INNER JOIN (SELECT * FROM (SELECT
            patent_num FROM patent_lengths WHERE (wordcount >
            10)) AS "t1" INNER JOIN (SELECT * FROM patents WHERE
            (grantdate > '1990-01-01' AND grantdate <
            '2000-01-01')) AS "t2" USING ("patent_num")) AS "t2"
            USING ("patent_num") GROUP BY "term") AS "t3" ORDER
            BY count DESC LIMIT 50;




Store the data




                          NoSQL is a good fit for web data




Reshape the data



                             citer   citee
                              a       b
                              c       b
                              b       d




      { a : [b], c : [b], b: [d] }        { b : [a, c], d : [b] }
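
The reshaping above can be sketched in a few lines of Python: one pass over the (citer, citee) rows builds both the "cites" map and its inverse, the "cited by" map. The function and variable names are illustrative, not taken from the talk's repository.

```python
from collections import defaultdict


def reshape(pairs):
    """Turn (citer, citee) rows into two adjacency maps.

    Returns (cites, cited_by): cites[x] lists what x cites,
    cited_by[x] lists what cites x.
    """
    cites = defaultdict(list)
    cited_by = defaultdict(list)
    for citer, citee in pairs:
        cites[citer].append(citee)
        cited_by[citee].append(citer)
    return dict(cites), dict(cited_by)


rows = [("a", "b"), ("c", "b"), ("b", "d")]
cites, cited_by = reshape(rows)
print(cites)     # {'a': ['b'], 'c': ['b'], 'b': ['d']}
print(cited_by)  # {'b': ['a', 'c'], 'd': ['b']}
```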

Redis




                          In-Memory Data Structure Server




Redis

       • HSET key name value


       • SADD key value


       • ZUNIONSTORE


       • HSETNX


       • BRPOPLPUSH


       •…




Redis

       • Global variable for all your programs


       • Memcached with structure


       • Really fast


       • Really really fast


       • Really, really, astonishingly fast


       • No, faster than that

Reddit

       • Count words by hour   ZSET subreddit:2011-06-21:12
                                     word [count]
       • Comment network       SET thread_id:comments
                                     “parent_id:child_id”

       • User network          SET thread_id:users
                                     “parent_id:child_id”
                               SET subreddit:threads
                                     thread_id
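
A rough sketch of how one comment could be turned into Redis writes under the key layout above. It only builds the command tuples, so it runs without a server. The function name, its parameters, the word-splitting regex, and the use of usernames as members of the users set are assumptions made for illustration.

```python
import re
from datetime import datetime


def redis_commands_for_comment(subreddit, thread_id, parent_id, comment_id,
                               parent_user, user, body, when):
    """Return the Redis commands that would record one comment."""
    # One ZSET per subreddit and hour, e.g. "programming:2011-06-21:12".
    hour_key = f"{subreddit}:{when.strftime('%Y-%m-%d:%H')}"
    commands = []
    # Word counts: member = word, score = running count.
    for word in re.findall(r"[a-z']+", body.lower()):
        commands.append(("ZINCRBY", hour_key, 1, word))
    # Comment network and user network: one edge per reply.
    commands.append(("SADD", f"{thread_id}:comments", f"{parent_id}:{comment_id}"))
    commands.append(("SADD", f"{thread_id}:users", f"{parent_user}:{user}"))
    # Index of threads seen in this subreddit.
    commands.append(("SADD", f"{subreddit}:threads", thread_id))
    return commands


# Made-up comment, for illustration only.
cmds = redis_commands_for_comment(
    subreddit="programming", thread_id="t3_abc",
    parent_id="c1", comment_id="c2",
    parent_user="alice", user="bob",
    body="Redis is fast, really fast",
    when=datetime(2011, 6, 21, 12, 30),
)
for cmd in cmds:
    print(cmd)
```

With a redis-py client, each tuple could be sent with client.execute_command(*cmd). ZINCRBY creates the sorted set and the member on first use, so no setup step is needed.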




Reddit




                          (demo)




Reddit



                           Go forth and graph!

                          #exploratorium #osb11

                             We will hire you.

                                For reals.


You Are Now Leaving
       the Big Data
       Exploratorium
       Please ensure you have your
       valuables.

       Noah Pepper @noahmp
       Devin Chalmers @qwzybug

       #exploratorium #osb11




Thursday, June 23, 2011              54

More Related Content

Similar to The Big Data Exploratorium

2011.11.03.charleston.ldi
2011.11.03.charleston.ldi2011.11.03.charleston.ldi
2011.11.03.charleston.ldiBruce Heterick
 
2011.11.03.charleston.ldi
2011.11.03.charleston.ldi2011.11.03.charleston.ldi
2011.11.03.charleston.ldiBruce Heterick
 
New e-Science Edinburgh Late Edition
New e-Science Edinburgh Late EditionNew e-Science Edinburgh Late Edition
New e-Science Edinburgh Late EditionDavid De Roure
 
Humanizing bioinformatics
Humanizing bioinformaticsHumanizing bioinformatics
Humanizing bioinformaticsJan Aerts
 
The W3C PROV standard: data model for the provenance of information, and enab...
The W3C PROV standard:data model for the provenance of information, and enab...The W3C PROV standard:data model for the provenance of information, and enab...
The W3C PROV standard: data model for the provenance of information, and enab...Paolo Missier
 
Invited talk at the GeoClouds Workshop, Indianapolis, 2009
Invited talk at the GeoClouds Workshop, Indianapolis, 2009Invited talk at the GeoClouds Workshop, Indianapolis, 2009
Invited talk at the GeoClouds Workshop, Indianapolis, 2009Paolo Missier
 
CrossMark and Other Interesting Developments, Aries EMUG Meeting at CrossRef
CrossMark and Other Interesting Developments, Aries EMUG Meeting at CrossRefCrossMark and Other Interesting Developments, Aries EMUG Meeting at CrossRef
CrossMark and Other Interesting Developments, Aries EMUG Meeting at CrossRefCrossref
 
Policy Lunchbox - Digital Science
Policy Lunchbox - Digital SciencePolicy Lunchbox - Digital Science
Policy Lunchbox - Digital ScienceKaitlin Thaney
 
2018 09-03-ses open-fair_practices_in_evolutionary_genomics
2018 09-03-ses open-fair_practices_in_evolutionary_genomics2018 09-03-ses open-fair_practices_in_evolutionary_genomics
2018 09-03-ses open-fair_practices_in_evolutionary_genomicsYannick Wurm
 
CODATA International Training Workshop in Big Data for Science for Researcher...
CODATA International Training Workshop in Big Data for Science for Researcher...CODATA International Training Workshop in Big Data for Science for Researcher...
CODATA International Training Workshop in Big Data for Science for Researcher...Johann van Wyk
 
Improving Support for Researchers: How Data Reuse Can Inform Data Curation
Improving Support for Researchers: How Data Reuse Can Inform Data CurationImproving Support for Researchers: How Data Reuse Can Inform Data Curation
Improving Support for Researchers: How Data Reuse Can Inform Data CurationOCLC
 
Michael Pocock: Citizen Science Project Design
Michael Pocock: Citizen Science Project DesignMichael Pocock: Citizen Science Project Design
Michael Pocock: Citizen Science Project DesignAlice Sheppard
 
International scholarly infrastructures
International scholarly infrastructuresInternational scholarly infrastructures
International scholarly infrastructuresJisc
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search petermurrayrust
 
How to Build a Research Roadmap (avoiding tempting dead-ends)
How to Build a Research Roadmap (avoiding tempting dead-ends)How to Build a Research Roadmap (avoiding tempting dead-ends)
How to Build a Research Roadmap (avoiding tempting dead-ends)Aaron Sloman
 
Sensemaker for Partos Plaza - Irene Guyt
Sensemaker for Partos Plaza  - Irene GuytSensemaker for Partos Plaza  - Irene Guyt
Sensemaker for Partos Plaza - Irene Guytannepartos
 
The State of Open Research Data - OpenCon 2014
The State of Open Research Data - OpenCon 2014The State of Open Research Data - OpenCon 2014
The State of Open Research Data - OpenCon 2014Right to Research
 
The State of Open Research Data
The State of Open Research DataThe State of Open Research Data
The State of Open Research DataRoss Mounce
 
KnowledgeCoin : recognizing and rewarding metadata integration and sharing ...
KnowledgeCoin: recognizing and rewarding metadata integration and sharing ...KnowledgeCoin: recognizing and rewarding metadata integration and sharing ...
KnowledgeCoin : recognizing and rewarding metadata integration and sharing ...Francisco Couto
 

Similar to The Big Data Exploratorium (20)

2011.11.03.charleston.ldi
2011.11.03.charleston.ldi2011.11.03.charleston.ldi
2011.11.03.charleston.ldi
 
2011.11.03.charleston.ldi
2011.11.03.charleston.ldi2011.11.03.charleston.ldi
2011.11.03.charleston.ldi
 
New e-Science Edinburgh Late Edition
New e-Science Edinburgh Late EditionNew e-Science Edinburgh Late Edition
New e-Science Edinburgh Late Edition
 
Humanizing bioinformatics
Humanizing bioinformaticsHumanizing bioinformatics
Humanizing bioinformatics
 
The W3C PROV standard: data model for the provenance of information, and enab...
The W3C PROV standard:data model for the provenance of information, and enab...The W3C PROV standard:data model for the provenance of information, and enab...
The W3C PROV standard: data model for the provenance of information, and enab...
 
Invited talk at the GeoClouds Workshop, Indianapolis, 2009
Invited talk at the GeoClouds Workshop, Indianapolis, 2009Invited talk at the GeoClouds Workshop, Indianapolis, 2009
Invited talk at the GeoClouds Workshop, Indianapolis, 2009
 
CrossMark and Other Interesting Developments, Aries EMUG Meeting at CrossRef
CrossMark and Other Interesting Developments, Aries EMUG Meeting at CrossRefCrossMark and Other Interesting Developments, Aries EMUG Meeting at CrossRef
CrossMark and Other Interesting Developments, Aries EMUG Meeting at CrossRef
 
Policy Lunchbox - Digital Science
Policy Lunchbox - Digital SciencePolicy Lunchbox - Digital Science
Policy Lunchbox - Digital Science
 
Recommandation sociale : filtrage collaboratif et par le contenu
Recommandation sociale : filtrage collaboratif et par le contenuRecommandation sociale : filtrage collaboratif et par le contenu
Recommandation sociale : filtrage collaboratif et par le contenu
 
2018 09-03-ses open-fair_practices_in_evolutionary_genomics
2018 09-03-ses open-fair_practices_in_evolutionary_genomics2018 09-03-ses open-fair_practices_in_evolutionary_genomics
2018 09-03-ses open-fair_practices_in_evolutionary_genomics
 
CODATA International Training Workshop in Big Data for Science for Researcher...
CODATA International Training Workshop in Big Data for Science for Researcher...CODATA International Training Workshop in Big Data for Science for Researcher...
CODATA International Training Workshop in Big Data for Science for Researcher...
 
Improving Support for Researchers: How Data Reuse Can Inform Data Curation
Improving Support for Researchers: How Data Reuse Can Inform Data CurationImproving Support for Researchers: How Data Reuse Can Inform Data Curation
Improving Support for Researchers: How Data Reuse Can Inform Data Curation
 
Michael Pocock: Citizen Science Project Design
Michael Pocock: Citizen Science Project DesignMichael Pocock: Citizen Science Project Design
Michael Pocock: Citizen Science Project Design
 
International scholarly infrastructures
International scholarly infrastructuresInternational scholarly infrastructures
International scholarly infrastructures
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search
 
How to Build a Research Roadmap (avoiding tempting dead-ends)
How to Build a Research Roadmap (avoiding tempting dead-ends)How to Build a Research Roadmap (avoiding tempting dead-ends)
How to Build a Research Roadmap (avoiding tempting dead-ends)
 
Sensemaker for Partos Plaza - Irene Guyt
Sensemaker for Partos Plaza  - Irene GuytSensemaker for Partos Plaza  - Irene Guyt
Sensemaker for Partos Plaza - Irene Guyt
 
The State of Open Research Data - OpenCon 2014
The State of Open Research Data - OpenCon 2014The State of Open Research Data - OpenCon 2014
The State of Open Research Data - OpenCon 2014
 
The State of Open Research Data
The State of Open Research DataThe State of Open Research Data
The State of Open Research Data
 
KnowledgeCoin : recognizing and rewarding metadata integration and sharing ...
KnowledgeCoin: recognizing and rewarding metadata integration and sharing ...KnowledgeCoin: recognizing and rewarding metadata integration and sharing ...
KnowledgeCoin : recognizing and rewarding metadata integration and sharing ...
 

The Big Data Exploratorium

  • 1. The Big Data Exploratorium A guided tour of open source data analysis tools Noah Pepper (@noahmp) Devin Chalmers (@qwzybug) #exploratorium @osb11 Thursday, June 23, 2011 1
  • 2. Hi, • We’re here because... • We are... • Data Exploration Is... • Example 1: Patents • (Chalmers et al. 2010; Buchanan et al. 2010; Pepper et al 2008) • Example 2: Health Care • (Pepper et al. Visweek 2010) Thursday, June 23, 2011 2
  • 3. Hi, • Exploratorium #1 • Patent citation networks • Graphviz • NetworkX • Exploratorium #2 • Reddit comment word usages Thursday, June 23, 2011 3
  • 4. Hi, • Get the code & data samples: • git clone git@github.com:peppern/exploratorium.git Thursday, June 23, 2011 4
  • 5. We’re here because... • There is a really amazing OSS community in the data space. • This is fantastic news for academics, hobbyists, and professionals alike. • We want to show what you can do with open source tools, show you the ones we like. • We’d love to hear about what YOUR favorites are, #exploratorium to tell us. • Data exploration is fun... Thursday, June 23, 2011 5
  • 6. We are... Noah Pepper - @noahmp Devin Chalmers - @qwzybug • Academic Data Junkies • We’re Sorta Lucky Our academic home. Research focuses on on exploring the nature Our startup of evolutionary where we build data activity through data exploration mining platforms Thursday, June 23, 2011 6
  • 7. We Build Data Exploration Tools! map.clearhealthcosts.com Thursday, June 23, 2011 7
  • 8. What is data exploration and what is an exploratorium • Narrow Definition • Why do I say visualization instead of the more • Data exploration is general having an iterative ‘representation’? relationship with your data, analysis, and visualization exploratorium noun [usu. in names ] stack where you a scientific museum or similar center at which visitors have the build an intuitive opportunity of performing prearranged experiments or demonstrations. cognitive model of the information Yes! That means visualized. there’s code and data Thursday, June 23, 2011 8
  • 9. Data Exploration Example • study evolution of technology in patent records – technology is a window on culture – patents are a window on technology Thursday, June 23, 2011 9
  • 11. Citation Analysis of Patents Thursday, June 23, 2011 11
  • 12. Time Series Text Analysis Thursday, June 23, 2011 12
  • 13. Some explorations are more open ended Thursday, June 23, 2011 13
  • 14. Pointwise Mutual Information (PMI) # patents that contain words x and y Thursday, June 23, 2011 14
  • 15. PMI distributions - see clusters - different kinds of clusters Thursday, June 23, 2011 15
  • 16. PMI Comparison: Plotting a different way “the” PMI integral halfway rank “optical” - generality of content? “cultivar” Thursday, June 23, 2011 16
  • 17. btw, these are older graphs, now we use ggplot2 Thursday, June 23, 2011 17
  • 18. Previous Work in Health Care... [Figure: bill volume (0 to 500,000) by adjudication type (AMB, ASC, DME, ER, IPH, OPH, PRO); legend “Placement in distribution of billed”: Upper 5%, Bottom 5%] ... with @homerstrong at Qmedtrix Systems Inc.
  • 19. Previous Work in Health Care... [Figure: bill volume and dollar density against amount ($) on a log scale from 10^1 to 10^7; series: Billed, First Audit, Second Audit] ... @hadleywickham is a #ballR http://had.co.nz
  • 20. Health Care Data & Code Samples... ...Hahaha Just Kidding
  • 21. But actually: • Qmedtrix R&D team members made source contributions, see: • Homer Strong https://github.com/strongh @homerstrong (Lucky Sort) • Kevin Lynagh https://github.com/lynaghk (Keming Labs)
  • 22. Exploratorium #1 Patent Networks citations amongst top 10k most cited patents
  • 23. Grab the graph data: ~/exploratorium/patents/toplinks.dot Graphviz Art is Pretty!
  • 24. GraphViz Can Graph really big graphs... but they get hard to use -> <- Psychedelic Patents
  • 25. Graphviz - Play with Graphs (http://www.graphviz.org) • sudo port install graphviz or sudo apt-get install graphviz • graphing commands: dot, neato, twopi, circo, fdp • dot -Tpdf file.dot -o file.pdf • More options here: • http://www.graphviz.org/content/command-line-invocation • Fun options are in the .dot file: • http://www.graphviz.org/content/dot-language
  • 26. Styling dots • node [shape=point, width="0.15", color="#0000001c"]; • edge [arrowsize="0.50", color="#0000001c"]; • There are tons of options; read the docs and have fun • You can also try more complex things • Like constraints, time for example • Sometimes too many constraints make GraphViz unhappy...
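To see where the two styling lines above sit inside a complete file, here is a small sketch that writes DOT text from a list of (citer, citee) pairs. The `to_dot` helper, the graph name, and the patent numbers are made up for illustration; only the `node [...]` and `edge [...]` attribute lines come from the slide.

```python
def to_dot(edges, name="citations"):
    """Render (citer, citee) pairs as Graphviz DOT text using the slide's node and edge styling."""
    lines = [
        f"digraph {name} {{",
        '  node [shape=point, width="0.15", color="#0000001c"];',
        '  edge [arrowsize="0.50", color="#0000001c"];',
    ]
    lines += [f'  "{src}" -> "{dst}";' for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)

# Hypothetical patent numbers, for illustration only.
edges = [("4000001", "3900001"), ("4000002", "3900001"), ("3900001", "3800001")]
dot_text = to_dot(edges)
print(dot_text)
```

Saving the printed text to a file such as sample.dot and running `dot -Tpdf sample.dot -o sample.pdf` should render it; swapping `dot` for `neato` or `fdp` tries a different layout on the same file.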
  • 28. UbiGraph • We loved UbiGraph, but don’t know an OSS alternative • Renders many nodes (50k+) in 3D with a real-time FD layout • On a Mac Pro with 16 GB of RAM • Shout out to Apple: thank you for supporting our research! • It’s ‘free’, but development has stalled, and since it’s closed source we can’t build on it! • Alternatives?
  • 29. Exploratorium #2 • Making graphs of language using python, redis, R and a bunch of awesome libraries • Thanks • @hadleywickham • @homerstrong • @antirez • Bryan Lewis (http://illposed.net/)
  • 30. ...how? Mine — Munge — Visualize Thursday, June 23, 2011 30
  • 31. ...how? github.com/peppern/exploratorium [ brew | apt-get | port ] install redis www.r-project.org github.com/qwzybug/rredis redis TTR package
  • 32. Best show on TV
  • 37. Mine the data • gutenberg.org • google.com/ngrams • APIs — Twitter, etc. • http://code.google.com/apis/socialgraph/ • Scrape Thursday, June 23, 2011 34
  • 39. Store the data Postgres is not too shabby
  • 40. Store the data SELECT cite AS patent_num, count FROM (SELECT cite, count(*) AS count FROM citations GROUP BY cite) AS t1 ORDER BY t1.count DESC LIMIT 10
  • 41. Store the data SELECT "cite", count(*), "year" FROM "citations" INNER JOIN (SELECT date_part('year', "grantdate") AS "year", "patent_num" AS "patent_num" FROM "patents") AS "t1" USING ("patent_num") WHERE (cite IN (12345)) GROUP BY "year", "cite"
  • 42. Store the data SELECT term, count FROM (SELECT term, count(*) FROM (SELECT patent_num, term FROM tfidfs WHERE (tfidf > 0.05)) AS "t1" INNER JOIN (SELECT * FROM (SELECT patent_num FROM patent_lengths WHERE (wordcount > 10)) AS "t1" INNER JOIN (SELECT * FROM patents WHERE (grantdate > '1990-01-01' AND grantdate < '2000-01-01')) AS "t2" USING ("patent_num")) AS "t2" USING ("patent_num") GROUP BY "term") AS "t3" ORDER BY count DESC LIMIT 50;
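The first query above (most-cited patents) can be tried without a Postgres install. This sketch swaps in SQLite through Python's built-in sqlite3 module, which is not what the talk used, and assumes a two-column `citations(patent_num, cite)` table. That layout is a guess based on the column names in the queries, and the rows are toy data.

```python
import sqlite3

# In-memory SQLite database standing in for the Postgres instance from the talk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citations (patent_num TEXT, cite TEXT)")

# Toy rows: patent_num is the citing patent, cite is the patent it cites (assumed layout).
conn.executemany(
    "INSERT INTO citations VALUES (?, ?)",
    [("a", "b"), ("c", "b"), ("b", "d")],
)

# Same shape as the slide's query: count citations per cited patent, keep the top ten.
top_cited = conn.execute(
    """
    SELECT cite AS patent_num, n
    FROM (SELECT cite, count(*) AS n FROM citations GROUP BY cite) AS t1
    ORDER BY n DESC, patent_num
    LIMIT 10
    """
).fetchall()
print(top_cited)
```

The extra `patent_num` sort key only makes ties deterministic; it is not part of the original query.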
  • 44. Store the data NoSQL is a good fit for web data
  • 48. Reshape the data • Edge list (citer, citee): (a, b), (c, b), (b, d) • Grouped by citer: { a : [b], c : [b], b : [d] } • Grouped by citee: { b : [a, c], d : [b] }
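The reshaping on this slide, from a flat citer/citee table to one map keyed by citer and another keyed by citee, can be sketched in a few lines. The helper name `group_edges` is hypothetical; the three edges are the ones shown on the slide.

```python
from collections import defaultdict

def group_edges(edges, invert=False):
    """Group (citer, citee) pairs by citer, or by citee when invert=True."""
    grouped = defaultdict(list)
    for citer, citee in edges:
        key, value = (citee, citer) if invert else (citer, citee)
        grouped[key].append(value)
    return dict(grouped)

edges = [("a", "b"), ("c", "b"), ("b", "d")]
cites = group_edges(edges)                   # who each patent cites
cited_by = group_edges(edges, invert=True)   # who cites each patent
print(cites)
print(cited_by)
```

Keeping both directions costs twice the memory but makes "what does X cite" and "who cites X" each a single lookup.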
  • 49. Redis In-Memory Data Structure Server
  • 51. Redis • HSET key name value • SADD key value • ZUNIONSTORE • HSETNX • BRPOPLPUSH • …
  • 55. Redis Global variable for all your programs Memcached with structure Really fast
  • 56. Redis Global variable for all your programs Memcached with structure Really really fast
  • 57. Redis Global variable for all your programs Memcached with structure Really, really, astonishingly fast
  • 58. Redis Global variable for all your programs Memcached with structure No, faster than that
  • 72. Reddit • Count words by hour: ZSET subreddit:2011-06-21:12 word [count] • Comment network: SET thread_id:comments “parent_id:child_id” • User network: SET thread_id:users “parent_id:child_id” • SET subreddit:threads thread_id
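One way to read the key layout above is as a recipe for the Redis commands issued per comment. The sketch below only builds those commands as tuples, so it runs without a Redis server. The function name, its parameters, and the choice of `ZINCRBY` for the per-hour word counts are assumptions, as is reading the `thread_id:users` members as "parent user:replying user" pairs. The talk's actual code lives in the exploratorium repository and may differ.

```python
from datetime import datetime, timezone

def commands_for_comment(subreddit, thread_id, parent_id, comment_id,
                         parent_user, user, body, created_utc):
    """Return Redis commands (as tuples) recording one comment under the slide's key layout."""
    # Hour bucket such as "2011-06-21:12", matching the slide's ZSET key suffix.
    hour = datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m-%d:%H")
    # One sorted-set increment per word occurrence in the comment body.
    cmds = [("ZINCRBY", f"{subreddit}:{hour}", 1, word) for word in body.lower().split()]
    # Edges of the comment network and the user network, stored as "parent:child" strings.
    cmds.append(("SADD", f"{thread_id}:comments", f"{parent_id}:{comment_id}"))
    cmds.append(("SADD", f"{thread_id}:users", f"{parent_user}:{user}"))
    # Index of threads seen in this subreddit.
    cmds.append(("SADD", f"{subreddit}:threads", thread_id))
    return cmds

# Hypothetical comment, for illustration only.
ts = datetime(2011, 6, 21, 12, 30, tzinfo=timezone.utc).timestamp()
cmds = commands_for_comment("programming", "t3_abc", "c1", "c2",
                            "alice", "bob", "Redis is fast fast", ts)
for cmd in cmds:
    print(*cmd)
```

With a running server and the redis-py client, each tuple could presumably be sent with `client.execute_command(*cmd)`. A repeated word produces repeated increments, so the sorted-set score ends up as its count for that hour.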
  • 73. Reddit github.com/peppern/exploratorium [ brew | apt-get | port ] install redis www.r-project.org github.com/qwzybug/rredis redis TTR package
  • 74. Reddit (demo)
  • 77. Reddit Go forth and graph! #exploratorium #osb11 We will hire you. For reals.
  • 78. You Are Now Leaving the Big Data Exploratorium Please ensure you have your valuables. Noah Pepper @noahmp Devin Chalmers @qwzybug #exploratorium #osb11