The Big Data Exploratorium

A guided tour of open source data analysis tools

Noah Pepper (@noahmp)
Devin Chalmers (@qwzybug)

#exploratorium @osb11

Thursday, June 23, 2011

Hi,

• We’re here because...

• We are...

• Data Exploration Is...

• Example 1: Patents

• (Chalmers et al. 2010; Buchanan et al. 2010; Pepper et al. 2008)

• Example 2: Health Care

• (Pepper et al. Visweek 2010)


Hi,

• Exploratorium #1

• Patent citation networks

• Graphviz

• NetworkX

• Exploratorium #2

• Reddit comment word usages


Hi,

• Get the code & data samples:

• git clone git@github.com:peppern/exploratorium.git


We’re here because...

• There is a really amazing OSS community in the data space.

• This is fantastic news for academics, hobbyists, and professionals alike.

• We want to show what you can do with open source tools, and show you the ones we like.

• We’d love to hear what YOUR favorites are; tweet #exploratorium to tell us.

• Data exploration is fun...


We are...

Noah Pepper - @noahmp
Devin Chalmers - @qwzybug
• Academic Data Junkies
Our academic home. Research focuses on exploring the nature of evolutionary activity through data mining.

• We’re Sorta Lucky
Our startup, where we build data exploration platforms.

We Build Data Exploration Tools!

map.clearhealthcosts.com


What is data exploration and what is an exploratorium?

• Narrow Definition

• Data exploration is having an iterative relationship with your data, analysis, and visualization stack where you build an intuitive cognitive model of the information visualized.

• Why do I say visualization instead of the more general ‘representation’?

exploratorium
noun [usu. in names]
a scientific museum or similar center at which visitors have the opportunity of performing prearranged experiments or demonstrations.

Yes! That means there’s code and data


Data Exploration Example

• study evolution of technology in patent records
– technology is a window on culture
– patents are a window on technology


Patent Networks


Citation Analysis of Patents


Time Series Text Analysis


Some explorations are more open ended


Pointwise Mutual Information (PMI)

# patents that contain words x and y
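
The formula on this slide did not survive extraction; only the label for the joint count remains. As a rough sketch, assuming the standard definition PMI(x, y) = log( p(x, y) / (p(x) p(y)) ) with each probability estimated as the fraction of documents containing the word(s), the computation might look like this. The function name and the toy "patents" are hypothetical, not taken from the talk's code.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Return {(x, y): PMI} for every word pair that co-occurs in a document.

    Assumes document-level co-occurrence: a pair counts once per document.
    """
    n_docs = len(documents)
    word_counts = Counter()   # number of documents containing each word
    pair_counts = Counter()   # number of documents containing both words
    for text in documents:
        words = sorted(set(text.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    table = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / n_docs
        p_x = word_counts[x] / n_docs
        p_y = word_counts[y] / n_docs
        table[(x, y)] = math.log(p_xy / (p_x * p_y))
    return table


# Hypothetical toy corpus standing in for patent texts.
toy_patents = [
    "the optical fiber",
    "the optical lens",
    "the plant cultivar",
    "the seed",
]
scores = pmi_table(toy_patents)
```

In this toy corpus a word that appears everywhere, like "the", carries no information about its neighbors, while a rare, specific pair such as "cultivar" and "plant" scores high.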


PMI distributions

- see clusters
- different kinds of clusters


PMI Comparison: Plotting a different way

“the”

PMI integral
halfway rank

“optical” - generality of content?

“cultivar”


btw, these are older graphs; now we use ggplot2


Previous Work in Health Care...

[Figure: bill volume (0 to 500,000) by adjudication type (AMB, ASC, DME, ER, IPH, OPH, PRO); legend: placement in distribution of billed, Upper 5% and Bottom 5%]

.... with @homerstrong
at Qmedtrix Systems Inc.

Previous Work in Health Care...
[Figure: two panels sharing a log-scale x-axis, Amount ($), from 10^1 to 10^7. Top: bill volume (0 to 120,000). Bottom: dollar density (0 to 1.4e+09) for Billed, First Audit, and Second Audit.]

... @hadleywickham is a #ballR
http://had.co.nz

Health Care Data & Code Samples...

...Hahaha Just Kidding


But actually:

• Qmedtrix R&D team members made source contributions, see:

• Homer Strong https://github.com/strongh @homerstrong (Lucky Sort)

• Kevin Lynagh https://github.com/lynaghk (Keming Labs)


Exploratorium #1 Patent Networks

Citations amongst the top 10k most cited patents


Grab the graph data:
~/exploratorium/patents/toplinks.dot

Graphviz Art is Pretty!

Graphviz can graph really big graphs... but they get hard to use ->

<- Psychedelic Patents


Graphviz - Play with Graphs
(http://www.graphviz.org)

• sudo port install graphviz or sudo apt-get install graphviz

• graphing commands: dot, neato, twopi, circo, fdp

• dot -Tpdf -o file.pdf file.dot

• More options here:

• http://www.graphviz.org/content/command-line-invocation

• Fun options are in the .dot file:

• http://www.graphviz.org/content/dot-language


Styling dots

• node [shape=point, width="0.15",color="#0000001c"];

• edge [arrowsize="0.50", color="#0000001c"];

• There are tons, read the docs and have fun

• You can also try more complex things

• Like constraints, time for example

• Sometimes too many constraints make Graphviz unhappy...
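
To make the styling and constraint ideas above concrete, here is a minimal sketch that writes DOT text from a list of citation pairs. It reuses the node and edge attributes from this slide and, as one way to express a time constraint, places patents from the same year on the same rank with `rank=same`. The function name, the toy edges, and the year mapping are hypothetical; the talk's own generator may work differently.

```python
from collections import defaultdict


def citations_to_dot(edges, years=None):
    """Build DOT source for a citation graph.

    edges: iterable of (citer, citee) pairs.
    years: optional {patent: year}; patents sharing a year share a rank.
    """
    lines = [
        "digraph citations {",
        '  node [shape=point, width="0.15", color="#0000001c"];',
        '  edge [arrowsize="0.50", color="#0000001c"];',
    ]
    if years:
        by_year = defaultdict(list)
        for patent, year in years.items():
            by_year[year].append(patent)
        # One rank=same subgraph per year keeps contemporaries aligned.
        for year in sorted(by_year):
            members = "; ".join(f'"{p}"' for p in sorted(by_year[year]))
            lines.append(f"  {{ rank=same; {members}; }}")
    for citer, citee in edges:
        lines.append(f'  "{citer}" -> "{citee}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical toy data.
toy_edges = [("a", "b"), ("c", "b"), ("b", "d")]
toy_years = {"a": 1990, "c": 1990, "b": 1985, "d": 1980}
dot_text = citations_to_dot(toy_edges, toy_years)
```

Saving `dot_text` to a file such as toy.dot and running `dot -Tpdf -o toy.pdf toy.dot` should render it, assuming Graphviz is installed.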


UbiGraph

• We loved UbiGraph, but don’t know an OSS alternative

• Renders many nodes (50k+) in 3D with a realtime force-directed layout.

• On a Mac Pro with 16 GB of RAM

• Shout out to Apple: thank you for supporting our research!

• It’s ‘free’, but development has stalled, and since it’s closed source we can’t build on it!

• Alternatives?


Exploratorium #2

• Making graphs of language using Python, Redis, R and a bunch of awesome libraries

• Thanks

• @hadleywickham

• @homerstrong

• @antirez

• Bryan Lewis (http://illposed.net/)


...how?
Mine — Munge — Visualize


...how?
github.com/peppern/exploratorium

[ brew | apt-get | port ] install redis

www.r-project.org
github.com/qwzybug/rredis
redis TTR package


Best show on TV


Mine the data

• gutenberg.org

• google.com/ngrams

• APIs — Twitter, etc.

• http://code.google.com/apis/socialgraph/

• Scrape


Store the data

Postgres is not too shabby


Store the data

SELECT cite AS patent_num, count
FROM (SELECT cite, count(*) AS count
      FROM citations GROUP BY cite) AS t1
ORDER BY t1.count DESC
LIMIT 10;
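
For trying the query above without a Postgres server, here is a small sketch that runs it against an in-memory SQLite database from Python's standard library. SQLite is a stand-in here, not what the talk used, and the two-column `citations` schema (`patent_num` citing `cite`) and the toy rows are assumptions inferred from the queries on these slides.

```python
import sqlite3

# In-memory database with an assumed two-column citations table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citations (patent_num TEXT, cite TEXT)")
conn.executemany(
    "INSERT INTO citations VALUES (?, ?)",
    [("a", "b"), ("c", "b"), ("b", "d")],  # hypothetical toy rows
)

# Most cited patents: count rows per cited patent, largest first.
top_cited = conn.execute(
    """
    SELECT cite AS patent_num, count
    FROM (SELECT cite, count(*) AS count
          FROM citations GROUP BY cite) AS t1
    ORDER BY t1.count DESC
    LIMIT 10
    """
).fetchall()
conn.close()
```

With these toy rows, patent "b" is cited twice and "d" once, so "b" comes first.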


Store the data

SELECT cite, count(*), year
FROM citations
INNER JOIN (SELECT date_part('year', grantdate) AS year,
                   patent_num
            FROM patents) AS t1 USING (patent_num)
WHERE cite IN (12345)
GROUP BY year, cite;


Store the data

SELECT term, count
FROM (SELECT term, count(*)
      FROM (SELECT patent_num, term FROM tfidfs
            WHERE tfidf > 0.05) AS "t1"
      INNER JOIN (SELECT *
                  FROM (SELECT patent_num FROM patent_lengths
                        WHERE wordcount > 10) AS "t1"
                  INNER JOIN (SELECT * FROM patents
                              WHERE grantdate > '1990-01-01'
                                AND grantdate < '2000-01-01') AS "t2"
                  USING ("patent_num")) AS "t2"
      USING ("patent_num")
      GROUP BY "term") AS "t3"
ORDER BY count DESC
LIMIT 50;


Store the data

NoSQL is a good fit for web data


Reshape the data

citer  citee
a      b
c      b
b      d

citer -> citees: { a: [b], c: [b], b: [d] }
citee -> citers: { b: [a, c], d: [b] }
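
The reshaping shown on this slide, from a two-column citation table into a forward map and a reverse map, can be sketched in a few lines. The function and variable names are hypothetical; the data is the slide's own three-row example.

```python
from collections import defaultdict


def build_adjacency(pairs):
    """Turn (citer, citee) rows into two lookup tables.

    Returns (cites, cited_by): who each patent cites, and who cites it.
    """
    cites = defaultdict(list)
    cited_by = defaultdict(list)
    for citer, citee in pairs:
        cites[citer].append(citee)
        cited_by[citee].append(citer)
    return dict(cites), dict(cited_by)


pairs = [("a", "b"), ("c", "b"), ("b", "d")]
cites, cited_by = build_adjacency(pairs)
```

Keeping both directions costs twice the memory but makes "what does this patent cite?" and "who cites this patent?" equally cheap lookups.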


Redis

In-Memory Data Structure Server


Redis

• HSET key name value

• SADD key value

• ZUNIONSTORE

• HSETNX

• BRPOPLPUSH

•…


Redis

Global variable for all your programs


Redis


Memcached with structure


Redis



Really fast


Redis



Really really fast


Redis



Really, really, astonishingly fast


Redis



No, faster than that


Reddit

• Count words by hour: ZSET subreddit:2011-06-21:12, members word [count]

• Comment network: SET thread_id:comments, members “parent_id:child_id”

• User network: SET thread_id:users

• SET subreddit:threads, members thread_id
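
The key layout on these slides can be sketched as one indexing function per comment. To keep the example runnable without a Redis server, it writes to a tiny in-memory stand-in exposing `zincrby(name, amount, value)` and `sadd(name, *values)`, which are the method shapes of the redis-py client (3.0 and later); a real `redis.Redis()` instance should be usable in its place. The class and function names, the toy comment, and the choice to store author names in the `thread_id:users` set are assumptions, not taken from the talk's code.

```python
from collections import defaultdict
from datetime import datetime


class InMemoryStore:
    """Hypothetical stand-in for the two Redis commands used below."""

    def __init__(self):
        self.zsets = defaultdict(lambda: defaultdict(float))
        self.sets = defaultdict(set)

    def zincrby(self, name, amount, value):
        # Sorted-set increment: bump the score of `value` under key `name`.
        self.zsets[name][value] += amount
        return self.zsets[name][value]

    def sadd(self, name, *values):
        # Set add: returns how many members were newly added.
        before = len(self.sets[name])
        self.sets[name].update(values)
        return len(self.sets[name]) - before


def index_comment(store, subreddit, thread_id, parent_id, comment_id,
                  author, body, posted_at):
    """Record one comment under the keys sketched on the slides."""
    hour_key = f"{subreddit}:{posted_at.strftime('%Y-%m-%d:%H')}"
    for word in body.lower().split():
        store.zincrby(hour_key, 1, word)           # word counts by hour
    store.sadd(f"{thread_id}:comments", f"{parent_id}:{comment_id}")
    store.sadd(f"{thread_id}:users", author)       # assumed member: author
    store.sadd(f"{subreddit}:threads", thread_id)
    return hour_key


store = InMemoryStore()
key = index_comment(
    store, "programming", "t3_abc", "c1", "c2", "alice",
    "Redis is fast really fast", datetime(2011, 6, 21, 12, 30),
)
```

Because every write is a single increment or set insertion, many scraper processes could share one store without coordinating, which fits the "global variable for all your programs" framing.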


Reddit

(demo)


Reddit

Go forth and graph!

#exploratorium #osb11


We will hire you.

For reals.


You Are Now Leaving the Big Data Exploratorium
Please ensure you have your valuables.

Noah Pepper @noahmp
Devin Chalmers @qwzybug


