This document provides an overview of an "Exploratorium", which is described as a guided tour of open source data analysis tools. It discusses exploring patent and health care data using tools like Graphviz and NetworkX to analyze networks, and Redis to store and reshape data. Examples are given of analyzing Reddit comment and user networks by counting words and mapping relationships between comments and users. The document encourages sharing favorite open source tools using #exploratorium.
The Big Data Exploratorium
1. The Big Data Exploratorium
A guided tour of open source data analysis tools
Noah Pepper (@noahmp)
Devin Chalmers (@qwzybug)
#exploratorium @osb11
Thursday, June 23, 2011 1
2. Hi,
• We’re here because...
• We are...
• Data Exploration Is...
• Example 1: Patents
• (Chalmers et al. 2010; Buchanan et al. 2010; Pepper et al. 2008)
• Example 2: Health Care
• (Pepper et al. Visweek 2010)
4. Hi,
• Get the code & data samples:
• git clone git@github.com:peppern/exploratorium.git
5. We’re here because...
• There is a really amazing OSS community in the data space.
• This is fantastic news for academics, hobbyists, and professionals alike.
• We want to show what you can do with open source tools, and show you the ones we like.
• We’d love to hear what YOUR favorites are: tag #exploratorium to tell us.
• Data exploration is fun...
6. We are...
Noah Pepper - @noahmp
Devin Chalmers - @qwzybug
• Academic Data Junkies
Our academic home. Research focuses on exploring the nature of evolutionary activity through data mining
• We’re Sorta Lucky
Our startup where we build data exploration platforms
7. We Build Data Exploration Tools!
map.clearhealthcosts.com
8. What is data exploration and what is an exploratorium
• Narrow Definition
• Data exploration is having an iterative relationship with your data, analysis, and visualization stack where you build an intuitive cognitive model of the information visualized.
• Why do I say visualization instead of the more general ‘representation’?
exploratorium, noun [usu. in names]: a scientific museum or similar center at which visitors have the opportunity of performing prearranged experiments or demonstrations.
Yes! That means there’s code and data
9. Data Exploration Example
• study evolution of technology in patent records
– technology is a window on culture
– patents are a window on technology
15. PMI distributions
- see clusters
- different kinds of clusters
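The slides do not spell out what PMI stands for or how it was computed. Assuming it means pointwise mutual information between terms that co-occur in the same document, a minimal sketch looks like this. The function name `pmi_table` and the toy documents are hypothetical and are not taken from the exploratorium repository.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(documents):
    """Return {(term_a, term_b): PMI} for term pairs sharing a document.

    Assumption: PMI = log2( p(a, b) / (p(a) * p(b)) ), with probabilities
    estimated as document frequencies. The talk may have used another variant.
    """
    n_docs = len(documents)
    term_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        terms = sorted(set(doc))          # count each term once per document
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))
    table = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n_docs
        p_a = term_counts[a] / n_docs
        p_b = term_counts[b] / n_docs
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


# Hypothetical toy corpus, for illustration only.
docs = [
    ["optical", "lens"],
    ["optical", "lens"],
    ["cultivar", "plant"],
    ["the", "optical"],
]
pmi = pmi_table(docs)
```

Pairs that only ever appear together, such as the made-up "cultivar" and "plant" pair here, score higher than pairs involving a term that shows up everywhere.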
16. PMI Comparison: Plotting a different way
[Figure: terms plotted by PMI integral and halfway rank; labelled points include “the”, “optical” (generality of content?), and “cultivar”]
17. btw, these are older graphs, now we use ggplot2
18. Previous Work in Health Care...
[Figure: bill volume (0 to 500,000) by adjudication type (AMB, ASC, DME, ER, IPH, OPH, PRO); legend: placement in distribution of billed (upper 5%, bottom 5%)]
.... with @homerstrong at Qmedtrix Systems Inc.
19. Previous Work in Health Care...
[Figure: two panels over Amount ($), log scale from 10^1 to 10^7. Top panel: bill volume. Bottom panel: dollar density. Series: Billed, First Audit, Second Audit]
... @hadleywickham is a #ballR
http://had.co.nz
20. Health Care Data & Code Samples...
...Hahaha Just Kidding
21. But actually:
• Qmedtrix R&D team members made source contributions, see:
• Homer Strong https://github.com/strongh @homerstrong (Lucky Sort)
• Kevin Lynagh https://github.com/lynaghk (Keming Labs)
22. Exploratorium #1 Patent Networks
citations amongst top 10k most cited patents
23. Grab the graph data:
~/exploratorium/patents/toplinks.dot
Graphviz Art is Pretty!
24. GraphViz can graph really big graphs... but they get hard to use
Psychedelic Patents
25. Graphviz - Play with Graphs
(http://www.graphviz.org)
• sudo port install graphviz or sudo apt-get install graphviz
• graphing commands: dot, neato, twopi, circo, fdp
• dot -Tpdf file.dot -o file.pdf
• More options here:
• http://www.graphviz.org/content/command-line-invocation
• Fun options are in the .dot file:
• http://www.graphviz.org/content/dot-language
26. Styling dots
• node [shape=point, width="0.15",color="#0000001c"];
• edge [arrowsize="0.50", color="#0000001c"];
• There are tons, read the docs and have fun
• You can also try more complex things
• Like constraints (time, for example)
• Sometimes too many constraints make GraphViz unhappy...
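As a rough illustration of the styling above, the following sketch builds DOT source from a list of citation pairs in plain Python. The helper name `to_dot` and the patent numbers are made up for the example, and the real toplinks.dot in the repository may be laid out differently.

```python
def to_dot(edges, name="citations"):
    """Return DOT source for a directed graph of (citing, cited) pairs."""
    lines = [f"digraph {name} {{"]
    # Styling taken from the slide: small, nearly transparent points and arrows.
    lines.append('  node [shape=point, width="0.15", color="#0000001c"];')
    lines.append('  edge [arrowsize="0.50", color="#0000001c"];')
    for citing, cited in edges:
        lines.append(f'  "{citing}" -> "{cited}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical patent numbers, for illustration only.
sample_edges = [
    ("5000001", "4000001"),
    ("5000002", "4000001"),
    ("5000002", "5000001"),
]
dot_source = to_dot(sample_edges)
```

Saving `dot_source` to a file such as sample.dot and running `dot -Tpdf sample.dot -o sample.pdf` should render it. Swapping `dot` for `neato` or `fdp` gives a different layout of the same graph.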
28. UbiGraph
• We loved UbiGraph, but don’t know an OSS alternative
• Renders many nodes (50k+) in 3D with a realtime force-directed layout.
• Mac Pro with 16 GB of RAM
• Shout out to Apple: thank you for supporting our research!
• It’s ‘free’, but development has stalled, and since it’s closed source we can’t build on it!
• Alternatives?
29. Exploratorium #2
• Making graphs of language using Python, Redis, R and a bunch of awesome libraries
• Thanks
• @hadleywickham
• @homerstrong
• @antirez
• Bryan Lewis (http://illposed.net/)
39. Store the data
Postgres is not too shabby
40. Store the data
SELECT cite AS patent_num, count FROM (SELECT cite,
count(*) AS count FROM citations GROUP BY cite) AS t1
ORDER BY t1.count DESC LIMIT 10
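To try the top-cited query without a Postgres server, here is a small sketch that runs the same shape of query against an in-memory SQLite database. SQLite is a stand-in here, not what the talk used. The two-column `citations` layout (citing `patent_num`, cited `cite`) is inferred from the queries on these slides, the rows are invented, and the `count` alias is renamed to `n_citations` for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citations (patent_num INTEGER, cite INTEGER)")

# Hypothetical rows: (citing patent, cited patent).
rows = [(10, 1), (11, 1), (12, 1), (11, 2), (12, 2), (12, 3)]
conn.executemany("INSERT INTO citations VALUES (?, ?)", rows)

# Count how often each patent is cited and keep the ten most cited.
top_cited = conn.execute(
    """
    SELECT cite AS patent_num, n_citations
    FROM (SELECT cite, count(*) AS n_citations
          FROM citations GROUP BY cite) AS t1
    ORDER BY n_citations DESC
    LIMIT 10
    """
).fetchall()

for patent_num, n_citations in top_cited:
    print(patent_num, n_citations)
```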
41. Store the data
SELECT "cite", count(*), "year" FROM "citations"
INNER JOIN (SELECT date_part('year', "grantdate") AS
"year", "patent_num" AS "patent_num" FROM "patents")
AS "t1" USING ("patent_num") WHERE (cite IN (12345))
GROUP BY "year", "cite"
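The per-year query can be tried the same way. This sketch again uses in-memory SQLite as a stand-in for Postgres, so `date_part('year', ...)` is replaced by SQLite's `strftime('%Y', ...)`. The table contents are invented. The join is on the citing patent, so the year is the grant year of the patent doing the citing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE patents (patent_num INTEGER PRIMARY KEY, grantdate TEXT);
    CREATE TABLE citations (patent_num INTEGER, cite INTEGER);
    """
)

# Hypothetical data: patent 12345 is cited twice in 1995 and once in 1997.
conn.executemany(
    "INSERT INTO patents VALUES (?, ?)",
    [(12345, "1990-05-01"), (20001, "1995-02-14"),
     (20002, "1995-11-30"), (20003, "1997-07-04")],
)
conn.executemany(
    "INSERT INTO citations VALUES (?, ?)",
    [(20001, 12345), (20002, 12345), (20003, 12345), (20003, 20001)],
)

# Citations received by patent 12345, grouped by the citing patent's grant year.
per_year = conn.execute(
    """
    SELECT cite, count(*) AS n_citations, year
    FROM citations
    INNER JOIN (SELECT CAST(strftime('%Y', grantdate) AS INTEGER) AS year,
                       patent_num
                FROM patents) AS t1 USING (patent_num)
    WHERE cite IN (12345)
    GROUP BY year, cite
    ORDER BY year
    """
).fetchall()
```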
42. Store the data
SELECT term, count FROM (SELECT term, count(*) FROM
(SELECT patent_num, term FROM tfidfs WHERE (tfidf >
0.05)) AS "t1" INNER JOIN (SELECT * FROM (SELECT
patent_num FROM patent_lengths WHERE (wordcount >
10)) AS "t1" INNER JOIN (SELECT * FROM patents WHERE
(grantdate > '1990-01-01' AND grantdate <
'2000-01-01')) AS "t2" USING ("patent_num")) AS "t2"
USING ("patent_num") GROUP BY "term") AS "t3" ORDER
BY count DESC LIMIT 50;
72. Reddit
• Count words by hour: ZSET subreddit:2011-06-21:12, members: word [count]
• Comment network: SET thread_id:comments, members: “parent_id:child_id”
• User network: SET thread_id:users, members: “parent_id:child_id”
• SET subreddit:threads, members: thread_id
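The key layout above can be tried without a Redis server. This sketch mimics it with plain Python containers: a `Counter` per key stands in for a ZSET and a `set` per key for a SET. The function name `record_comment`, the comment fields, and the sample data are hypothetical. Reading the user-network members as "parent author:child author" is also an assumption, since the slide only says “parent_id:child_id”.

```python
from collections import Counter, defaultdict

zsets = defaultdict(Counter)  # key -> {member: score}, stand-in for Redis ZSETs
sets = defaultdict(set)       # key -> members, stand-in for Redis SETs


def record_comment(subreddit, thread_id, hour_key, comment):
    """Store one comment under the four key patterns from the slide."""
    # ZSET subreddit:YYYY-MM-DD:HH, members: word [count]
    for word in comment["body"].lower().split():
        zsets[f"{subreddit}:{hour_key}"][word] += 1
    # SET thread_id:comments, members: "parent_id:child_id"
    sets[f"{thread_id}:comments"].add(f"{comment['parent_id']}:{comment['id']}")
    # SET thread_id:users, members: "parent_id:child_id" (assumed to be authors)
    sets[f"{thread_id}:users"].add(
        f"{comment['parent_author']}:{comment['author']}"
    )
    # SET subreddit:threads, members: thread_id
    sets[f"{subreddit}:threads"].add(thread_id)


# Hypothetical comments, for illustration only.
comments = [
    {"id": "c1", "parent_id": "t1", "author": "alice",
     "parent_author": "op", "body": "Graphs are fun"},
    {"id": "c2", "parent_id": "c1", "author": "bob",
     "parent_author": "alice", "body": "fun fun graphs"},
]
for c in comments:
    record_comment("programming", "t1", "2011-06-21:12", c)

top_words = zsets["programming:2011-06-21:12"].most_common(2)
```

Against a real Redis server, the counter increments would correspond to ZINCRBY and the set additions to SADD. Splitting each “parent_id:child_id” member on the colon gives an edge list that can be handed to a graph tool.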
77. Reddit
Go forth and graph!
#exploratorium #osb11
We will hire you.
For reals.
78. You Are Now Leaving the Big Data Exploratorium
Please ensure you have your valuables.
Noah Pepper @noahmp
Devin Chalmers @qwzybug
#exploratorium #osb11