Data wrangling with open
source tools
Tony Hirst
Dept of Communication & Systems
The Open University, UK
Premises
“I take data
from wherever I
can get it”
1
“Appropriate
everything”
2
Conversations
with data
3
Visual
Conversations
with data
3
(Accession Plot)
@mediaczar
If a picture’s worth a
thousand words,
maybe it should take
as long to read?
Most learning
analytics won’t be
performed by
learning analytics
researchers
How can we help
people fashion
their own tools to
support data
conversations?
Recipes
site:open.ac.uk
Have a
conversation
with the data…
Ask the right
questions…
xkcd.com/1138
Sometimes a question
makes most sense in
the context of
questions previously
asked and answers
previously received
DATA
USERS
Educators
Learners
Planners
Marketers
Policymakers
Researchers
Press
NGOs
“DEVELOPERS”
Have
dashboard,
so what?
A tools and
issues
based view
DATA
TOOLS
USERS
PROBLEMS
Example – Google Fusion Tables
Fusion Table
https://www.google.com/fusiontables/DataSource?docid=1VKG7iCbFlsEYJzTuQppf4xoI...
DATA
TOOLS
USERS
PROBLEMS
Access/obtain data
Make sense of data
Ask specific questions of data
Communicate in a data-centric way
Load data
Clean data
Merge/enrich data
DATA
Issues
TOOLS
DATA
Other
TOOLS
Issues
TOOLS
“Tool based
programming”
A barrier to access
(for the tool user) is
data format
JSON XML CSV XLS
TSV
.db
HTML
PDF DOC TXT
GLUE LOGIC (Glue code)
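As a small illustration of glue code between formats, here is a minimal sketch that turns CSV text into JSON using only the Python standard library. The function name `csv_to_json` and the sample rows are hypothetical, made up for this example; they are not from the talk.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Every value stays a string; type conversion is a separate cleaning step.
    return json.dumps(list(reader), indent=2)


# Illustrative data only (hypothetical institutions and counts).
sample = "institution,students\nExample A,100\nExample B,250\n"
print(csv_to_json(sample))
```

The same pattern (read one format into plain rows, write the rows out in another) covers many of the format pairs listed above.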
=importHTML(URL, "table", N)
HTML
QUERYABLE
DATA
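To make the idea behind the spreadsheet formula concrete, the following is a rough sketch of "give me the Nth table on a page as rows of cells", written with Python's standard `html.parser`. It is not how the spreadsheet implements the formula; the class `TableExtractor`, the function `import_html_table` and the sample page are assumptions for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # row currently being built, if any
        self._cell = None  # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html: str, n: int):
    """Return the n-th table on the page, counting from 1 as the formula does."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[n - 1]


# Hypothetical page with two tables; only the second holds the data we want.
page = (
    "<table><tr><td>skip</td></tr></table>"
    "<table><tr><th>Name</th><th>Value</th></tr>"
    "<tr><td>A</td><td>1</td></tr></table>"
)
print(import_html_table(page, 2))
```

Once the table is a list of rows, it can be filtered, sorted or joined like any other data, which is the sense in which the HTML becomes queryable.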
Try it…
Example Page
http://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_the_United_States_by_endowment
http...
Google Spreadsheets as a database
Explorer
https://views.scraperwiki.com/run/google_spreadsheet_query/
http://is.gd/jiMJoh...
=importCSV(URL, N)
HTML
INTERACTIVE
DASHBOARD
Google Charts
Google Chart
Visualization API
https://code.google.com/apis/ajax/playground/
http://is.gd/TTHIUh
Google
Visualisation
API
googleVis
(R)
https://developers.facebook.com/
docs/reference/api/examples/
http://is.gd/7cRnvS
A barrier to access
(for the tool user) is
data shape
A barrier to access
(for the tool user) is
data cleanliness
Questions of
identity
The Open University
Open University
OU
Open Uni
Open University, UK
NORMALISATION/RECONCILIATION
Reconciliation to
a canonical name
and/or to a
unique identifier
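A minimal sketch of reconciliation, assuming a hand-built lookup table: each variant spelling from the slide is normalised and mapped to one canonical name and one identifier. The identifier `ou-uk` and the helper names are hypothetical; real reconciliation services use much larger reference lists.

```python
import re

# Canonical record; the "id" value is a made-up identifier for illustration.
CANONICAL = {"name": "The Open University", "id": "ou-uk"}

# Variant spellings taken from the slide.
VARIANTS = [
    "The Open University",
    "Open University",
    "OU",
    "Open Uni",
    "Open University, UK",
]


def normalise(label: str) -> str:
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    label = re.sub(r"[^\w\s]", " ", label.lower())
    return " ".join(label.split())


# Lookup from normalised variant to the canonical record.
LOOKUP = {normalise(v): CANONICAL for v in VARIANTS}


def reconcile(label: str):
    """Return the canonical record for a label, or None if it is unknown."""
    return LOOKUP.get(normalise(label))


for raw in ["open university, uk", "  OU ", "Some Other Place"]:
    print(repr(raw), "->", reconcile(raw))
```

Labels that return `None` are the ones that still need a human decision or a fuzzier matching step.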
A stumbling block
(for the data user)
is data enrichment
A stumbling block
(for the data user)
is joining datasets
A stumbling block
(for the data user)
is joining partially
matched data
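One possible way to join partially matched data is approximate string matching on the join key. The sketch below uses `difflib.get_close_matches` from the Python standard library; the function `fuzzy_join`, the `cutoff` value of 0.8 and the sample records are assumptions chosen for illustration, not a recommendation from the talk.

```python
import difflib


def fuzzy_join(left: dict, right: dict, cutoff: float = 0.8):
    """Join two {name: value} dicts on approximately matching names.

    Returns (left_name, matched_right_name_or_None, left_value,
    right_value_or_None) tuples, one per entry in `left`.
    """
    joined = []
    right_keys = list(right)
    for name, left_value in left.items():
        match = difflib.get_close_matches(name, right_keys, n=1, cutoff=cutoff)
        if match:
            joined.append((name, match[0], left_value, right[match[0]]))
        else:
            joined.append((name, None, left_value, None))
    return joined


# Hypothetical datasets whose keys do not match exactly.
left = {"Open University": 1, "Univ. of Nowhere": 2}
right = {"The Open University": "a", "Somewhere College": "b"}

for row in fuzzy_join(left, right):
    print(row)
```

Keeping the unmatched rows (with `None`) rather than dropping them makes it easy to see what the join missed; a lower cutoff finds more matches but also more false ones.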
Rolling your own
interactive data
exploration tools
R Shiny
Apps
ui.R server.R
RCharts
Many chart tools
do the work for
you if the data is
in the right shape
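What "the right shape" means varies by tool, but a common case is converting a wide table (one column per measure) into a long one (one row per observation). A minimal sketch in plain Python, with hypothetical field names and values:

```python
def wide_to_long(rows, id_field, var_name="variable", value_name="value"):
    """Reshape wide rows (one column per measure) into long rows."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key != id_field:
                long_rows.append(
                    {id_field: row[id_field], var_name: key, value_name: value}
                )
    return long_rows


# Illustrative wide data: one row per course, one column per year.
wide = [
    {"course": "A", "2011": 10, "2012": 12},
    {"course": "B", "2011": 7, "2012": 9},
]

for row in wide_to_long(wide, "course", var_name="year", value_name="students"):
    print(row)
```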
DATA
TOOLS
USERS PROBLEMS
Just ask…
ask.SchoolOfData.org
blog.ouseful.info
@psychemedia
Lasi datawrangling
Published in: Technology, Education
  • I am not a journalist, but it seems to me that a large part of your work, and indeed a large part of the work of a scientist or an analyst, is in asking the right questions of a source, and knowing how to frame those questions. The data journalist knows how to ask questions of data.
  • Also – high incidence of crime around police stations (no location, so police station used as default location); Russell Square as a murder hotspot.
  • Another nice example of this, and one used by many advocates of data visualisation, is the famous example of Anscombe’s quartet: four sets of two-dimensional data with some interesting properties.
  • For example, many of the “classic” summary statistics for the corresponding columns in these data sets are to all intents and purposes the same.
  • But when we look at the datasets as a set of scatterplots, we see how the data tells very different stories.
  • People learn the skills they need, as they need them.
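
The point in the notes above about Anscombe's quartet can be checked directly. The sketch below computes the mean and sample variance of each of the four datasets with Python's standard `statistics` module; the values are the quartet as usually published, typed in here by hand, so treat them as an assumption to verify against a trusted copy.

```python
import statistics

# Anscombe's quartet; datasets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(xs, ys):
    """Return the classic summary statistics for one dataset."""
    return {
        "mean_x": statistics.mean(xs),
        "var_x": statistics.variance(xs),
        "mean_y": statistics.mean(ys),
        "var_y": statistics.variance(ys),
    }


for name, (xs, ys) in quartet.items():
    stats = summarise(xs, ys)
    print(name, {k: round(v, 2) for k, v in stats.items()})
```

The summaries come out nearly identical across the four datasets, which is why plotting them as scatterplots is needed to see how different they are.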

    1. Data wrangling with open source tools Tony Hirst Dept of Communication & Systems The Open University, UK
    2. Premises
    3. “I take data from wherever I can get it” 1
    4. “Appropriate everything” 2
    5. Conversations with data 3
    6. Visual Conversations with data 3
    7. (Accession Plot) @mediaczar
    8. If a picture’s worth a thousand words, maybe it should take as long to read?
    9. Most learning analytics won’t be performed by learning analytics researchers
    10. How can we help people fashion their own tools to support data conversations?
    11. Recipes
    12. site:open.ac.uk
    13. Have a conversation with the data…
    14. Ask the right questions…
    15. xkcd.com/1138
    16. Sometimes a question makes most sense in the context of questions previously asked and answers previously received
    17. DATA USERS Educators Learners Planners Marketers Policymakers Researchers Press NGOs “DEVELOPERS”
    18. Have dashboard, so what?
    19. A tools and issues based view
    20. DATA TOOLS USERS PROBLEMS
    21. Example – Google Fusion Tables Fusion Table https://www.google.com/fusiontables/DataSource?docid=1VKG7iCbFlsEYJzTuQppf4xoIqq1ABxWTdW6O_7o#rows:id=1 http://is.gd/qhuaoA Walkthrough http://blog.ouseful.info/2012/11/16/a-quick-look-at-gcsealevel-certificate-awards-market-share-by-examination-board/ http://is.gd/f9YAbG
    22. DATA TOOLS USERS PROBLEMS
    23. Access/obtain data Make sense of data Ask specific questions of data Communicate in a data-centric way
    24. Load data Clean data Merge/enrich data
    25. DATA Issues TOOLS
    26. DATA Other TOOLS Issues TOOLS
    27. “Tool based programming”
    28. A barrier to access (for the tool user) is data format
    29. JSON XML CSV XLS TSV .db HTML PDF DOC TXT
    30. GLUE LOGIC (Glue code)
    31. =importHTML(URL, "table", N) HTML QUERYABLE DATA
    32. Try it… Example Page http://en.wikipedia.org/wiki/List_of_colleges_and_universities_in_the_United_States_by_endowment http://is.gd/7Vbg6n
    33. Google Spreadsheets as a database Explorer https://views.scraperwiki.com/run/google_spreadsheet_query/ http://is.gd/jiMJoh Walkthrough http://schoolofdata.org/2013/05/24/asking-questions-of-data-garment-factories-data-expedition/ http://is.gd/qJHihu
    34. =importCSV(URL, N) HTML INTERACTIVE DASHBOARD Google Charts
    35. Google Chart Visualization API https://code.google.com/apis/ajax/playground/ http://is.gd/TTHIUh
    36. Google Visualisation API
    37. googleVis (R)
    38. https://developers.facebook.com/ docs/reference/api/examples/ http://is.gd/7cRnvS
    39. A barrier to access (for the tool user) is data shape
    40. A barrier to access (for the tool user) is data cleanliness
    41. Questions of identity
    42. The Open University Open University OU Open Uni Open University, UK NORMALISATION/RECONCILIATION
    43. Reconciliation to a canonical name and/or to a unique identifier
    44. A stumbling block (for the data user) is data enrichment
    45. A stumbling block (for the data user) is joining datasets
    46. A stumbling block (for the data user) is joining partially matched data
    47. Rolling your own interactive data exploration tools
    48. R Shiny Apps
    49. ui.R server.R
    50. RCharts
    51. Many chart tools do the work for you if the data is in the right shape
    52. DATA TOOLS USERS PROBLEMS
    53. Just ask… ask.SchoolOfData.org
    54. blog.ouseful.info @psychemedia