OpenRefine - Data Science Training for Librarians
 

Usage Rights

CC Attribution-NonCommercial-ShareAlike License


    OpenRefine - Data Science Training for Librarians: Presentation Transcript

    • Data Scientist Training for Librarians
        Harvard College Observatory
        March 28, 2013
        Tom Morris @tfmorris
    • Who am I?
        • Independent software engineering & product management consultant
        • Developer on open source OpenRefine project
        • Curious data geek
        • Contact:
            – Twitter: @tfmorris
            – Email: tfmorris@gmail.com
    • Data Analysis Lifecycle
        ● Find / Extract
        ● Prepare
            – Characterize
            – Clean
            – Integrate / Extend
        ● Analyze
        ● Visualize / Report
    • Provenance
        ● Provenance is key! Both before and after you get data
        ● Record source (e.g. download URL) and date
        ● Unix command line
            – build up a repeatable transformation pipeline script
            – use make to keep from having to repeat steps
        ● OpenRefine maintains an undo history (but...)
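A minimal sketch of the "record source and date" advice: write a small JSON sidecar file next to each downloaded data file. The function name `write_provenance`, the `.provenance.json` suffix, and the field names are illustrative assumptions, not an OpenRefine feature or any standard.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_provenance(data_path, source_url):
    """Write a JSON sidecar recording where and when a data file was obtained."""
    data_path = Path(data_path)
    record = {
        # Where the file came from (e.g. the download URL)
        "source_url": source_url,
        # When this record was written, in UTC
        "recorded_utc": datetime.now(timezone.utc).isoformat(),
        # Checksum and size, so later changes to the file can be detected
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "size_bytes": data_path.stat().st_size,
    }
    sidecar = data_path.with_name(data_path.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

Calling `write_provenance("records.csv", "http://example.org/records.csv")` right after a download (both names are placeholders) leaves a `records.csv.provenance.json` file that can be kept alongside the untouched source file.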
    • Irreversible Transforms
        ● Be careful of anything which isn't reversible
        ● Keep source files and plan a recovery strategy
        ● Common gotchas:
            – Character encoding: can't replace substitution character with its original value
            – Leading 0s on identifiers
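Both gotchas can be reproduced in a few lines. This is an illustrative sketch, assuming Latin-1 bytes that are mistakenly decoded as UTF-8; the helper name `decode_lossy` and the sample values are made up for the example.

```python
def decode_lossy(raw):
    """Decode bytes as UTF-8, substituting U+FFFD for anything undecodable."""
    return raw.decode("utf-8", errors="replace")


# Gotcha 1: two different Latin-1 strings collapse to the same text,
# because each bad byte becomes the same substitution character.
a = decode_lossy("café".encode("latin-1"))
b = decode_lossy("cafè".encode("latin-1"))
collapsed = (a == b)  # the original accents can no longer be told apart

# Gotcha 2: treating an identifier as a number drops its leading zero.
zip_code = "02138"
round_tripped = str(int(zip_code))  # no longer equal to zip_code
```

Once either transform has been saved over the data, only the original source file can restore the lost information.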
    • Provenance projects
        ● Stanford Panda (Provenance and Data) - http://infolab.stanford.edu/panda/
        ● Open Provenance Model - http://openprovenance.org/
        ● Both focus on bi-directional traceability
    • Tools vs Scale
        ● Editor with macro facility: emacs, vim
        ● Spreadsheet: Excel, OO Calc
        ● OpenRefine
        ● Unix shell commands
            – awk, sed, grep, cut, sort, head, tail
        ● “Real” programming
            – Python, Ruby, Java
        ● Map-Reduce
    • Regular Expressions
        ● Useful in so many contexts
        ● A little confusing to learn, but
        ● Absolutely worth the effort!
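As one small illustration of why regular expressions are worth learning, the sketch below pulls a four-digit year out of messy free-text date fields. The pattern, the 1500-2099 range, and the sample strings are assumptions chosen for the example, not taken from the talk.

```python
import re

# A year from 1500 to 2099 that is not part of a longer run of digits.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d\d|20\d\d)(?!\d)")


def first_year(text):
    """Return the first plausible year found in text, or None."""
    match = YEAR.search(text)
    return int(match.group(1)) if match else None


samples = ["c1923.", "[1897?]", "London, 2004", "n.d."]
years = [first_year(s) for s in samples]
```

The same pattern can usually be carried over, with minor syntax changes, to grep, sed, or an OpenRefine text filter.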
    • OpenRefine
        ● Power tool for working with messy data
        ● Free and open source
        ● Desktop based (data stays private)
        ● Faceted browsing interface
        ● Lots of input & output formats
        ● Powerful transformations
        ● Useful for analysis & web scraping/APIs too
    • OpenRefine Data Formats
        ● CSV/TSV/separator based
        ● Fixed width field
        ● JSON & XML
        ● Excel & OpenOffice Calc
        ● Google Spreadsheets & Fusion Tables
        ● RDF
        ● URLs & zip files too!
    • Data Characterization
        ● Coded vs free-form fields
        ● Distribution of values
            – Missing values: skip, impute, ...
            – Outliers: cause? Can they be rescaled?
        ● Delimiters & escaping (e.g. HTML, XML)
        ● Formatting problems
        ● Character encoding issues?
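A rough sketch of characterizing one column outside OpenRefine: count missing values and look at the distribution of the rest, much as a text facet does. The function name `profile_column` and the tiny inline CSV are assumptions made up for illustration.

```python
import csv
import io
from collections import Counter


def profile_column(rows, column):
    """Summarize one column: total rows, blanks, distinct values, most common values."""
    values = [(row.get(column) or "").strip() for row in rows]
    counts = Counter(v for v in values if v)
    return {
        "total": len(values),
        "missing": sum(1 for v in values if not v),
        "distinct": len(counts),
        "top": counts.most_common(5),
    }


# Hypothetical sample data: a coded field with inconsistent case and a blank.
sample = io.StringIO("title,language\nA,eng\nB,ENG\nC,\nD,fre\nE,eng\n")
rows = list(csv.DictReader(sample))
report = profile_column(rows, "language")
```

A coded field should show a small number of distinct values; variants such as `eng` and `ENG` appearing separately point to a cleanup step.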
    • Hands-on
        ● Let's play with some data!
        ● http://code.google.com/p/google-refine/
    • Export
        ● OpenRefine exports most import formats: Excel, CSV, TSV, OpenOffice, Google Spreadsheets, Fusion Tables, JSON, RDF
        ● Template-based exporter for everything else: custom JSON formats, etc.
    • Scaling Up
        ● Experiment with a (representative) sample of your data
        ● Reuse regexes, filters, etc. with more heavy-duty tools: awk, sed, Map-Reduce
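One way to get a sample from a file too large to load, sketched under the assumption that a uniform random sample of lines is representative enough: reservoir sampling reads the input once and keeps only `k` lines in memory. The function name and parameters are illustrative.

```python
import random


def reservoir_sample(lines, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Keep item i with probability k / (i + 1), evicting a random earlier pick.
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample
```

For example, `reservoir_sample(open("big.csv", encoding="utf-8"), 1000)` (the file name is a placeholder) yields 1000 lines to load into OpenRefine; the header row would need to be handled separately.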
    • Resources
        ● Berkeley Data Science course http://datascienc.es/schedule/
            – week 2 - Data Preparation has good R examples http://berkeleydatascience.files.wordpress.com/2012/02/2012
        ● Mike Loukides "Data Hand Tools" http://radar.oreilly.com/2011/04/data-hand-tools
        ● Jeremy Howard "Getting in shape for the sport of Data Science" http://media.kaggle.com/MelbURN.html
    • More resources
        ● MIT IAP Data Science course materials
            – http://dataiap.github.com/dataiap/
        ● Quora
            – http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public
            – http://www.quora.com/What-are-some-good-methods-for-data-pre-processing-in-machine-learning
        ● OKFN School of Data handbook
            – http://handbook.schoolofdata.org
        ● Hilary Mason
            – http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
    • Resources mentioned
        ● Harvard Business Review competition at Kaggle
            – Competition ends 8/27/2012 4:00 AM UTC!
            – https://www.kaggle.com/c/harvard-business-review-vision-statement-prospect
        ● Stanford Data Wrangler
            – http://vis.stanford.edu/wrangler/
    • Thanks!
        • Questions now?
        • Questions later:
            – Twitter: @tfmorris
            – Email: tfmorris@gmail.com