OpenRefine - Data Science Training for Librarians

  1. Data Scientist Training for Librarians. Harvard College Observatory, March 28, 2013. Tom Morris, @tfmorris
  2. Who am I? • Independent software engineering & product management consultant • Developer on the open source OpenRefine project • Curious data geek • Contact: – Twitter: @tfmorris – Email: tfmorris@gmail.com
  3. Data Analysis Lifecycle ● Find / Extract ● Prepare – Characterize – Clean – Integrate / Extend ● Analyze ● Visualize / Report
  4. Provenance ● Provenance is key, both before and after you get the data ● Record the source (e.g. download URL) and date ● Unix command line – build up a repeatable transformation pipeline script – use make to avoid repeating steps ● OpenRefine maintains an undo history (but...)
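
Recording the source and date can be as small as writing a sidecar record next to each downloaded file. The sketch below is one possible shape for that record, not something from the talk: the function name, the field names, the example URL, and the added SHA-256 checksum are all assumptions made for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(source_url, data, retrieved_at=None):
    """Describe where a blob of downloaded bytes came from and when.

    The checksum is an addition of this sketch: it lets you confirm later
    that the file on disk is still the one that was downloaded.
    """
    if retrieved_at is None:
        retrieved_at = datetime.now(timezone.utc)
    return {
        "source_url": source_url,
        "retrieved_at": retrieved_at.isoformat(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
    }


if __name__ == "__main__":
    # Hypothetical URL and file contents, for illustration only.
    record = provenance_record(
        "https://example.org/catalog.csv",
        b"id,title\n001,Example\n",
    )
    print(json.dumps(record, indent=2))
```

Saving the printed JSON alongside the raw download keeps the "before" half of the provenance trail; the pipeline script or the OpenRefine history covers the "after" half.
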
  5. Irreversible Transforms ● Be careful of anything which isn't reversible ● Keep source files and plan a recovery strategy ● Common gotchas: – Character encoding: can't replace a substitution character with its original value – Leading 0s on identifiers
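
Both gotchas are easy to reproduce. The short Python sketch below uses made-up values (the identifier and the word being encoded are illustrative, not from the talk) to show why neither transform can be undone without the source file.

```python
# Leading zeros: treating an identifier as a number silently drops them.
identifier = "02138"              # illustrative value, e.g. a ZIP code
as_number = int(identifier)       # 2138
round_tripped = str(as_number)    # "2138": the original width is lost

# Character encoding: decoding Latin-1 bytes as UTF-8 with errors="replace"
# substitutes U+FFFD for the byte that could not be decoded.
raw = "caf\u00e9".encode("latin-1")               # b'caf\xe9'
decoded = raw.decode("utf-8", errors="replace")   # 'caf\ufffd'
re_encoded = decoded.encode("utf-8")              # not the original bytes

print(round_tripped, repr(decoded), re_encoded == raw)
```

Once the substitution character is in the data, nothing records which byte it replaced, so the only recovery path is to re-read the original file with the right encoding.
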
  6. Provenance projects ● Stanford Panda (Provenance and Data) - http://infolab.stanford.edu/panda/ ● Open Provenance Model - http://openprovenance.org/ ● Both focus on bi-directional traceability
  7. Tools vs Scale ● Editor with macro facility: emacs, vim ● Spreadsheet: Excel, OO Calc ● OpenRefine ● Unix shell commands – awk, sed, grep, cut, sort, head, tail ● “Real” programming – Python, Ruby, Java ● Map-Reduce
  8. Regular Expressions ● Useful in so many contexts ● A little confusing to learn, but ● Absolutely worth the effort!
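
As one example of the kind of context meant here, a single pattern can pull a four-digit year out of messy free-text date fields. This is a minimal sketch in Python; the function name, the year range (1500 to 2099), and the sample strings are assumptions chosen for illustration.

```python
import re

# A four-digit year from 1500 to 2099 that is not part of a longer number.
# Lookarounds are used instead of \b so that "c1998" still matches.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def extract_year(text):
    """Return the first plausible year in text as an int, or None."""
    match = YEAR.search(text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical publication-date strings.
    for sample in ["c1998.", "[2001?]", "n.d.", "London, 1876-1880"]:
        print(sample, "->", extract_year(sample))
```

The same pattern text can be carried over to other regex-capable tools, though lookaround support varies between regex dialects.
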
  9. OpenRefine ● Power tool for working with messy data ● Free and open source ● Desktop based (data stays private) ● Faceted browsing interface ● Lots of input & output formats ● Powerful transformations ● Useful for analysis & web scraping/APIs too
  10. OpenRefine Data Formats ● CSV/TSV/separator based ● Fixed width field ● JSON & XML ● Excel & OpenOffice Calc ● Google Spreadsheets & Fusion Tables ● RDF ● URLs & zip files too!
  11. Data Characterization ● Coded vs free-form fields ● Distribution of values – Missing values: skip, impute, ... – Outliers: cause? Can they be rescaled? ● Delimiters & escaping (e.g. HTML, XML) ● Formatting problems ● Character encoding issues?
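
Outside OpenRefine, a first pass at these questions fits in a few lines of Python. The sketch below counts missing and distinct values in a column and flags numeric outliers with the common 1.5 × IQR rule; that rule, the function names, and the sample data are choices of this sketch rather than anything prescribed in the talk.

```python
import statistics
from collections import Counter


def characterize(values):
    """Summarize a column: how many values are missing, how the rest are distributed."""
    cleaned = [str(v).strip() for v in values if v is not None]
    present = [v for v in cleaned if v != ""]
    counts = Counter(present)
    return {
        "total": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(3),
    }


def iqr_outliers(numbers, k=1.5):
    """Return values more than k * IQR outside the first and third quartiles.

    Uses statistics.quantiles, which requires Python 3.8 or later.
    """
    q1, _, q3 = statistics.quantiles(numbers, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [x for x in numbers if x < low or x > high]


if __name__ == "__main__":
    # Hypothetical column values.
    print(characterize(["a", "b", "a", "", None, " a "]))
    print(iqr_outliers([10, 12, 11, 13, 12, 11, 250]))
```

A flagged value is only a prompt to ask about its cause; whether to drop, rescale, or keep it depends on the data.
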
  12. Hands-on ● Let's play with some data! ● http://code.google.com/p/google-refine/
  13. Export ● OpenRefine exports most import formats: Excel, CSV, TSV, OpenOffice, Google Spreadsheets, Fusion Tables, JSON, RDF ● Template-based exporter for everything else: custom JSON formats, etc.
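
The idea behind a template-based export is a per-row template wrapped in a prefix and suffix and joined by a separator. The Python sketch below is an analogy only: it uses the standard library's string.Template rather than OpenRefine's own template syntax, and the field names and rows are hypothetical.

```python
import json
from string import Template

# One row of output; each placeholder is filled with a JSON-encoded value.
ROW_TEMPLATE = Template('{"id": $id, "label": $title}')


def render(rows, prefix='{"records": [', separator=", ", suffix="]}"):
    """Render rows into a custom JSON layout from prefix, row template, and suffix."""
    parts = [
        ROW_TEMPLATE.substitute(
            id=json.dumps(row["id"]),        # json.dumps handles quoting and escaping
            title=json.dumps(row["title"]),
        )
        for row in rows
    ]
    return prefix + separator.join(parts) + suffix


if __name__ == "__main__":
    # Hypothetical rows.
    rows = [
        {"id": "001", "title": 'He said "hi"'},
        {"id": "002", "title": "Plain title"},
    ]
    print(render(rows))
```

Encoding each value before substitution is what keeps quotes and other special characters from breaking the output format.
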
  14. Scaling Up ● Experiment with a (representative) sample of your data ● Reuse regexes, filters, etc. with more heavy-duty tools – awk, sed, Map-Reduce
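
One way to draw a sample from a file too large to load is reservoir sampling, which makes a single pass and keeps a fixed number of lines. The talk does not name a sampling method; this technique, the function name, and the file name in the usage example are assumptions of the sketch. A uniform random sample is also only representative in the statistical sense, so rare categories may still need to be sampled separately.

```python
import random


def reservoir_sample(lines, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, line in enumerate(lines):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = line
    return sample


if __name__ == "__main__":
    # Stand-in for iterating over the lines of a large file, e.g.
    #   with open("big.csv") as f: reservoir_sample(f, 1000)
    print(reservoir_sample(range(1_000_000), 5, seed=42))
```

The resulting sample is small enough to explore interactively before the same filters and patterns are applied to the full dataset.
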
  15. Resources ● Berkeley Data Science course http://datascienc.es/schedule/ – week 2 - Data Preparation has good R examples http://berkeleydatascience.files.wordpress.com/2012/02/2012 ● Mike Loukides "Data Hand Tools" http://radar.oreilly.com/2011/04/data-hand-tools ● Jeremy Howard "Getting in shape for the sport of Data Science" http://media.kaggle.com/MelbURN.html
  16. More resources • MIT IAP Data Science course materials – http://dataiap.github.com/dataiap/ • Quora – http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public – http://www.quora.com/What-are-some-good-methods-for-data-pre-processing-in-machine-learning • OKFN School of Data handbook – http://handbook.schoolofdata.org • Hilary Mason – http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
  17. Resources mentioned ● Harvard Business Review competition at Kaggle – Competition ends 8/27/2012 4:00 AM UTC! – https://www.kaggle.com/c/harvard-business-review-vision-statement-prospect ● Stanford Data Wrangler – http://vis.stanford.edu/wrangler/
  18. Thanks! • Questions now? • Questions later: – Twitter: @tfmorris – Email: tfmorris@gmail.com
