OpenRefine - Data Science Training for Librarians

  1. Data Scientist Training for Librarians
     Harvard College Observatory
     March 28, 2013
     Tom Morris @tfmorris
  2. Who am I?
     • Independent software engineering & product management consultant
     • Developer on open source OpenRefine project
     • Curious data geek
     • Contact:
       – Twitter: @tfmorris
       – Email: tfmorris@gmail.com
  3. Data Analysis Lifecycle
     ● Find / Extract
     ● Prepare
       – Characterize
       – Clean
       – Integrate / Extend
     ● Analyze
     ● Visualize / Report
  4. Provenance
     ● Provenance is key, both before and after you get the data
     ● Record the source (e.g. download URL) and the date
     ● Unix command line
       – build up a repeatable transformation pipeline script
       – use make to avoid repeating steps
     ● OpenRefine maintains an undo history (but...)
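One way to record source and date is to write a small provenance record next to every downloaded file. The following is a minimal Python sketch, not something shown in the talk; the function name `provenance_record`, the field names, and the example URL are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(source_url, data, retrieved_at=None):
    """Describe where downloaded bytes came from and what they contained.

    The field names are illustrative assumptions, not a standard schema.
    """
    if retrieved_at is None:
        retrieved_at = datetime.now(timezone.utc)
    return {
        "source_url": source_url,
        "retrieved_at": retrieved_at.isoformat(),
        # A checksum lets you verify later that the source file is unchanged.
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
    }


if __name__ == "__main__":
    # Hypothetical URL and payload, used only for illustration.
    raw = b"id,name\n00123,Alice\n"
    record = provenance_record("http://example.org/data.csv", raw)
    print(json.dumps(record, indent=2))
```

Saving the record as a JSON file beside the untouched source file keeps the "before" half of the provenance trail; the pipeline script covers the "after" half.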
  5. Irreversible Transforms
     ● Be careful of anything which isn't reversible
     ● Keep source files and plan a recovery strategy
     ● Common gotchas:
       – Character encoding: a substitution character can't be converted back to its original value
       – Leading 0s on identifiers
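Both gotchas are easy to reproduce. The sketch below is an illustration in Python with made-up sample values: it decodes Latin-1 bytes as UTF-8 with replacement, and it parses an identifier as a number. In both cases the original value cannot be recovered from the result.

```python
def lossy_decode(raw):
    """Decode bytes as UTF-8, substituting U+FFFD for invalid bytes."""
    return raw.decode("utf-8", errors="replace")


def id_as_number(identifier):
    """Parse an identifier as an integer, which silently drops leading zeros."""
    return int(identifier)


if __name__ == "__main__":
    # "café" stored as Latin-1 but read as if it were UTF-8.
    original = "café".encode("latin-1")
    decoded = lossy_decode(original)
    # The replacement character carries no trace of the byte it replaced.
    print(repr(decoded), decoded.encode("utf-8") == original)

    # An identifier with leading zeros, treated as a number.
    print(str(id_as_number("00123")) == "00123")
```

Keeping identifiers as text and decoding with the correct encoding (or failing loudly instead of replacing) avoids both losses.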
  6. Provenance projects
     ● Stanford Panda (Provenance and Data) - http://infolab.stanford.edu/panda/
     ● Open Provenance Model - http://openprovenance.org/
     ● Both focus on bi-directional traceability
  7. Tools vs Scale
     ● Editor with macro facility: emacs, vim
     ● Spreadsheet: Excel, OO Calc
     ● OpenRefine
     ● Unix shell commands
       – awk, sed, grep, cut, sort, head, tail
     ● “Real” programming
       – Python, Ruby, Java
     ● Map-Reduce
  8. Regular Expressions
     ● Useful in so many contexts
     ● A little confusing to learn, but
     ● Absolutely worth the effort!
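As a small taste of what regular expressions buy you, here is a Python sketch that collapses stray whitespace and pulls a four-digit year out of free-form date text. The patterns, function names, and sample strings are assumptions chosen for illustration; they are not from the talk.

```python
import re

# A four-digit year from 1500 to 2099 that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")
# Any run of spaces, tabs, or newlines.
WHITESPACE = re.compile(r"\s+")


def normalize_whitespace(text):
    """Collapse whitespace runs to one space and trim both ends."""
    return WHITESPACE.sub(" ", text).strip()


def first_year(text):
    """Return the first plausible year in the text, or None if there is none."""
    match = YEAR.search(text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    for value in ["c1893.", "[1901?]", "n.d.", "  Boston :   1922 "]:
        print(repr(normalize_whitespace(value)), first_year(value))
```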
  9. OpenRefine
     ● Power tool for working with messy data
     ● Free and open source
     ● Desktop based (data stays private)
     ● Faceted browsing interface
     ● Lots of input & output formats
     ● Powerful transformations
     ● Useful for analysis & web scraping/APIs too
  10. OpenRefine Data Formats
      ● CSV/TSV/separator based
      ● Fixed width field
      ● JSON & XML
      ● Excel & OpenOffice Calc
      ● Google Spreadsheets & Fusion Tables
      ● RDF
      ● URLs & zip files too!
  11. Data Characterization
      ● Coded vs free-form fields
      ● Distribution of values
        – Missing values – skip, impute, ...
        – Outliers – cause? Can they be rescaled?
      ● Delimiters & escaping (e.g. HTML, XML)
      ● Formatting problems
      ● Character encoding issues?
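A first pass at characterizing a single column can be sketched in a few lines of Python: count the missing values, then look at the distribution of what is left. The function name `characterize`, the list of missing-value markers, and the sample column are hypothetical choices for this illustration.

```python
from collections import Counter


def characterize(values, missing=("", "NA", "N/A", "null")):
    """Summarize one column: size, missing count, and most frequent values.

    Which strings count as "missing" is an assumption; adjust per dataset.
    """
    cleaned = [v.strip() for v in values]
    present = [v for v in cleaned if v not in missing]
    counts = Counter(present)
    return {
        "total": len(cleaned),
        "missing": len(cleaned) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(3),
    }


if __name__ == "__main__":
    # Made-up language codes with typical problems: padding, case, blanks.
    column = ["eng", "eng ", "fre", "", "NA", "ENG"]
    print(characterize(column))
```

A short list of distinct values suggests a coded field, while many near-duplicates such as "eng" and "ENG" point to formatting problems worth cleaning before analysis.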
  12. Hands-on
      ● Let's play with some data!
      ● http://code.google.com/p/google-refine/
  13. Export
      ● OpenRefine exports most import formats: Excel, CSV, TSV, OpenOffice, Google Spreadsheets, Fusion Tables, JSON, RDF
      ● Template-based exporter for everything else: custom JSON formats, etc.
  14. Scaling Up
      ● Experiment with a (representative) sample of your data
      ● Reuse regexes, filters, etc. with more heavy-duty tools
        – awk, sed, Map-Reduce
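Taking the first N rows of a file is rarely representative, because files are often sorted. One common alternative, named here as a swapped-in technique the talk does not mention, is reservoir sampling, which draws a uniform random sample in a single pass without loading the whole file. A minimal Python sketch, with a hypothetical function name and parameters:

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of any length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


if __name__ == "__main__":
    # Stand-in for lines streamed from a large file.
    rows = (f"row-{n}" for n in range(100000))
    print(reservoir_sample(rows, k=5, seed=42))
```

The sample can then be loaded into OpenRefine for experimentation before the same regexes and filters are run over the full data.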
  15. Resources
      ● Berkeley Data Science course http://datascienc.es/schedule/
        – week 2 - Data Preparation has good R examples http://berkeleydatascience.files.wordpress.com/2012/02/2012
      ● Mike Loukides, "Data Hand Tools" http://radar.oreilly.com/2011/04/data-hand-tools
      ● Jeremy Howard, "Getting in shape for the sport of Data Science" http://media.kaggle.com/MelbURN.html
  16. More resources
      ● MIT IAP Data Science course materials
        – http://dataiap.github.com/dataiap/
      ● Quora
        – http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public
        – http://www.quora.com/What-are-some-good-methods-for-data-pre-processing-in-machine-learning
      ● OKFN School of Data handbook
        – http://handbook.schoolofdata.org
      ● Hilary Mason
        – http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
  17. Resources mentioned
      ● Harvard Business Review competition at Kaggle
        – Competition ends 8/27/2012 4:00 AM UTC !
        – https://www.kaggle.com/c/harvard-business-review-vision-statement-prospect
      ● Stanford Data Wrangler
        – http://vis.stanford.edu/wrangler/
  18. Thanks!
      • Questions now?
      • Questions later:
        – Twitter: @tfmorris
        – Email: tfmorris@gmail.com