Data Scientist Training for Librarians
   Harvard College Observatory
          March 28, 2013

          Tom Morris
           @tfmorris
Who am I?
•
    Independent software engineering & product
    management consultant
•
    Developer on open source OpenRefine project
•
    Curious data geek
•
    Contact:
     –   Twitter: @tfmorris
     –   Email: tfmorris@gmail.com

2013-03-28           Tom Morris @tfmorris         2
Data Analysis Lifecycle
 ●
     Find / Extract
 ●
     Prepare
     –   Characterize
     –   Clean
     –   Integrate / Extend
 ●
     Analyze
 ●
     Visualize / Report

Provenance
 ●
     Provenance is key! - both before and after
     you get data
 ●
     Record source (e.g. download URL) and date
 ●
     Unix command line
     –   build up a repeatable transformation pipeline script
     –   use make to avoid having to repeat steps

 ●
     OpenRefine maintains an undo history (but...)
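
As a minimal sketch of "record source and date", the Python below builds a small provenance record for a downloaded file. The function name provenance_record, the field names, and the example.org URL are hypothetical choices for illustration, not part of any tool named on this slide. The checksum is an extra assumption: it lets you confirm later that the file you kept is the file you fetched.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(source_url: str, content: bytes) -> dict:
    """Describe where a downloaded file came from and when it was fetched."""
    return {
        "source_url": source_url,
        # Timestamp in UTC so the record is unambiguous across time zones.
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
        # Checksum of the untouched source bytes (an added assumption).
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
    }


if __name__ == "__main__":
    # Hypothetical URL and payload, for illustration only.
    data = b"id,name\n001,Ada\n"
    record = provenance_record("http://example.org/data.csv", data)
    print(json.dumps(record, indent=2))
```

Saving a record like this next to the untouched source file is one way to keep the "before you get data" half of provenance.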


Irreversible Transforms
 ●
     Be careful of anything which isn't reversible
 ●
     Keep source files and plan recovery strategy
 ●
     Common gotchas:
     –   Character encoding – can't replace
         substitution character with its original
         value
     –   Leading 0s on identifiers
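
The two gotchas above can be reproduced in a few lines of Python. This is a sketch with made-up sample values; decode_lossy and id_as_number are hypothetical helper names used only here.

```python
def decode_lossy(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="replace")


def id_as_number(identifier: str) -> str:
    """Round-trip an identifier through an integer, as a spreadsheet might."""
    return str(int(identifier))


if __name__ == "__main__":
    # Two different Latin-1 characters, wrongly read as UTF-8.
    e_acute = "é".encode("latin-1")
    e_diaeresis = "ë".encode("latin-1")
    # Both become the same substitution character, so the original is gone.
    print(decode_lossy(e_acute) == decode_lossy(e_diaeresis))

    # The leading zeros cannot be recovered from the number alone.
    print(id_as_number("00501"))
```

Neither result can be mapped back to its input, which is why the untouched source file is worth keeping.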

Provenance projects
 ●
     Stanford Panda (Provenance and Data) -
     http://infolab.stanford.edu/panda/
 ●
     Open Provenance Model -
     http://openprovenance.org/
 ●
     Both focus on bi-directional traceability




Tools vs Scale
 ●
     Editor with macro facility: emacs, vim
 ●
     Spreadsheet: Excel, OO Calc
 ●
     OpenRefine
 ●
     Unix shell commands – awk, sed, grep, cut,
     sort, head, tail
 ●
     “Real” programming – Python, Ruby, Java
 ●
     Map-Reduce

Regular Expressions
 ●
     Useful in so many contexts
 ●
     A little confusing to learn, but
 ●
     Absolutely worth the effort!
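
One small illustration of why the effort pays off: a single pattern can pull a year out of inconsistently written date fields. The sample strings, the year range (1500 to 2099), and the helper name first_year are assumptions chosen for this sketch.

```python
import re
from typing import Optional

# A four-digit year from 1500 to 2099 that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def first_year(text: str) -> Optional[int]:
    """Return the first plausible year found in text, or None if there is none."""
    match = YEAR.search(text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Made-up examples of messy date fields.
    for field in ["c1923.", "[1887?]", "1901-1905", "n.d."]:
        print(repr(field), first_year(field))
```

Regular expression syntax differs slightly between tools, so a pattern like this may need small adjustments before it is used outside Python.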




OpenRefine
 ●
     Power tool for working with messy data
 ●
     Free and open source
 ●
     Desktop based (data stays private)
 ●
     Faceted browsing interface
 ●
     Lots of input & output formats
 ●
     Powerful transformations
 ●
     Useful for analysis & web scraping/APIs too
OpenRefine Data Formats
 ●
     CSV/TSV/separator based
 ●
     Fixed width field
 ●
     JSON & XML
 ●
     Excel & OpenOffice Calc
 ●
     Google Spreadsheets & Fusion Tables
 ●
     RDF
 ●
     URLs & zip files too!
Data Characterization
 ●
     Coded vs free-form fields
 ●
     Distribution of values
     –   Missing values – skip, impute, ...
     –   Outliers – cause? Can they be rescaled?
 ●
     Delimiters & escaping (e.g. HTML, XML)
 ●
     Formatting problems
 ●
     Character encoding issues?
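
A rough sketch of characterizing one column in Python: count missing values and tabulate how often each remaining value occurs. The set of strings treated as missing and the sample data are assumptions; real data will need its own list. The counts are similar in spirit to a value-count view of a column.

```python
from collections import Counter

# Strings treated as "missing" here; this list is an assumption.
MISSING = {"", "n/a", "na", "null", "-"}


def characterize(values):
    """Summarize a column: total rows, missing count, and value distribution."""
    cleaned = [v.strip() for v in values]
    missing = sum(1 for v in cleaned if v.lower() in MISSING)
    counts = Counter(v for v in cleaned if v.lower() not in MISSING)
    return {
        "total": len(cleaned),
        "missing": missing,
        "distinct": len(counts),
        "top": counts.most_common(3),
    }


if __name__ == "__main__":
    # Made-up column with inconsistent case, stray whitespace, and gaps.
    column = ["book", "Book ", "", "serial", "N/A", "book"]
    print(characterize(column))
```

Here "book" and "Book" are counted separately on purpose: spotting such near-duplicates is part of deciding whether a field is coded or free-form.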

Hands-on
 ●
     Let's play with some data!
 ●
     http://code.google.com/p/google-refine/




Export
 ●
     OpenRefine exports most import formats:
     Excel, CSV, TSV, OpenOffice, Google
     Spreadsheets, Fusion tables, JSON, RDF
 ●
     Template-based exporter for everything else:
     custom JSON formats, etc.




Scaling Up
 ●
     Experiment with a (representative) sample of
     your data
 ●
     Reuse regexes, filters, etc. with more heavy-duty
     tools – awk, sed, Map-Reduce
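
One way to draw a sample from a file too large to load at once is reservoir sampling, sketched below. This is a swapped-in standard technique rather than anything prescribed by the slide, and a uniform random sample is only one notion of "representative". The function name and the fixed seed are illustrative choices.

```python
import random


def reservoir_sample(lines, k, seed=0):
    """Pick k items uniformly at random from an iterable in a single pass."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(line)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = line
    return sample


if __name__ == "__main__":
    # Stand-in for reading lines from a large file.
    rows = (f"row-{n}" for n in range(100000))
    print(reservoir_sample(rows, 5))
```

Patterns and filters worked out on the sample can then be run over the full data with whichever heavier tool fits.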




Resources
 ●
     Berkeley Data Science course
     http://datascienc.es/schedule/
     –   week 2 - Data Preparation has good R examples
         http://berkeleydatascience.files.wordpress.com/2012/02/2012
 ●
     Mike Loukides "Data Hand Tools"
     http://radar.oreilly.com/2011/04/data-hand-tools
 ●
     Jeremy Howard "Getting in shape for the sport
     of Data Science"
     http://media.kaggle.com/MelbURN.html
More resources
●
    MIT IAP Data Science course materials
    –   http://dataiap.github.com/dataiap/
●
    Quora
    –   http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public
    –   http://www.quora.com/What-are-some-good-methods-for-data-pre-processing-in-machine-learning

●
    OKFN School of Data handbook
    –   http://handbook.schoolofdata.org
●
    Hilary Mason
    –   http://www.dataists.com/2010/09/a-taxonomy-of-data-science/
Resources mentioned
 ●
     Harvard Business Review competition at
     Kaggle
     –   Competition ends 8/27/2012 4:00 AM UTC!
     –   https://www.kaggle.com/c/harvard-business-review-vision-statement-prospect


 ●
     Stanford Data Wrangler
     –   http://vis.stanford.edu/wrangler/




Thanks!

•
     Questions now?
•
     Questions later:
      –   Twitter: @tfmorris
      –   Email: tfmorris@gmail.com




