Let your data shine… with OpenRefine
Open Belgium 2016
OpenRefine workshop
Brosens - Desmet
  1. Let your data shine… with OpenRefine
     Open Belgium 2016, OpenRefine workshop
     Brosens - Desmet
  2. What people say: tweets
     ● @bartox: "Damn! Wish I had this 5 years ago! RT @swiertz nice tools ! Format & clean your data with Google Refine http://goo.gl/UniR6 #cleanup #tools"
     ● @Musebrarian: "YIPEEEE! Google Refine works with OAI-PMH XML out of the box. This is going to make my life much easier."
     ● @kb: "It’s kind of ridiculous how exciting I find this: https://code.google.com/p/google-refine/"
     ● @litcritter: "I rarely feel the desire to kiss a corporation on the mouth, but Google Refine is making me come close http://goo.gl/8pvKB #datageek"
  3. What people say: tweets
     ● @LearonDalby: "I'm sold on #Google #Refine used it most of the day with "messy" data and managed to clean nearly all of it."
     ● @roolio: "Today google #refine saved my afternoon. Every #data #hacker should try it"
     ● @Salesient: "Google refine is awesome. Never before have I been home this early."
     ● @Mayin: "Not only will it clean your data, Google Refine will slice, dice and put bows on your hairdo! http://bit.ly/cPGn1E Rocks data exploration."
     ● @marklabedz: "Google Refine: Making interns unneccesary since 2010."
     ● @naterkane: "i'm completely in love with Google Refine. fo' reals."
     ● @LearonDalby: "Using #Google #Refine makes me happy. Even for the easy stuff."
     ● @loranstefani: "Google Refine: love at first click"
     ● @tracystan: "Google Refine is gonna change my life"
  4. What people say: blogs
     ● "Google Refine isn’t going to solve the problem of poor data availability, but for those who manage to gain access to existing records, it can be a powerful tool for transparency."
       Rebekah Heacock, co-director of the Technology for Transparency Network and a Project Coordinator at Harvard’s Berkman Center for Internet and Society - Sunlight Foundation, Tools for transparency: Google Refine.
     ● "Google Refine is an immensely powerful tool for dealing with "messy" data, and it sports a myriad of advanced features for massaging and analyzing complex data sets"
       Dmitri Popov (Linux Magazine) - Use Google Refine to Massage Your Data
     ● "For anyone who’s ever had to sort through messy data to try to turn up a meaningful treatment, and who hasn’t, this tool is a godsend."
       Michael Lines, SLAW - Google Refine 2.0
     ● "Google Refine 2.0 will serve an excellent back-end for data visualization services. It has been well received by the Chicago Tribune and open-government data communities. Along with Google Squared, Refine 2.0 can create a powerful research tool."
       Chinmoy Kanjilal, Techie Buzz - Google Refine 2.0: Power Tools for Working With Data
  5. A free, open source, powerful tool for working with messy data
     ● Formerly known as Google Refine, now OpenRefine
     ● Site: http://openrefine.org
     ● Github: https://github.com/OpenRefine
     ● Used for
       ○ Data cleaning (detect and correct anomalies)
       ○ Transform data (change format, change datatype)
       ○ “Pimp” & “link” data (harvest & connect data from online databases)
     ● More powerful than a worksheet
     ● More visual than scripting
  6. A free, open source, powerful tool for working with messy data
     ● Supported by a large community (lots of tutorials and plugins)
     ● Works quite well up to 100,000 rows of data
     ● Supports several file formats
     ● The original file is unaffected
     ● OpenRefine runs in a modern browser, but does not require an internet connection (except when you connect to external services)
  7. OpenRefine vs other tools
     ● Worksheet
       ○ Other tools: focus on cells. OpenRefine: focus on rows and columns
       ○ Other tools: focus on import data & calculations. OpenRefine: focus on exploring and transforming existing data
     ● Scripting
       ○ Other tools: data → script → output. OpenRefine: all steps are visualized
       ○ focus on transformation of data
     ● Databases
       ○ Other tools: focus on queries. OpenRefine: looks like a worksheet
       ○ Other tools: you should know the data. OpenRefine: data is always visible, facets show you choices
  8. Tools working with OpenRefine (distribution, authors, description)
     ● LODRefine (Sparkica): LODRefine is actually OpenRefine with integrated extensions that make the transition from tabular data to Linked Data a bit easier. Integrated extensions are: RDF extension, DBpedia extension, Crowdsourcing extension, Stats extension
     ● OpenDataRise (Open Data in Trentino): Tool to cleanse and semantify datasets from CKAN repositories. Based on OpenRefine.
     ● p3-batchrefine (SpazioDati): BatchRefine adds batch processing capabilities to OpenRefine and supports multiple back ends, including Spark
     ● SparkonRefine (SpazioDati): RefineOnSpark is a driver program to run OpenRefine jobs on the Spark cluster
     ● Reconciliation-and-Matching-Framework (RBGKew): A framework to allow the matching of string entities using customised sets of transformations and matchers, plus a tool to produce the necessary configurations and another to expose them as OpenRefine reconciliation services.
  9. Hands on: install OpenRefine
     ● Download OpenRefine from: http://openrefine.org/download.html
     ● Launch OpenRefine
     ● Create a project
     ● Choose the file you want to clean (Example dataset: Onderwijsaanbod in Vlaanderen, http://opendata.vlaanderen.be/dataset/onderwijsaanbod)
  10. Hands on: importing data
      ● Check the preview and define parsing
        ○ Set character encoding (UTF-8)
        ○ Choose delimiter (\t ; , …)
        ○ Parse data as (csv)
        ○ Parse first line as column header, ignore first … line(s)
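For readers who want to see what these import settings amount to outside the OpenRefine dialog, here is a minimal Python sketch of the same choices (encoding, delimiter, header line, lines to skip). The sample bytes and the function name `parse_delimited` are made up for illustration; this is not how OpenRefine implements its importer.

```python
import csv
import io

# Hypothetical sample standing in for an uploaded file (not the workshop dataset).
RAW = "naam;gemeente\nSchool A;Gent\nSchool B;Brussel\n".encode("utf-8")


def parse_delimited(raw: bytes, encoding: str = "utf-8", delimiter: str = ";",
                    skip_lines: int = 0) -> list[dict[str, str]]:
    """Decode the bytes, skip leading lines, and use the next line as header."""
    stream = io.StringIO(raw.decode(encoding), newline="")
    for _ in range(skip_lines):
        next(stream, None)  # discard a leading line, tolerate short files
    reader = csv.DictReader(stream, delimiter=delimiter)
    return list(reader)


rows = parse_delimited(RAW)
for row in rows:
    print(row)
```

Choosing the wrong encoding or delimiter here shows up the same way it does in the OpenRefine preview: garbled characters or everything landing in a single column.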
  11. Hands on: faceting
      ● Accessing information organized according to a faceted classification system
        ○ Creating an overview of the data
        ○ Allows targeted editing of your data
        ○ Allows specific filtering
        ○ Facet choices as tab separated values (like pivot tables in Excel)
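Conceptually, a text facet is a count of the distinct values in a column, and selecting a facet choice filters the rows. The sketch below illustrates that idea in Python; the rows and the function names are hypothetical, and it only mirrors the behaviour described above, not OpenRefine's internals.

```python
from collections import Counter


def text_facet(rows, column):
    """Count how often each distinct value occurs in the given column."""
    return Counter(row.get(column, "") for row in rows)


def filter_by_facet(rows, column, choice):
    """Keep only the rows whose value equals the selected facet choice."""
    return [row for row in rows if row.get(column, "") == choice]


# Hypothetical rows; note the trailing space in the second value.
rows = [
    {"gemeente": "Gent"},
    {"gemeente": "Gent "},
    {"gemeente": "Brussel"},
    {"gemeente": "Gent"},
]

facet = text_facet(rows, "gemeente")
# Facet choices as tab separated values, most frequent first.
tsv = "\n".join(f"{value}\t{count}" for value, count in facet.most_common())
print(tsv)
```

The facet makes the stray "Gent " with a trailing space visible as its own choice, which is exactly the kind of anomaly faceting is meant to surface.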
  12. Hands on: clustering
      ● Clustering lets you automatically group and edit values that are different but similar
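One way to group "different but similar" values is key collision: compute a normalized key per value and group the values that share a key. The sketch below is a simplified Python take on a fingerprint-style key (trim, lowercase, strip accents and punctuation, sort unique tokens). It is an approximation for illustration with made-up sample values, and OpenRefine offers several clustering methods beyond this one.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key so that near-identical spellings collide."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by fingerprint; keep only groups with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical sample values.
clusters = cluster([
    "Vrije Universiteit Brussel",
    "vrije universiteit brussel.",
    "Brussel, Vrije Universiteit",
    "Universiteit Gent",
    "universiteit  gént",
])
for group in clusters:
    print(group)
```

After reviewing a cluster, you would pick one spelling and apply it to every member, which is the "edit" half of the slide's statement.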
  13. Hands on: edit cells
      ● Common transforms:
        ○ trim leading and trailing whitespace
        ○ to title case; to date; to number
      ● Split & join multi-valued cells
  14. Hands on: edit columns
      ● Split columns (by separator or field length)
      ● Add columns (by fetching URLs or based on another column, using GREL)
      ● Move columns
      ● Remove columns
      ● Rename columns
  15. Hands on: scripting using GREL
      ● GREL (General Refine Expression Language, formerly Google Refine Expression Language)
        ○ add columns based on another column
          ■ basic string modification
          ■ find and replace
          ■ string parsing and splitting
          ■ calling web services
        ○ Results are always visible in the Preview
  16. Hands on: georeferencing
      ● Add columns by fetching URL
        ■ find and replace
        ■ string parsing & splitting
        ■ add a column based on column "straat": value + "%20" + cells["huisnummer"].value
        ■ call the Google API (or OpenStreetMap or …): "https://maps.googleapis.com/maps/api/geocode/json?address=" + value + cells["huisnummer"].value + "&key=AIzaSyDY2Z6wehbIqIPrHIb9ljC62pwRqEHOous"
        ■ parse the JSON: value.parseJson()["results"][0]["geometry"]["location"]["lng"]
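The two GREL steps above (build a request URL from street and house number, then pull a coordinate out of the JSON answer) can be sketched in Python as follows. No request is actually sent: the response text is a hand-written sample that only mimics the `results[0].geometry.location` shape used in the slide, and `YOUR_API_KEY`, the address and the function names are placeholders.

```python
import json
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(street: str, house_number: str, api_key: str) -> str:
    """URL-encode the address so spaces and special characters are safe."""
    query = urlencode({"address": f"{street} {house_number}", "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


def extract_lng_lat(response_text: str):
    """Return (lng, lat) of the first result, or None when nothing matched."""
    results = json.loads(response_text).get("results", [])
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lng"], location["lat"]


url = build_geocode_url("Kerkstraat", "12", "YOUR_API_KEY")

# Hand-written sample response, not real geocoder output.
SAMPLE_RESPONSE = '{"results": [{"geometry": {"location": {"lat": 51.05, "lng": 3.72}}}]}'
coordinates = extract_lng_lat(SAMPLE_RESPONSE)
print(url)
print(coordinates)
```

Encoding the whole query with `urlencode` avoids hand-placing `%20` between the parts, and checking for an empty `results` list guards against addresses the service cannot resolve.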
  17. Hands on: reconciling
      ● Grouping concepts with an external service, e.g. taxonomic reconciliation
        ○ Example from the natural environment (biodiversity data)
          ■ add a reconciliation service (reconcile, start reconciling)
          ■ Let’s use Encyclopedia of Life
          ■ Select matches (Facet, Quick actions…)
  18. Hands on: reconciling
      ● Grouping concepts with an external service, e.g. taxonomic reconciliation
        ○ Example from the natural environment (biodiversity data)
          ■ add an EOL ID column (GREL): cell.recon.match.id
          ■ create a URL based on the EOL ID
          ■ for example: http://eol.org/pages/3465521
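The last step, turning a matched identifier into a page URL, is plain string concatenation. A small Python sketch, assuming the URL pattern shown in the example above and treating unmatched cells as `None` (the function name is hypothetical):

```python
EOL_PAGE_PREFIX = "http://eol.org/pages/"


def eol_page_url(match_id):
    """Build the page URL for a reconciled ID; unmatched cells yield None."""
    if match_id is None:
        return None
    return f"{EOL_PAGE_PREFIX}{match_id}"


print(eol_page_url("3465521"))
print(eol_page_url(None))
```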
  19. Hands on: cross referencing
      ● Merge data from two projects by creating a new column from values in an existing column within one project that are used to index into a similar column in the other project
        ○ cell.cross("datasetname.csv","scientificName").cells["order"].value[0]
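The cross-referencing expression above behaves like a lookup join: index the other project by a key column, then fetch a field from the first matching row. Below is a minimal Python sketch of that idea. The two small row lists and the function names are invented for illustration; only the column names `scientificName` and `order` come from the slide.

```python
from collections import defaultdict


def build_index(rows, key_column):
    """Map each key value to the list of rows that carry it."""
    index = defaultdict(list)
    for row in rows:
        index[row[key_column]].append(row)
    return index


def cross_first(value, index, target_column):
    """Return the target column of the first matching row, or None."""
    matches = index.get(value, [])
    return matches[0][target_column] if matches else None


# Hypothetical "other project" holding the taxonomy.
taxonomy = [
    {"scientificName": "Parus major", "order": "Passeriformes"},
    {"scientificName": "Anas platyrhynchos", "order": "Anseriformes"},
]
# Hypothetical current project.
observations = [
    {"scientificName": "Parus major"},
    {"scientificName": "Unknown species"},
]

index = build_index(taxonomy, "scientificName")
for row in observations:
    row["order"] = cross_first(row["scientificName"], index, "order")
print(observations)
```

Rows without a counterpart in the other project end up with an empty value, so it is worth faceting on the new column afterwards to see how many lookups failed.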
  20. Hands on: extract operation history
      ● Extract and save parts of your operation history as JSON that you can apply to this or other projects in the future.
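Because the extracted history is plain JSON, it can be inspected or trimmed with any scripting language before it is pasted into another project. The Python sketch below keeps only selected operation types. The sample history is hand-written and abbreviated: the `op` identifiers and field names follow what OpenRefine exports as far as we are aware, but treat them as assumptions and check them against your own export.

```python
import json

# Hand-written, abbreviated sample; a real export contains more fields.
HISTORY_JSON = """[
  {"op": "core/text-transform",
   "description": "Trim whitespace in column gemeente",
   "columnName": "gemeente",
   "expression": "value.trim()"},
  {"op": "core/column-rename",
   "description": "Rename column straat to street",
   "oldColumnName": "straat",
   "newColumnName": "street"}
]"""


def select_operations(history_text: str, op_types: set) -> list:
    """Keep only the operations whose type is in op_types."""
    operations = json.loads(history_text)
    return [op for op in operations if op.get("op") in op_types]


selected = select_operations(HISTORY_JSON, {"core/text-transform"})
reusable = json.dumps(selected, indent=2)  # text to paste into another project
print(reusable)
```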
  21. Hands on: further reading
      ● https://github.com/OpenRefine/OpenRefine/wiki
      ● https://github.com/OpenRefine/OpenRefine/wiki/Recipes
      ● http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial
      ● ...
