Let your data shine... with OpenRefine

Open Knowledge Belgium
Let your data shine… with OpenRefine
Open Belgium 2016
OpenRefine workshop
Brosens - Desmet
What people say: tweets
@bartox: "Damn! Wish I had this 5 years ago! RT @swiertz nice tools ! Format & clean your data with Google Refine http://goo.gl/UniR6 #cleanup #tools"
@Musebrarian: "YIPEEEE! Google Refine works with OAI-PMH XML out of the box. This is going to make my life much easier."
@kb: "It’s kind of ridiculous how exciting I find this: https://code.google.com/p/google-refine/"
@litcritter: "I rarely feel the desire to kiss a corporation on the mouth, but Google Refine is making me come close http://goo.gl/8pvKB #datageek"
@LearonDalby: "I'm sold on #Google #Refine used it most of the day with "messy" data and managed to clean nearly all of it."
@roolio: "Today google #refine saved my afternoon. Every #data #hacker should try it"
@Salesient: "Google refine is awesome. Never before have I been home this early."
@Mayin: "Not only will it clean your data, Google Refine will slice, dice and put bows on your hairdo! http://bit.ly/cPGn1E Rocks data exploration."
@marklabedz: "Google Refine: Making interns unneccesary since 2010."
@naterkane: "i'm completely in love with Google Refine. fo' reals."
@LearonDalby: "Using #Google #Refine makes me happy. Even for the easy stuff."
@loranstefani: "Google Refine: love at first click"
@tracystan: "Google Refine is gonna change my life"
"Google Refine isn’t going to solve the problem of poor data availability, but for those who manage to gain access to existing records, it can be a powerful tool for transparency." Rebekah Heacock, co-director of the Technology for Transparency Network and a Project Coordinator at Harvard’s Berkman Center for Internet and Society - Sunlight Foundation, Tools for transparency: Google Refine.
"Google Refine is an immensely powerful tool for dealing with "messy" data, and it sports a myriad of advanced features for massaging and analyzing complex data sets" Dmitri Popov (Linux Magazine) - Use Google Refine to Massage Your Data
"For anyone who’s ever had to sort through messy data to try to turn up a meaningful treatment, and who hasn’t, this tool is a godsend." Michael Lines, SLAW - Google Refine 2.0
"Google Refine 2.0 will serve an excellent back-end for data visualization services. It has been well received by the Chicago Tribune and open-government data communities. Along with Google Squared, Refine 2.0 can create a powerful research tool." Chinmoy Kanjilal, Techie Buzz - Google Refine 2.0: Power Tools for Working With Data
What people say: blogs
● Formerly known as Google Refine, now OpenRefine
● Site: http://openrefine.org
● GitHub: https://github.com/OpenRefine
● Used for
○ Data cleaning (detect and correct anomalies)
○ Transform data (change format, change datatype)
○ “Pimp” & “link” data (harvest & connect data from online databases)
● More powerful than a worksheet
● More visual than scripting
A free, open source, powerful tool for working with messy data
● Supported by a large community (lots of tutorials and plugins)
● Works quite well up to 100,000 rows of data
● Supports several file formats
● The original file is unaffected
● OpenRefine runs in a modern browser, but does not require an internet connection (except when you connect to services)
A free, open source, powerful tool for working with messy data
|  | Other tools | OpenRefine |
| Worksheet | focus on cells; focus on import data & calculations | focus on rows and columns; focus on exploring and transforming existing data |
| Scripting | data → script → output; focus on transformation of data | all steps are visualized |
| Databases | focus on queries; you should know the data | looks like a worksheet; data is always visible, facets show you choices |
OpenRefine vs other tools
| Distribution | Description | Authors |
| LODRefine | LODRefine is actually OpenRefine with integrated extensions that make the transition from tabular data to Linked Data a bit easier. Integrated extensions are: RDF extension, DBpedia extension, Crowdsourcing extension, Stats extension | Sparkica |
| OpenDataRise | Tool to cleanse and semantify datasets from CKAN repositories. Based on OpenRefine. | Open Data in Trentino |
| p3-batchrefine | BatchRefine adds batch processing capabilities to OpenRefine and supports multiple back ends, including Spark | SpazioDati |
| SparkonRefine | RefineOnSpark is a driver program to run OpenRefine jobs on the Spark cluster | SpazioDati |
| Reconciliation-and-Matching-Framework | A framework to allow the matching of string entities using customised sets of transformations and matchers, plus a tool to produce the necessary configurations and another to expose them as OpenRefine reconciliation services. | RBGKew |
Tools working with OpenRefine
● Download OpenRefine from: http://openrefine.org/download.html
● Launch OpenRefine
● Create a project
● Choose the file you want to clean (example dataset: Onderwijsaanbod in Vlaanderen, http://opendata.vlaanderen.be/dataset/onderwijsaanbod)
Hands on: install OpenRefine
● Check the preview and define parsing
○ Set character encoding (UTF-8)
○ Choose delimiter (\t ; , …)
○ Parse data as (CSV)
○ Parse first line as column header, ignore first … line(s)
Hands on: importing data
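For readers who want to see what these import settings amount to, here is a minimal Python sketch of the same choices (encoding, delimiter, lines to ignore, header line). It is not how OpenRefine parses files internally; the function name `parse_delimited` and the sample data are made up for illustration.

```python
import csv
import io


def parse_delimited(text, delimiter=";", skip_lines=0, has_header=True):
    """Parse delimited text into (header, rows), mirroring the import options."""
    lines = io.StringIO(text)
    for _ in range(skip_lines):
        next(lines, None)  # "ignore first ... line(s)"
    reader = csv.reader(lines, delimiter=delimiter)
    rows = [row for row in reader if row]  # drop completely empty lines
    if has_header and rows:
        return rows[0], rows[1:]
    # Without a header line, fall back to generic column names.
    width = len(rows[0]) if rows else 0
    return [f"Column {i + 1}" for i in range(width)], rows


# Hypothetical sample file content, decoded as UTF-8 before parsing.
raw = "exported 2016\nschool;gemeente\nSint-Jan;Gent\nDe Brug;Brugge\n".encode("utf-8")
header, rows = parse_delimited(raw.decode("utf-8"), delimiter=";", skip_lines=1)
print(header)
print(rows)
```

Choosing the wrong delimiter or encoding shows up immediately in the output, which is the same reason to check the preview before creating the project.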
● Accessing information organized according to a faceted classification system
○ Creating an overview of the data
○ Allows targeted editing of your data
○ Allows specific filtering
○ Facet choices as tab-separated values (like pivot tables in Excel)
Hands on: faceting
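Conceptually, a text facet is a count of the distinct values in a column, and selecting a facet choice filters the rows. The Python sketch below illustrates that idea only; `text_facet`, `facet_as_tsv` and the "provincie" sample column are hypothetical names, not OpenRefine code.

```python
from collections import Counter


def text_facet(rows, column):
    """Return (value, count) pairs, most frequent first, ties sorted by value."""
    counts = Counter(row[column] for row in rows)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def facet_as_tsv(choices):
    """Render facet choices as tab-separated lines: value<TAB>count."""
    return "\n".join(f"{value}\t{count}" for value, count in choices)


# Hypothetical sample rows; note the messy last value.
rows = [
    {"provincie": "Antwerpen"},
    {"provincie": "Limburg"},
    {"provincie": "Antwerpen"},
    {"provincie": "antwerpen "},
]
choices = text_facet(rows, "provincie")
print(facet_as_tsv(choices))

# Selecting a facet choice is equivalent to filtering on that value.
selected = [row for row in rows if row["provincie"] == "Antwerpen"]
```

The overview makes anomalies such as "antwerpen " (lowercase, trailing space) stand out next to the clean value, which is what makes targeted editing possible.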
● Clustering lets you automatically group and edit values that are different but similar
Hands on: clustering
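One common way to find "different but similar" values is key collision: reduce each value to a normalized key and group the values that share a key. The Python sketch below is a simplified take on the fingerprint idea; OpenRefine's own implementation may differ in details, and the function names and sample values are assumptions for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: trim, lowercase, fold accents, drop punctuation,
    then sort and de-duplicate the whitespace-separated tokens."""
    text = value.strip().lower()
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a fingerprint; keep only groups with 2+ variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical sample values.
values = ["Université de Liège", "universite de liege", "Liège, Université de", "Gent"]
clusters = cluster(values)
print(clusters)
```

Each resulting group is a candidate for merging into a single value; whether the variants really mean the same thing remains a human decision.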
● Common transforms:
○ to number
○ trim leading and trailing whitespace
○ to title case; to date
● Split & join multi-valued cells
Hands on: edit cells
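As a rough illustration of what these cell edits do, here is a Python sketch with hypothetical helper names and sample data. One deliberate simplification: when OpenRefine splits multi-valued cells it leaves the other columns blank on the extra rows, whereas this sketch repeats them.

```python
def to_number(value):
    """Convert text to int or float; leave unparseable cells unchanged."""
    text = value.strip()
    try:
        return int(text)
    except ValueError:
        try:
            return float(text)
        except ValueError:
            return value


def split_multi_valued(rows, column, separator):
    """Turn one row with 'a; b' into one row per value (other columns repeated)."""
    result = []
    for row in rows:
        for part in row[column].split(separator):
            result.append({**row, column: part.strip()})
    return result


def join_multi_valued(rows, key_column, column, separator):
    """Inverse of the split: collect the values per key into one cell."""
    merged = {}
    for row in rows:
        merged.setdefault(row[key_column], []).append(row[column])
    return [{key_column: key, column: separator.join(vals)} for key, vals in merged.items()]


# Hypothetical sample rows.
rows = [{"school": "A", "talen": "nl; fr"}, {"school": "B", "talen": "en"}]
split_rows = split_multi_valued(rows, "talen", ";")
joined_rows = join_multi_valued(split_rows, "school", "talen", "; ")
print("  gent ".strip().title(), to_number(" 42 "), split_rows)
```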
● Split columns (by separator or field length)
● Add columns (by fetching URLs or based on column) (use GREL)
● Move columns
● Remove columns
● Rename columns
Hands on: edit columns
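The two ways of splitting a column, by separator or by field length, can be sketched in Python as follows. The function names and sample values are hypothetical; this only illustrates the idea, not OpenRefine's implementation.

```python
def split_by_separator(value, separator, max_columns=None):
    """Split a cell on a separator, optionally limiting the number of new columns."""
    if max_columns is None:
        return value.split(separator)
    return value.split(separator, max_columns - 1)


def split_by_field_lengths(value, lengths):
    """Split a cell into fixed-width fields, e.g. lengths [4, 2, 2] for YYYYMMDD."""
    parts, start = [], 0
    for length in lengths:
        parts.append(value[start:start + length])
        start += length
    return parts


print(split_by_separator("9000 Gent", " "))
print(split_by_field_lengths("20160229", [4, 2, 2]))
```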
● GREL (General Refine Expression Language, formerly Google Refine Expression Language)
○ add columns based on another column
■ basic string modification
■ find and replace
■ string parsing and splitting
■ calling web services
○ Results are always visible in the Preview
Hands on: scripting using GREL
● Add columns by fetching URLs
■ find and replace
■ string parsing & splitting
■ add column based on column "straat" (value + "%20" + cells["huisnummer"].value)
■ Call Google API (or OpenStreetMap or …) ("https://maps.googleapis.com/maps/api/geocode/json?address=" + value + cells["huisnummer"].value + "&key=YOUR_API_KEY")
■ Parse JSON (value.parseJson()["results"][0]["geometry"]["location"]["lng"])
Hands on: georeferencing
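The two steps of this slide, building the request URL and pulling a coordinate out of the JSON response, can be sketched outside OpenRefine in Python. No request is sent here: the sample response is a hand-written stand-in that only follows the results/geometry/location path used in the expression above, and the address and `YOUR_API_KEY` are placeholders.

```python
import json
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(street, house_number, api_key):
    """Build the request URL; urlencode takes care of spaces and special characters."""
    query = urlencode({"address": f"{street} {house_number}", "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


def extract_location(response_text):
    """Return (lat, lng) of the first result, or None when nothing matched."""
    results = json.loads(response_text).get("results", [])
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]


# Hand-written stand-in for a geocoding response (illustrative values only).
sample_response = '{"results": [{"geometry": {"location": {"lat": 51.05, "lng": 3.72}}}], "status": "OK"}'

url = build_geocode_url("Korenmarkt", "1", "YOUR_API_KEY")
print(url)
print(extract_location(sample_response))
```

Encoding the address properly matters as soon as street names contain spaces or accented characters, and handling the empty-results case avoids errors on rows that cannot be geocoded.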
● Grouping concepts with an external service, e.g. taxonomic reconciliation
○ Example from the natural environment (biodiversity data)
■ add a reconciliation service (reconcile, start reconciling)
■ Let’s use Encyclopedia of Life
■ Select Matches (Facet, Quick actions…)
Hands on: reconciling
● Grouping concepts with an external service, e.g. taxonomic reconciliation
○ Example from the natural environment (biodiversity data)
■ add an EOL ID column (GREL): cell.recon.match.id
■ create a URL based on the EOL ID
■ http://eol.org/pages/3465521
Hands on: reconciling
● Merge data from two projects by creating a new column from values from an existing column within one project that are used to index into a similar column in the other project
○ cell.cross("datasetname.csv","scientificName").cells["order"].value[0]
Hands on: cross referencing
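The cross-referencing idea is a lookup: index the other project by the shared column, then fetch a value from the first matching row. A minimal Python sketch of that idea follows; the function name `cross_lookup` and the sample rows are hypothetical, and this is not OpenRefine's implementation.

```python
def cross_lookup(rows, key_column, other_rows, other_key_column, wanted_column):
    """Add wanted_column to each row, taken from the first matching row in other_rows."""
    index = {}
    for other in other_rows:
        index.setdefault(other[other_key_column], []).append(other)
    result = []
    for row in rows:
        matches = index.get(row[key_column], [])
        value = matches[0][wanted_column] if matches else None  # first match, like [0]
        result.append({**row, wanted_column: value})
    return result


# Hypothetical sample projects.
observations = [{"scientificName": "Parus major"}, {"scientificName": "Unknown sp."}]
taxonomy = [{"scientificName": "Parus major", "order": "Passeriformes"}]

merged = cross_lookup(observations, "scientificName", taxonomy, "scientificName", "order")
print(merged)
```

Rows without a match get an empty value, so it is worth faceting on the new column afterwards to see how many lookups failed.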
● Extract and save parts of your operation history as JSON that you can apply to this or other projects in the future.
Hands on: Extract operation history
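An extracted history is a JSON list with one object per operation. The Python sketch below reads such a list and keeps only selected steps. The sample JSON is illustrative and modeled on a typical export; the exact fields may vary by OpenRefine version, and the helper names are hypothetical.

```python
import json

# Illustrative operation history (field names are an assumption).
history_json = """
[
  {
    "op": "core/text-transform",
    "description": "Text transform on cells in column straat using expression value.trim()",
    "engineConfig": {"facets": [], "mode": "row-based"},
    "columnName": "straat",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  },
  {
    "op": "core/column-rename",
    "description": "Rename column straat to street",
    "oldColumnName": "straat",
    "newColumnName": "street"
  }
]
"""


def summarize(history_text):
    """List (operation id, description) for every step in the history."""
    return [(step["op"], step["description"]) for step in json.loads(history_text)]


def keep_ops(history_text, wanted_ops):
    """Return a new JSON history containing only the wanted operation types."""
    steps = [step for step in json.loads(history_text) if step["op"] in wanted_ops]
    return json.dumps(steps, indent=2)


summary = summarize(history_json)
renames_only = keep_ops(history_json, {"core/column-rename"})
print(summary)
```

Keeping the JSON under version control alongside the raw data makes a cleaning workflow repeatable on the next delivery of the same dataset.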
● https://github.com/OpenRefine/OpenRefine/wiki
● https://github.com/OpenRefine/OpenRefine/wiki/Recipes
● http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial
● ...
Hands on: further reading