Using entity extraction extension with OpenRefine and Dandelion API

Food for thought on why you need entity extraction capabilities inside OpenRefine, with some examples and scenarios.

Published in: Technology
For links and other sources, see the blog post: http://blog.spaziodati.eu/en/2014/07/24/using-openrefine-to-perform-text-mining-on-your-data-food-for-thoughts/

Transcript of "Using entity extraction extension with OpenRefine and Dandelion API"

  1. Using entity extraction extension with OpenRefine and Dandelion API: food for thought
  2. What we are talking about: OpenRefine (www.openrefine.org) and the NER extension (http://freeyourmetadata.org/named-entity-extraction/) integrated with Dandelion API (dandelion.eu)
  3. What industries are using OpenRefine? https://groups.google.com/d/msg/openrefine/vA75Ac_XODo/AfG8IRlEfSAJ
  4. Data journalists, metadata curators, museums, libraries, research labs, SEO folks, data scientists, enterprises, universities, patent attorneys, Open Data hackers, social media specialists, civil servants
  5. What does OpenRefine offer that other data-parsing tools don't? http://opendata.stackexchange.com/questions/515/what-does-openrefine-offer-that-other-data-parsing-tools-dont
  6. Reconciliation of text data against reference data services containing strong identifiers (Freebase, OpenCorporates, any SPARQL or RDF, etc.); simple linking of reconciled entities to other info sources like Wikipedia, MusicBrainz, IMDB, etc. […]
  7. How are we using it at SpazioDati?
  8. OpenRefine is inside our data curation controller
  9. Normalize, clean and extract data from different sources; reconcile against internal reconciliation services (administrative regions, names and telephone numbers…); apply rules and transformations to data and align it with our internal ontologies
  10. A look at OpenRefine & reconciliation
  11. Why is reconciliation useful? Instruments: bla bla bla bla bla bla bla … What kind of instruments?
  12. Reconciliation identifies keywords in flowing text and gives them a URL: from strings to things
  13. Data column “Instruments” (bla bla bla): musical instruments (URL), measuring instruments (URL), aeronautical instruments (URL)
  14. Reconciliation works great for those fields in your dataset that contain single terms: names of people, countries, works of art […]
  15. And what if we have a column with unstructured texts, like this one?
  16. We need a new step in the data curation workflow: starting from a data column with some texts, extract named entities using NER extension + Dandelion API into a new data column, labelled “dataTXT”
  17. In this column there are named concepts linked to Wikipedia (label + URI): “Collective action” + http://en.wikipedia.org/wiki/Collective_action
  18. Make a text filter looking for a concept; classify and categorize the content; … things, not strings
  19. Some scenarios
  20. Open Data community real issues. Using OpenRefine + NER extension with Dandelion API: extract meaningful information from some CVs, like names, organizations, skills, … (http://opendata.stackexchange.com/search?page=3&tab=relevance&q=extraction); normalize organization names cited in some texts
  21. Data journalists. Using OpenRefine + NER extension with Dandelion API: extract news relevant to a specific topic (a person, a brand or a company); write a summary of a politician's speech, starting from the main concepts extracted from the text; mine specific information in judicial decisions (judge's name, court, area of law and neutral citation number)
  22. Social media specialists. Using OpenRefine + NER extension with Dandelion API: text mining on tweets (extract brands, places and concepts easily from a Twitter stream related to an event); text mining on website content (extract concepts and places easily from a webpage, to improve website SEO ranking)
  23. Analytics on personal data. Using OpenRefine + NER extension with Dandelion API: understand your own bank account statements by extracting useful information, like brands and places, to categorize and classify your own expenses (“Quantified Self” movement)
  24. Do you know other use cases? Tell us on Twitter! @dandelionapi #refine #ner (@spaziodati, dandelion.eu)
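
The workflow in slides 16 to 18 (build an entity column, filter on a concept, then categorize) can be sketched outside OpenRefine as plain Python. This is only an illustration, not the extension's actual code: the sample rows are made up, and the annotation field names `label` and `uri` are assumptions modelled on the "label + URI" pairs of slide 17. No API call is made here; the annotations are assumed to have been extracted already.

```python
from collections import defaultdict

# Hypothetical rows: a free-text cell plus the annotations an entity
# extractor might return for it. The field names "label" and "uri" are
# assumptions based on the "label + URI" pairs shown in slide 17.
ROWS = [
    {
        "text": "Unions rely on collective action to negotiate wages.",
        "annotations": [
            {"label": "Collective action",
             "uri": "http://en.wikipedia.org/wiki/Collective_action"},
            {"label": "Trade union",
             "uri": "http://en.wikipedia.org/wiki/Trade_union"},
        ],
    },
    {
        "text": "We cleaned the dataset with OpenRefine.",
        "annotations": [
            {"label": "OpenRefine",
             "uri": "http://en.wikipedia.org/wiki/OpenRefine"},
        ],
    },
    {
        "text": "Collective action problems appear in many open data projects.",
        "annotations": [
            {"label": "Collective action",
             "uri": "http://en.wikipedia.org/wiki/Collective_action"},
        ],
    },
]


def entity_column(row):
    """Build the new column value: a list of (label, URI) pairs for one row."""
    return [(a["label"], a["uri"]) for a in row["annotations"]]


def filter_by_concept(rows, uri):
    """Keep the rows whose extracted entities include the given URI."""
    return [r for r in rows if any(u == uri for _, u in entity_column(r))]


def group_by_concept(rows):
    """Map each concept URI to the indexes of the rows that mention it."""
    index = defaultdict(list)
    for i, row in enumerate(rows):
        # A set avoids listing a row twice when a concept is repeated in it.
        for uri in {u for _, u in entity_column(row)}:
            index[uri].append(i)
    return dict(index)


if __name__ == "__main__":
    concept = "http://en.wikipedia.org/wiki/Collective_action"
    for row in filter_by_concept(ROWS, concept):
        print(row["text"])
    print(group_by_concept(ROWS))
```

Filtering on the URI rather than on the label text is the point of "things, not strings": two rows match the same concept even when the surrounding wording differs.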