Using entity extraction extension with OpenRefine and Dandelion API

Food for thought on why you need entity extraction capabilities inside OpenRefine, with some examples and scenarios.

Published in: Technology
  • For links and other sources, this is the blog post:
    http://blog.spaziodati.eu/en/2014/07/24/using-openrefine-to-perform-text-mining-on-your-data-food-for-thoughts/

  1. Using entity extraction extension with OpenRefine and Dandelion API: food for thought
  2. What we are talking about: OpenRefine (www.openrefine.org) and the NER extension (http://freeyourmetadata.org/named-entity-extraction/) integrated with Dandelion API (dandelion.eu)
  3. What industries are using OpenRefine? https://groups.google.com/d/msg/openrefine/vA75Ac_XODo/AfG8IRlEfSAJ
  4. data journalists, metadata curators, museums, libraries, research labs, SEO folks, data scientists, enterprises, universities, patent attorneys, Open Data hackers, social media specialists, civil servants
  5. What does OpenRefine offer that other data-parsing tools don't? http://opendata.stackexchange.com/questions/515/what-does-openrefine-offer-that-other-data-parsing-tools-dont
  6. reconciliation of text data against reference data services containing strong identifiers (Freebase, OpenCorporates, any SPARQL or RDF, etc.); simple linking of reconciled entities to other info sources like Wikipedia, MusicBrainz, IMDB, etc. […] […]
  7. How are we using it at SpazioDati?
  8. OpenRefine is inside our data curation controller
  9. normalize, clean and extract data from different sources; reconcile against internal reconciliation services (administrative regions, names and telephone numbers…); apply rules and transformations to data, and align it with our internal ontologies
  10. A look at OpenRefine & reconciliation
  11. Why is reconciliation useful? “Instruments bla bla bla bla bla bla bla …”: what kind of instruments?
  12. reconciliation identifies keywords in flowing text and gives them a URL: from strings to things
  13. data column: “Instruments bla bla bla”; possible matches for “instruments”: musical instruments (URL), measuring instruments (URL), aeronautical instruments (URL)
  14. reconciliation works great for those fields in your dataset that contain single terms: names of people, countries, works of art […]
  15. and what if we have a column with unstructured texts, like this one?
  16. we need a new step in the data curation workflow… from the data column with some texts, extract named entities using NER extension + Dandelion API into a new data column, labelled “dataTXT”
  17. in this column, there are named concepts, linked to Wikipedia: label + URI, e.g. “Collective action” + http://en.wikipedia.org/wiki/Collective_action
  18. make a text filter looking for a concept; classify and categorize the content … things, not strings
  19. some scenarios
  20. Open Data community real issues (using OpenRefine + NER extension with Dandelion API): extract meaningful information from some CVs, like names, organizations, skills, …; normalize organization names cited in some texts. http://opendata.stackexchange.com/search?page=3&tab=relevance&q=extraction
  21. Data journalists (using OpenRefine + NER extension with Dandelion API): extract news relevant to a precise topic (a person, a brand or a company); write a summary of a politician's speech, starting from the main concepts extracted from the text; mine specific information in judicial decisions (judge's name, court, area of law and neutral citation number)
  22. Social media specialists (using OpenRefine + NER extension with Dandelion API): text mining on tweets, to extract brands, places and concepts easily from a Twitter stream related to an event; text mining on website content, to extract concepts and places easily from a webpage and improve website SEO ranking
  23. Analytics on Personal Data, the “Quantified Self” movement (using OpenRefine + NER extension with Dandelion API): understand your own bank account statements by extracting useful information, like brands and places, to categorize and classify your own expenses
  24. @dandelionapi #refine #ner Do you know other use cases? Tell us on Twitter! @spaziodati dandelion.eu
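
To make slides 16 to 18 a bit more concrete, here is a minimal Python sketch of the idea outside OpenRefine: each text gets a "dataTXT" cell holding label + URI pairs, and a filter then selects rows by concept URI rather than by string. Everything here is an assumption for illustration: the sample texts and annotations are hand-written, the `label` and `uri` keys are a guess at the shape of an entity extraction result, and `Entity`, `to_entities` and `rows_about` are hypothetical helpers, not part of the NER extension or of Dandelion API. No API call is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    label: str  # human-readable name of the concept
    uri: str    # Wikipedia URL that identifies the concept


def to_entities(annotations):
    """Convert annotation dicts into Entity objects, keeping one per URI."""
    seen = set()
    entities = []
    for ann in annotations:
        if ann["uri"] in seen:
            continue  # the same concept was already found in this text
        seen.add(ann["uri"])
        entities.append(Entity(label=ann["label"], uri=ann["uri"]))
    return entities


def rows_about(rows, concept_uri):
    """Filter rows on a concept (a thing), not on a substring (a string)."""
    return [row for row in rows
            if any(e.uri == concept_uri for e in row["dataTXT"])]


COLLECTIVE_ACTION = "http://en.wikipedia.org/wiki/Collective_action"

# Hand-written sample: (text, annotations). The annotation shape is assumed.
sample = [
    ("Unions rely on collective action. Collective action takes many forms.",
     [{"label": "Collective action", "uri": COLLECTIVE_ACTION},
      {"label": "Trade union", "uri": "http://en.wikipedia.org/wiki/Trade_union"},
      {"label": "Collective action", "uri": COLLECTIVE_ACTION}]),
    ("The museum opened a new wing.",
     [{"label": "Museum", "uri": "http://en.wikipedia.org/wiki/Museum"}]),
]

# Build the new "dataTXT" column next to the original text column.
rows = [{"text": text, "dataTXT": to_entities(anns)} for text, anns in sample]

matches = rows_about(rows, COLLECTIVE_ACTION)
for row in matches:
    print(row["text"], [e.label for e in row["dataTXT"]])
```

Filtering on the URI rather than on the label is the point of "things, not strings": two texts that name the same concept with different wording still end up in the same group.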
