OpenRefine Reconciliation Service API (DC Python, 2013-03-05)


Building a Reconciliation Service API for OpenRefine to match names against.

1 Comment
  • Sounds like most of the things on your 'Woes' slide are documentation related. Perhaps you could apply the lessons you learned to help improve the documentation in the wiki.



  1. OpenRefine & Influence Explorer: Building a Reconciliation Service API
     Alison Rowland, Project Lead, Influence Explorer
     Sunlight Foundation
     DCPython 2013/03/05
  2. Influence Explorer
     ● connects the dots of political influence
     ● brings together datasets about:
       ○ lawmakers
       ○ lobbyists
       ○ corporations
     ● LOTS of huge datasets, from both government and NGOs
       ○ campaign finance
       ○ lobbying
       ○ EPA
       ○ gov. grants and contracts, plus contract violations
       ○ federal regulations, and more...
  3. Entity Resolution
     ● datasets get matched to our entity universe
     ● rarely have unique or common IDs for entities
  4. Influence Explorer's Matching Framework
     ● Uses Name Cleaver for standardization
     ● Hooks into Django for ORM magic
     ● Heuristic-based
     ● Results good, but process is messy
       ○ Merge matching result tables
       ○ Export to spreadsheet
       ○ Humans verify via marking columns, deleting rows
       ○ Data munged and reimported into DB
  5. OpenRefine
     Manages the human process of verification
  6. OpenRefine Features
     ● Cleaning: various built-in transforms, such as whitespace trimming, case cleanup, etc.
     ● Faceting: groups records by a column value and lets you see slices of your dataset
     ● Clustering: uses fuzzy algorithms (e.g. Levenshtein distance) to find, group, and clean up like records
     ● Reconciling: connects to an external API endpoint to match records with ones in some other dataset
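To make the clustering bullet above concrete, here is a minimal sketch of Levenshtein edit distance with a naive greedy grouping built on top of it. This is not OpenRefine's own clustering code: the function names `levenshtein` and `cluster` and the `max_distance` threshold are assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # At the start of iteration i, prev[j] is the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def cluster(names, max_distance=2):
    """Greedily group names within max_distance edits of a group's first member."""
    clusters = []
    for name in names:
        for group in clusters:
            if levenshtein(name.lower(), group[0].lower()) <= max_distance:
                group.append(name)
                break
        else:
            # No existing group was close enough, so start a new one.
            clusters.append([name])
    return clusters
```

Comparing every name against each group's first member keeps the sketch short; a real tool would likely use a smarter blocking or keying strategy for large datasets.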
  7. How a Reconciliation Service Works
     Refine sends the column value to the service, and the service looks for potential matches and sends back ranked results.
     The service can flag matches with very high confidence. Back in Refine, the user can choose to auto-match those.
     Refine queries in batches of ten.
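The batching behavior described above can be sketched from the client's side as follows. This is only an illustration of splitting column values into groups of ten. The helper name `batch_queries` is hypothetical, and restarting the keys at `q0` in every batch is an assumption, not something the slides state.

```python
def batch_queries(values, batch_size=10):
    """Yield query dicts with at most batch_size entries each."""
    for start in range(0, len(values), batch_size):
        chunk = values[start:start + batch_size]
        # Keys follow the "q0", "q1", ... pattern shown in the POST examples.
        yield {"q%d" % i: {"query": value} for i, value in enumerate(chunk)}
```

Each yielded dict would be JSON-encoded and sent as the `queries` parameter of one POST request.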
  8. Reconciliation Service API
     One endpoint
     ● GET returns service metadata
     ● POST asks service for a match or matches
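A framework-free sketch of the single-endpoint idea above might look like this. The names `SERVICE_METADATA`, `find_matches`, and `handle_request` are hypothetical, the metadata is a placeholder, and the matcher is a stub. The talk's actual service was built around Django, which is not shown here.

```python
import json

# Placeholder metadata; a real service would return its full description.
SERVICE_METADATA = {"name": "Example Reconciliation Service"}


def find_matches(query):
    """Hypothetical matcher stub: return a ranked list of candidate dicts."""
    return []


def handle_request(method, form=None):
    """Dispatch on HTTP method for a single reconciliation endpoint."""
    if method == "GET":
        # GET returns service metadata.
        return json.dumps(SERVICE_METADATA)
    if method == "POST":
        # POST carries a JSON-encoded "queries" form parameter.
        queries = json.loads(form["queries"])
        results = {key: {"result": find_matches(q)} for key, q in queries.items()}
        return json.dumps(results)
    raise ValueError("unsupported method: %s" % method)
```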
  9. GET
     service_metadata = {
         "name": "Influence Explorer Reconciliation",
         "identifierSpace": "",
         "schemaSpace": "",
         "view": { "url": "{{id}}" },
         "preview": { "url": "{{id}}", "width": 430, "height": 300 },
         "defaultTypes": [ ]
     }
  10. POST Ex. 1
      queries = {
          "q0": {"query": "GELMAN, MATTHEW", "type": "individual", "type_strict": "should"},
          "q1": {"query": "VAN DONGEN, DIRK W. MR.", "type": "individual", "type_strict": "should"},
          "q2": {"query": "PAXON, L WILLIAM", "type": "individual", "type_strict": "should"}
      }
  11. POST Ex. 1 Result
      {
          "q1": {"result": [
              {"score": 2, "type": ["individual"], "id": "6b2dc2da3e144aab802e5ea28a9b4330", "match": false, "name": "Dirk Van Dongen"}
          ]},
          "q0": {"result": [
              {"score": 1.7, "type": ["individual"], "id": "40a776e9833e47c9830490b8be21d7d3", "match": false, "name": "Matt Gelman"}
          ]},
          "q2": {"result": []}
      }
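As a rough illustration of reading a response shaped like the result above, the sketch below picks the highest-scoring candidate for each query key. The helper name `best_candidates` is hypothetical, and the `response` dict simply restates the values from the example result as Python literals.

```python
def best_candidates(response):
    """Map each query key to its top-scoring candidate, or None if there are no results."""
    best = {}
    for key, payload in response.items():
        results = payload.get("result", [])
        best[key] = max(results, key=lambda r: r["score"]) if results else None
    return best


# The example result from the slide, as a Python dict.
response = {
    "q1": {"result": [{"score": 2, "type": ["individual"],
                       "id": "6b2dc2da3e144aab802e5ea28a9b4330",
                       "match": False, "name": "Dirk Van Dongen"}]},
    "q0": {"result": [{"score": 1.7, "type": ["individual"],
                       "id": "40a776e9833e47c9830490b8be21d7d3",
                       "match": False, "name": "Matt Gelman"}]},
    "q2": {"result": []},
}

top = best_candidates(response)
```

Here `top["q2"]` is `None` because the service returned no candidates for that query, and neither of the other two candidates is flagged as a confident match.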
  12. POST Ex. 2: types from dataset col.
      queries = {
          "q0": {"query": "Coca-Cola Enterprises", "properties": [{"pid": "contributionType", "v": "Corporation"}]},
          "q1": {"query": "Coca-Cola Enterprises Inc", "properties": [{"pid": "contributionType", "v": "Corporation"}]},
          "q2": {"query": "Coca-Coca Enterprises,Inc.", "properties": [{"pid": "contributionType", "v": "Corporation"}]},
          "q3": {"query": "Coca-Cola Company", "properties": [{"pid": "contributionType", "v": "Corporation"}]}
      }
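A payload like the one above could be assembled from dataset rows along these lines. This is a sketch only: the helper name `build_queries` and the row layout (one dict per row, keyed by column name) are assumptions, while the `pid`/`v` property shape mirrors the example.

```python
def build_queries(rows, name_col, prop_cols):
    """Build a queries dict, attaching the given columns as properties."""
    queries = {}
    for i, row in enumerate(rows):
        queries["q%d" % i] = {
            "query": row[name_col],
            # Each extra column becomes a {"pid": column, "v": value} property.
            "properties": [{"pid": col, "v": row[col]} for col in prop_cols],
        }
    return queries
```

For example, `build_queries(rows, "name", ["contributionType"])` would produce entries equivalent to `q0` through `q3` above when `rows` holds those four names with a `contributionType` of `Corporation`.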
  13. POST Ex. 2 Results
      {
          "q1": {"result": [
              {"score": 4, "type": ["organization"], "id": "be61489cc7524b80b7672c9db1eb1aad", "match": true, "name": "Coca-Cola Co"},
              ...
              {"score": 4, "type": ["organization"], "id": "809977921c834a93a2a5ff27364f614f", "match": true, "name": "Coca-Cola Bottlers Assn"},
              {"score": 2, "type": ["organization"], "id": "ec4fa3ee098b4a64ae5da8d61f2034c9", "match": false, "name": "Philadelphia Coca-Cola Bottling"},
              {"score": 2, "type": ["organization"], "id": "ef9539d369994c15a9653ec218c29d17", "match": false, "name": "Florida Coca-Cola Bottling Co"}
          ]},
          "q0": {"result": [
              {"score": 4, "type": ["organization"], "id": ... }
          ]}
      }
  14. Woes
      ● Documentation
      ● Freebase-centric
      ● Service metadata is poorly defined and described
      ● Not RESTful
      ● Different formats to support for single and multiple requests
      ● Very few (if any) extant examples!
      ● Bad error handling
        ○ Tip: after adding a non-functional dev RS API, delete ~/.local/share/google/refine to fully refresh
        ○ Watching Refine's log and having good logging in your service are essential!
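One way to cope with the "different formats for single and multiple requests" woe is to normalize both into one shape before matching. The sketch below assumes the single form arrives as a `query` parameter (either a bare string or a JSON object) and the batch form as a JSON-encoded `queries` parameter. The helper name `normalize_queries` and the use of `q0` as the key for a single query are hypothetical choices.

```python
import json


def normalize_queries(params):
    """Return a {key: query_dict} mapping for both single and batch requests."""
    if "queries" in params:
        # Batch form: already a JSON object keyed by "q0", "q1", ...
        return json.loads(params["queries"])
    if "query" in params:
        raw = params["query"]
        try:
            parsed = json.loads(raw)
        except ValueError:
            parsed = None
        if isinstance(parsed, dict):
            # Single form sent as a JSON object.
            return {"q0": parsed}
        # Single form sent as a bare string.
        return {"q0": {"query": raw}}
    raise KeyError("expected a 'query' or 'queries' parameter")
```

A service built this way would still need to unwrap the `q0` entry when answering a single query, since the single-query response is not keyed the same way as the batch response.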
  15. Back in Refine...
      After reconciling:
      ● verification
      ● entity preview
        ○ Could be an additional Preview API, but we fudged it.
      ● export
        ○ Can break out relevant values from reconciliation results using Refine's JS-y language
          ■ (arbitrary info from our service!)
        ○ Don't use standard export, or you'll only get URLs in the column
  16. Demo
  17. Future
      ● Adding support for API keys
      ● Opening up to public (??)
      ● Extracting reusable components
        ○ Query parsing
        ○ Match/results formatting
      ● Establish conventions
        ○ schema for contextual data (e.g. party, district, state for politicians), for more flexible and better matching
  18. Questions?
  19. Contact &