OpenRefine Reconciliation Service API (DCPython 2013/03/05)
OpenRefine & Influence Explorer
Building a Reconciliation Service API
Alison Rowland, Project Lead, Influence Explorer
Sunlight Foundation
DCPython 2013/03/05
Influence Explorer
● connects the dots of political influence
● brings together datasets about:
  ○ lawmakers
  ○ lobbyists
  ○ corporations
● LOTS of huge datasets, from both government and NGOs
  ○ campaign finance
  ○ lobbying
  ○ EPA
  ○ gov. grants and contracts, plus contract violations
  ○ federal regulations, and more...
Entity Resolution
● datasets get matched to our entity universe
● datasets rarely have unique or common IDs for entities
Influence Explorer's Matching Framework
● Uses Name Cleaver for standardization
● Hooks into Django for ORM magic
● Heuristic-based
● Results good, but process is messy
  ○ Merge matching result tables
  ○ Export to spreadsheet
  ○ Humans verify by marking columns and deleting rows
  ○ Data munged and reimported into DB
OpenRefine
Manages the human process of verification
OpenRefine Features
● Cleaning: various built-in transforms, such as whitespace trimming, case cleanup, etc.
● Faceting: groups records by a column value and lets you see slices of your dataset
● Clustering: uses fuzzy algorithms (e.g. Levenshtein distance) to find, group, and clean up like records
● Reconciling: connects to an external API endpoint to match records with ones in some other dataset
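To make the clustering bullet concrete, here is a minimal sketch of grouping like records by Levenshtein distance. It is not Refine's own implementation; the `cluster` helper, its greedy strategy, and the `max_distance` cutoff are assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def cluster(names, max_distance=2):
    """Greedy grouping: a name joins the first cluster whose seed is close.

    Hypothetical helper; the comparison ignores case and outer whitespace.
    """
    clusters = []
    for name in names:
        key = name.lower().strip()
        for group in clusters:
            if levenshtein(key, group[0].lower().strip()) <= max_distance:
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters


groups = cluster(["Acme Corp", "Acme Corp.", "ACME Corp", "Globex"])
```

Here the three spellings of "Acme Corp" land in one group and "Globex" in its own; a human would then pick the canonical spelling for each group.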
How a Reconciliation Service Works
Refine sends the column value to the service, and the service looks for potential matches and sends back ranked results.
The service can flag matches with very high confidence. Back in Refine, the user can choose to auto-match those.
Refine queries in batches of ten.
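The flow above can be sketched from the service's side as a function that scores one column value against known entities and returns ranked candidates. This is an illustration only: the `ENTITIES` table, the `difflib`-based scoring, and the `AUTO_MATCH_THRESHOLD` cutoff are assumptions, not the heuristics Influence Explorer uses. The result fields (`id`, `name`, `score`, `match`) mirror those of reconciliation results.

```python
from difflib import SequenceMatcher

# Hypothetical entity universe; a real service would query its database.
ENTITIES = {
    "e1": "Acme Corporation",
    "e2": "Acme Holdings",
    "e3": "Globex Inc",
}

# Assumed cutoff for flagging a "very high confidence" match.
AUTO_MATCH_THRESHOLD = 0.95


def find_matches(value, limit=3):
    """Return up to `limit` candidates for one column value, best first."""
    candidates = []
    for entity_id, name in ENTITIES.items():
        score = SequenceMatcher(None, value.lower(), name.lower()).ratio()
        candidates.append({
            "id": entity_id,
            "name": name,
            "score": round(score, 3),
            # Refine can auto-match candidates flagged this way.
            "match": score >= AUTO_MATCH_THRESHOLD,
        })
    candidates.sort(key=lambda c: c["score"], reverse=True)
    return candidates[:limit]


exact = find_matches("acme corporation")
partial = find_matches("Globex")
```

An exact name comes back flagged as a match, while a partial name such as "Globex" is ranked first but left for the user to confirm.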
Reconciliation Service API
One endpoint
● GET returns service metadata
● POST asks service for a match or matches
http://transparencydata.com/api/1.0/refine/reconcile
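A rough sketch of the one-endpoint design, written as a plain function so it can be wired into Django or any other framework. The metadata values, the `reconcile_endpoint` name, and the error payloads are hypothetical; only the metadata key names (`name`, `identifierSpace`, `schemaSpace`) come from the reconciliation API.

```python
import json

# Hypothetical values; only the key names follow the reconciliation API.
SERVICE_METADATA = {
    "name": "Example Reconciliation Service (sketch)",
    "identifierSpace": "http://example.org/ns/entity-id",
    "schemaSpace": "http://example.org/ns/entity-type",
}


def reconcile_endpoint(method, params, matcher):
    """Handle a request to the single reconciliation URL.

    method  -- "GET" or "POST"
    params  -- decoded request parameters as a dict
    matcher -- callable taking a name and returning a list of result dicts
    """
    if method == "GET":
        # GET: describe the service.
        return json.dumps(SERVICE_METADATA)
    if method == "POST":
        # POST: look up matches for the submitted value.
        if "query" not in params:
            # Assumed error shape, for illustration only.
            return json.dumps({"error": "missing 'query' parameter"})
        return json.dumps({"result": matcher(params["query"])})
    return json.dumps({"error": "unsupported method: " + method})


def fake_matcher(name):
    """Stand-in matcher used only to exercise the sketch."""
    return [{"id": "e1", "name": name.title(), "score": 1.0, "match": True}]


metadata = json.loads(reconcile_endpoint("GET", {}, fake_matcher))
answer = json.loads(reconcile_endpoint("POST", {"query": "acme"}, fake_matcher))
```

This sketch handles only a single `query` parameter; batched requests use a different format.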
Woes
● Documentation
● Freebase-centric
● Service metadata is ill-defined and poorly described
● Not RESTful
● Different formats to support for single and multiple requests
● Very few (if any) extant examples!
● Bad error handling
  ○ Tip: after adding a non-functional dev RS API, delete ~/.local/share/google/refine to fully refresh
  ○ Watching Refine's log and having good logging in your service are essential!
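The single-versus-multiple woe can be handled by normalizing both request shapes before matching, and logging anything malformed. The sketch below assumes a single request arrives as a `query` parameter (a bare string or a JSON object with a `query` field) and a batch arrives as a `queries` parameter holding a JSON object keyed by ids such as `q0`. The function names and the error payload are hypothetical.

```python
import json
import logging

log = logging.getLogger("reconcile")


def parse_queries(params):
    """Normalize both request formats into (is_batch, {key: query_dict})."""
    if "queries" in params:
        # Multiple: {"q0": {"query": "..."}, "q1": {"query": "..."}}
        return True, json.loads(params["queries"])
    raw = params["query"]
    if raw.lstrip().startswith("{"):
        # Single, JSON form: {"query": "...", "limit": 3}
        single = json.loads(raw)
    else:
        # Single, bare-string form.
        single = {"query": raw}
    return False, {"single": single}


def respond(params, matcher):
    """Build the response body in the shape that mirrors the request."""
    try:
        is_batch, queries = parse_queries(params)
        results = {
            key: {"result": matcher(query["query"])}
            for key, query in queries.items()
        }
    except (KeyError, ValueError, TypeError) as exc:
        # Log the offending request; the error payload is an assumed shape.
        log.exception("bad reconciliation request: %r", params)
        return json.dumps({"error": str(exc)})
    if is_batch:
        return json.dumps(results)        # {"q0": {"result": [...]}, ...}
    return json.dumps(results["single"])  # {"result": [...]}


def fake_matcher(name):
    """Stand-in matcher used only to exercise the sketch."""
    return [{"id": "e1", "name": name.upper(), "score": 1.0, "match": True}]


single = json.loads(respond({"query": "acme"}, fake_matcher))
batch_request = {"queries": json.dumps({"q0": {"query": "a"}, "q1": {"query": "b"}})}
batch = json.loads(respond(batch_request, fake_matcher))
```

Keeping the parsing in one place means the matching code sees only one shape, and the logged traceback gives you something to compare against Refine's own log.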
Back in Refine...
After reconciling:
● verification
● entity preview
  ○ Could be an additional Preview API, but we fudged it.
● export
  ○ Can break out relevant values from reconciliation results using Refine's JS-y language
    ■ cell.recon.match.name (arbitrary info from our service!)
    ■ cell.recon.match.id
  ○ Don't use standard export, or you'll only get URLs in the column
Future
● Adding support for API keys
● Opening up to public (??)
● Extracting reusable components
  ○ Query parsing
  ○ Match/results formatting
● Establishing conventions
  ○ schema for contextual data (e.g. party, district, state for politicians), for more flexible and better matching