Open refine reconciliation service api (dc python 2013_03_05)

OpenRefine
& Influence Explorer
Building a Reconciliation Service API

Alison Rowland, Project Lead, Influence Explorer
Sunlight Foundation
DCPython 2013/03/05

Influence Explorer
● connects the dots of political influence
● brings together datasets about:
○ lawmakers
○ lobbyists
○ corporations
● LOTS of huge datasets, from both
government and NGOs
○ campaign finance
○ lobbying
○ EPA
○ gov. grants and contracts, plus contract violations
○ federal regulations, and more...

Entity Resolution
● datasets get matched to our entity universe
● rarely have unique or common IDs for
entities

Influence Explorer's Matching
Framework
● Uses Name Cleaver for standardization
● Hooks into Django for ORM magic
● Heuristic-based
● Results good, but process is messy
○ Merge matching result tables
○ Export to spreadsheet
○ Humans verify via marking columns, deleting rows
○ Data munged and reimported into DB

OpenRefine
Manages human process of verification

OpenRefine Features
● Cleaning: various built-in transforms, such
as whitespace trimming, case cleanup, etc.
● Faceting: groups records by a column value
and lets you see slices of your dataset
● Clustering: uses fuzzy algorithms (e.g.
Levenshtein distance) to find, group, and
clean up like records
● Reconciling: connects to an external API
endpoint to match records with ones in some
other dataset
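
As a rough illustration of the fuzzy comparison that clustering relies on, here is a minimal Levenshtein distance sketch in Python. It is not Refine's own implementation; the function name and sample strings are only for illustration.

```python
def levenshtein(a, b):
    """Return the number of single-character edits needed to turn a into b."""
    # previous[j] is the distance between the prefix of a processed so far
    # (initially empty) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]
```

Two names a small distance apart, such as "Coca-Cola" and "Coca-Coca", are candidates for the same cluster.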

How a Reconciliation Service Works
Refine sends the column value to the service,
and the service looks for potential matches
and sends back ranked results.

Service can flag matches with very high
confidence. Back in Refine, the user can
choose to auto-match those.

Refine queries in batches of ten.
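
The batching can be pictured with a small client-side sketch. This is not Refine's code: `build_batches` is a hypothetical helper, and restarting the `q0`, `q1`, ... keys in every batch is an assumption modeled on the POST examples in this deck.

```python
import json


def build_batches(values, batch_size=10):
    """Yield one JSON `queries` payload per batch of column values."""
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        # Key each query q0, q1, ... so results can be paired back up.
        queries = {"q%d" % i: {"query": value} for i, value in enumerate(batch)}
        yield json.dumps(queries)
```

A column of 23 values would produce three payloads: two with ten queries and one with three.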

Reconciliation Service API
One endpoint
● GET returns service metadata
● POST asks service for a match or matches

http://transparencydata.com/api/1.0/refine/reconcile

GET http://transparencydata.com/api/1.0/refine/reconcile
service_metadata = {
"name": "Influence Explorer Reconciliation",
"identifierSpace": "http://influenceexplorer.com/ns/entities",
"schemaSpace": "http://influenceexplorer.com/ns/entity.object.id",
"view": {
"url": "http://influenceexplorer.com/entity/{{id}}"
},
"preview": {
"url": "http://influenceexplorer.com/entity/{{id}}",
"width": 430,
"height": 300
},
"defaultTypes": [ ]
}
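
The one-endpoint, two-method shape can be sketched framework-free. This is a minimal sketch, not the Influence Explorer view: `reconcile_endpoint` and the `matcher` callable are hypothetical names, the metadata is trimmed to a few keys, and `form` stands in for whatever dict of request parameters your web framework provides.

```python
import json

# Trimmed copy of the service metadata, for illustration only.
SERVICE_METADATA = {
    "name": "Influence Explorer Reconciliation",
    "identifierSpace": "http://influenceexplorer.com/ns/entities",
    "defaultTypes": [],
}


def reconcile_endpoint(method, form, matcher):
    """Return the JSON response body for a request to the single endpoint.

    matcher is a hypothetical callable: query string -> list of result dicts.
    """
    if method == "GET":
        # GET: describe the service.
        return json.dumps(SERVICE_METADATA)
    # POST: `queries` holds a JSON object keyed q0, q1, ...
    queries = json.loads(form["queries"])
    results = {key: {"result": matcher(q["query"])} for key, q in queries.items()}
    return json.dumps(results)
```

Keeping the matching logic behind a plain callable makes the endpoint easy to exercise without a running server.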

POST Ex. 1
queries=
{
"q0": {"query": "GELMAN, MATTHEW",
"type": "individual", "type_strict": "should"},
"q1": {"query": "VAN DONGEN, DIRK W. MR.",
"type": "individual", "type_strict": "should"},
"q2": {"query": "PAXON, L WILLIAM",
"type": "individual", "type_strict": "should"}
}

POST Ex. 1 Result
{
"q1": { "result": [{"score": 2, "type": ["individual"],
"id": "6b2dc2da3e144aab802e5ea28a9b4330",
"match": false, "name": "Dirk Van Dongen"}]
},
"q0": { "result": [{"score": 1.7, "type": ["individual"],
"id": "40a776e9833e47c9830490b8be21d7d3",
"match": false, "name": "Matt Gelman"}]},
"q2": {"result": []}
}
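
A result list like the one above could be assembled along these lines. This is a sketch only: `format_results`, the tuple layout of `candidates`, and the `auto_match_score` threshold of 3.0 are assumptions for illustration, not the scoring rules the service actually uses.

```python
def format_results(candidates, auto_match_score=3.0):
    """Rank candidates and flag the high-confidence ones.

    candidates: list of (entity_id, name, entity_type, score) tuples.
    auto_match_score: hypothetical cutoff at or above which "match" is true.
    """
    ranked = sorted(candidates, key=lambda c: c[3], reverse=True)
    return {"result": [
        {
            "id": entity_id,
            "name": name,
            "type": [entity_type],
            "score": score,
            "match": score >= auto_match_score,
        }
        for entity_id, name, entity_type, score in ranked
    ]}
```

A query with no candidates simply yields an empty result list, as "q2" does in the example.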

POST Ex. 2: types from dataset col.
queries={
"q0":{"query":"Coca-Cola Enterprises", "properties":
[{"pid":"contributionType","v":"Corporation"}]},
"q1":{"query":"Coca-Cola Enterprises Inc", "properties":
[{"pid":"contributionType","v":"Corporation"}]},
"q2":{"query":"Coca-Coca Enterprises,Inc." ,"properties":
[{"pid":"contributionType","v":"Corporation"}]},
"q3":{"query":"Coca-Cola Company", "properties":
[{"pid":"contributionType","v":"Corporation"}]}
}
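
When the type comes from a dataset column rather than an explicit "type", the service has to translate the property value into one of its own entity types. A minimal sketch, assuming a lookup table: the mapping below and the helper name `entity_type_for` are hypothetical, inferred only from "Corporation" going in and "organization" coming back in these examples.

```python
# Hypothetical mapping from contributionType values to entity types.
CONTRIBUTION_TYPE_TO_ENTITY_TYPE = {
    "corporation": "organization",
    "individual": "individual",
}


def entity_type_for(query):
    """Pick the entity type for one query dict, or None if unknown."""
    if "type" in query:
        # An explicit type on the query wins.
        return query["type"]
    for prop in query.get("properties", []):
        if prop.get("pid") == "contributionType":
            value = str(prop.get("v", "")).strip().lower()
            return CONTRIBUTION_TYPE_TO_ENTITY_TYPE.get(value)
    return None
```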

POST Ex. 2 Results
{"q1": {"result": [
{"score": 4, "type": ["organization"],
"id": "be61489cc7524b80b7672c9db1eb1aad", "match": true, "name": "Coca-Cola Co"}, ...
"809977921c834a93a2a5ff27364f614f", "match": true, "name": "Coca-Cola Bottlers Assn"},
"ec4fa3ee098b4a64ae5da8d61f2034c9", "match": false, "name": "Philadelphia Coca-Cola Bottling"},
"ef9539d369994c15a9653ec218c29d17", "match": false, "name": "Florida Coca-Cola Bottling Co"}
]},
"q0": {"result": [
{"score": 4, "type": ["organization"], "id": ... }]} }

Woes
● Documentation
● Freebase-centric
● Service metadata is ill-defined and poorly described
● Not RESTful
● Different formats to support for single and
multiple requests
● Very few (if any) extant examples!
● Bad error handling
○ Tip: after adding a non-functional dev RS API, delete
~/.local/share/google/refine to fully refresh
○ Watching Refine's log and having good logging in
your service are essential!
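
One way to cope with the single vs. multiple request formats is to normalize both into the batch shape up front. A minimal sketch, assuming a single request arrives in a `query` parameter (either a bare string or a JSON object) and a batch arrives in `queries`; `parse_request` is a hypothetical helper and `form` stands in for your framework's request parameters.

```python
import json


def parse_request(form):
    """Normalize a request into (queries_dict, is_batch)."""
    if "queries" in form:
        # Batch mode: already a JSON object keyed q0, q1, ...
        return json.loads(form["queries"]), True
    raw = form["query"]
    if raw.lstrip().startswith("{"):
        # Single mode, JSON object form.
        query = json.loads(raw)
    else:
        # Single mode, bare string form.
        query = {"query": raw}
    # Wrap the single query so downstream code only handles one shape.
    return {"q0": query}, False
```

The `is_batch` flag lets the response be unwrapped again for single requests, so the matching code never needs to know which format came in.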

Back in Refine...
After reconciling:
● verification
● entity preview
○ Could be an additional Preview API, but we fudged
it.
● export
○ Can break out relevant values from reconciliation
results using Refine's JS-y language
■ cell.recon.match.name (arbitrary info from our
service!)
■ cell.recon.match.id
○ Don't use standard export, or you'll only get URLs in
the column

Future
● Adding support for API keys
● Opening up to public (??)
● Extracting reusable components
○ Query parsing
○ Match/results formatting
● Establish conventions
○ schema for contextual data (e.g. party, district, state
for politicians), for more flexible and better matching

Contact & Code
arowland@sunlightfoundation.com
@arowla

http://www.sunlightfoundation.com
http://www.influenceexplorer.com

http://www.github.com/sunlightlabs/datacommons
http://www.github.com/sunlightlabs/name-cleaver
