2. Influence Explorer
● connects the dots of political influence
● brings together datasets about:
○ lawmakers
○ lobbyists
○ corporations
● LOTS of huge datasets, from both
government and NGOs
○ campaign finance
○ lobbying
○ EPA
○ gov. grants and contracts, plus contract violations
○ federal regulations, and more...
4. Influence Explorer's Matching
Framework
● Uses Name Cleaver for standardization
● Hooks into Django for ORM magic
● Heuristic-based
● Results good, but process is messy
○ Merge matching result tables
○ Export to spreadsheet
○ Humans verify via marking columns, deleting rows
○ Data munged and reimported into DB
6. OpenRefine Features
● Cleaning: various built-in transforms, such
as whitespace trimming, case cleanup, etc.
● Faceting: groups records by a column value
and lets you see slices of your dataset
● Clustering: uses fuzzy algorithms (e.g.
Levenshtein distance) to find, group, and
clean up like records
● Reconciling: connects to an external API
endpoint to match records with ones in some
other dataset
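To make the clustering idea concrete, here is a minimal Python sketch of edit-distance grouping. It is not Refine's own implementation; the `cluster` helper, its greedy strategy, and the `max_distance` cutoff are assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def cluster(values, max_distance=2):
    """Greedy grouping: a value joins the first cluster whose first
    member is within max_distance edits (after trimming and lowercasing)."""
    clusters = []
    for value in values:
        key = value.strip().lower()
        for group in clusters:
            if levenshtein(key, group[0].strip().lower()) <= max_distance:
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters


groups = cluster(["Exxon Mobil", "Exxon Mobile", "ExxonMobil", "Chevron"])
```

With the sample values, the three Exxon spellings fall into one group and "Chevron" stays on its own. A human would still review each group before merging.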
7. How a Reconciliation Service Works
Refine sends the column value to the service,
and the service looks for potential matches and
sends back ranked results.
Service can flag matches with very high
confidence. Back in Refine, the user can
choose to auto-match those.
Refine queries in batches of ten.
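The exchange described above can be sketched as a pure Python function that takes one batch of queries and returns ranked candidates. The `ENTITIES` list, the `difflib`-based scoring, the `"entity"` type label, and the 0.95 auto-match cutoff are all assumptions for illustration, not the scoring used by our service.

```python
from difflib import SequenceMatcher

# Hypothetical in-memory dataset; a real service would query a database.
ENTITIES = [
    {"id": "e1", "name": "Nancy Pelosi"},
    {"id": "e2", "name": "Nancy Pelham"},
    {"id": "e3", "name": "John Boehner"},
]

AUTO_MATCH_THRESHOLD = 0.95  # assumed cutoff for flagging a sure match


def reconcile_one(query, limit=3):
    """Score every entity against the query string and rank the results."""
    scored = []
    for entity in ENTITIES:
        ratio = SequenceMatcher(
            None, query.lower(), entity["name"].lower()
        ).ratio()
        scored.append({
            "id": entity["id"],
            "name": entity["name"],
            "type": ["entity"],
            "score": round(ratio * 100, 1),
            # "match": True tells Refine this candidate is safe to auto-match.
            "match": ratio >= AUTO_MATCH_THRESHOLD,
        })
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:limit]


def reconcile_batch(queries):
    """queries looks like {"q0": {"query": "..."}, "q1": {...}, ...}."""
    return {
        key: {"result": reconcile_one(q["query"], q.get("limit", 3))}
        for key, q in queries.items()
    }


response = reconcile_batch({
    "q0": {"query": "nancy pelosi"},
    "q1": {"query": "Jon Boehner"},
})
```

Each key in the response mirrors a key in the request, so Refine can line the results back up with the cells it sent.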
14. Woes
● Documentation
● Freebase-centric
● Service metadata is ill-defined and poorly described
● Not RESTful
● Different formats to support for single and
multiple requests
● Very few (if any) extant examples!
● Bad error handling
○ Tip: after adding a non-functional dev RS API, delete
~/.local/share/google/refine to fully refresh
○ Watching Refine's log and having good logging in
your service are essential!
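The "different formats" woe above can be illustrated with a small normalizing layer. This is a sketch under assumptions: `params` stands for an already-decoded dict of request parameters, and the helper names `parse_request` and `format_response` are hypothetical.

```python
import json


def parse_request(params):
    """Normalize both request styles to ({key: query_dict}, is_batch)."""
    if "queries" in params:
        # Multiple mode: a JSON object keyed by ids such as "q0", "q1".
        return json.loads(params["queries"]), True
    if "query" in params:
        raw = params["query"]
        # Single mode: either a JSON object or a bare search string.
        if raw.lstrip().startswith("{"):
            query = json.loads(raw)
        else:
            query = {"query": raw}
        return {"q0": query}, False
    # Neither parameter present: a real service would answer with its
    # service metadata here; this sketch just returns no queries.
    return {}, False


def format_response(results, is_batch):
    """results maps each query key to its list of ranked matches."""
    if is_batch:
        return {key: {"result": matches} for key, matches in results.items()}
    # Single mode returns one unwrapped result object.
    return {"result": results["q0"]}


single, single_is_batch = parse_request({"query": "Exxon"})
multi, multi_is_batch = parse_request(
    {"queries": '{"q0": {"query": "Exxon"}, "q1": {"query": "Chevron"}}'}
)
```

Funnelling both styles through one internal shape keeps the matching code unaware of how the request arrived.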
15. Back in Refine...
After reconciling:
● verification
● entity preview
○ Could be an additional Preview API, but we fudged
it.
● export
○ Can break out relevant values from reconciliation
results using Refine's JS-y language
■ cell.recon.match.name (arbitrary info from our
service!)
■ cell.recon.match.id
○ Don't use standard export, or you'll only get URLs in
the column
17. Future
● Adding support for API keys
● Opening up to public (??)
● Extracting reusable components
○ Query parsing
○ Match/results formatting
● Establishing conventions
○ schema for contextual data (e.g. party, district, state
for politicians), for more flexible and better matching
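As a rough idea of how contextual data could help, the sketch below uses extra properties sent along with a query to break ties between identically named politicians. The property ids (`party`, `state`), the `POLITICIANS` records, and both helper functions are hypothetical; no such schema has been settled.

```python
# Hypothetical records; two politicians share the same name.
POLITICIANS = [
    {"id": "p1", "name": "John Smith", "party": "R", "state": "TX"},
    {"id": "p2", "name": "John Smith", "party": "D", "state": "OH"},
]


def context_from_query(query):
    """Collect {property id: value} pairs from the query's properties list."""
    return {
        prop["pid"]: prop["v"]
        for prop in query.get("properties", [])
        if "pid" in prop and "v" in prop
    }


def rank_with_context(query, candidates=POLITICIANS):
    """Return ids of name matches, ordered by how many context values agree."""
    context = context_from_query(query)
    ranked = []
    for candidate in candidates:
        if candidate["name"].lower() != query["query"].lower():
            continue
        agreeing = sum(
            1 for key, value in context.items() if candidate.get(key) == value
        )
        ranked.append((agreeing, candidate["id"]))
    ranked.sort(reverse=True)
    return [candidate_id for _, candidate_id in ranked]


ordered = rank_with_context({
    "query": "john smith",
    "properties": [{"pid": "state", "v": "OH"}],
})
```

Here the Ohio record is ranked first because its state agrees with the supplied context, while the name alone could not separate the two.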