OpenRefine
& Influence Explorer
Building a ReconciliationServiceAPI



   Alison Rowland, Project Lead, Influence Explorer
                Sunlight Foundation
               DCPython 2013/03/05
Influence Explorer
● connects the dots of political influence
● brings together datasets about:
   ○ lawmakers
   ○ lobbyists
   ○ corporations
● LOTS of huge datasets, from both
  government and NGOs
   ○   campaign finance
   ○   lobbying
   ○   EPA
   ○   gov. grants and contracts, plus contract violations
   ○   federal regulations, and more...
Entity Resolution
● datasets get matched to our entity universe
● rarely have unique or common IDs for
  entities
Influence Explorer's Matching
Framework
●   Uses Name Cleaver for standardization
●   Hooks into Django for ORM magic
●   Heuristic-based
●   Results good, but process is messy
    ○   Merge matching result tables
    ○   Export to spreadsheet
    ○   Humans verify via marking columns, deleting rows
    ○   Data munged and reimported into DB
OpenRefine
Manages human process of verification
OpenRefine Features
● Cleaning: various built-in transforms, such
  as whitespace trimming, case cleanup, etc.
● Faceting: groups records by a column value
  and lets you see slices of your dataset
● Clustering: uses fuzzy algorithms (e.g.
  Levenshtein distance) to find, group, and
  clean up like records
● Reconciling: connects to an external API
  endpoint to match records with ones in some
  other dataset
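To make the clustering bullet concrete, here is a minimal sketch of Levenshtein distance in plain Python. It illustrates the metric only and is not Refine's own implementation; the function name is ours.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb
            ))
        previous = current
    return previous[-1]
```

A small distance between two names (for example, a one-character typo) suggests they may be the same record, which is the kind of signal a clustering pass can group on.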
How a Reconciliation Service Works
Refine sends the column value to the service,
and the service looks for potential matches and
sends back ranked results.

Service can flag matches with very high
confidence. Back in Refine, the user can
choose to auto-match those.

Refine queries in batches of ten.
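As a rough illustration of the batching described above, the sketch below splits a column of values into groups of ten and builds one queries object per group, keyed q0, q1, and so on. The helper name and the per-batch key numbering are assumptions for illustration, not Refine's actual code.

```python
def build_batches(values, batch_size=10):
    """Yield one `queries` dict per batch of column values."""
    for start in range(0, len(values), batch_size):
        chunk = values[start:start + batch_size]
        # Keys restart at q0 in each batch (an assumption made for this sketch).
        yield {f"q{i}": {"query": value} for i, value in enumerate(chunk)}
```

Each yielded dict would be JSON-encoded and sent as the queries parameter of one POST, so a column of 23 values results in three requests.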
ReconciliationServiceAPI
One endpoint
● GET returns service metadata
● POST asks service for a match or matches

http://transparencydata.com/api/1.0/refine/reconcile
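The single-endpoint design can be sketched as one function that branches on the HTTP method. This is a simplified, framework-free illustration: the function name, the placeholder metadata, and the find_matches callback are all hypothetical, and a real service would also need to deal with details such as JSONP callbacks.

```python
import json

# Placeholder metadata; a real service returns its own values here.
SERVICE_METADATA = {"name": "Example Reconciliation Service", "defaultTypes": []}


def reconcile(method, form, find_matches):
    """Dispatch a request to the single reconciliation endpoint.

    method       -- HTTP method as a string
    form         -- dict of request parameters
    find_matches -- hypothetical callable: one query dict -> list of result dicts
    """
    if method == "GET":
        # GET describes the service.
        return SERVICE_METADATA
    if method == "POST":
        # POST carries a JSON object of queries keyed q0, q1, ...
        queries = json.loads(form["queries"])
        return {key: {"result": find_matches(query)}
                for key, query in queries.items()}
    raise ValueError(f"unsupported method: {method}")
```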
GET http://transparencydata.com/api/1.0/refine/reconcile
 service_metadata = {
             "name": "Influence Explorer Reconciliation",
             "identifierSpace": "http://influenceexplorer.com/ns/entities",
             "schemaSpace": "http://influenceexplorer.com/ns/entity.object.id",
             "view": {
                 "url": "http://influenceexplorer.com/entity/{{id}}"
              },
             "preview": {
                 "url": "http://influenceexplorer.com/entity/{{id}}",
                 "width": 430,
                 "height": 300
              },
             "defaultTypes": [ ]
 }
POST Ex. 1
queries=
   {
      "q0": {"query":"GELMAN, MATTHEW",
         "type":"individual" , "type_strict": "should"},
      "q1":{"query":"VAN DONGEN, DIRK W. MR.",
         "type":"individual" ,"type_strict":"should"},
      "q2":{"query":"PAXON, L WILLIAM",
         "type":"individual","type_strict":"should"}
   }
POST Ex. 1 Result
{
    "q1": { "result": [{"score": 2, "type": ["individual"],
            "id": "6b2dc2da3e144aab802e5ea28a9b4330",
            "match": false, "name": "Dirk Van Dongen"}]
    },
    "q0": { "result": [{"score": 1.7, "type": ["individual"],
       "id": "40a776e9833e47c9830490b8be21d7d3",
       "match": false, "name": "Matt Gelman"}]},
    "q2": {"result": []}
}
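A response like the one above could be assembled along these lines. This is a sketch, not the Influence Explorer code: the candidate structure and the match_threshold value are assumptions, chosen only so that high-scoring candidates get flagged for auto-matching.

```python
def format_results(candidates, match_threshold=3.0):
    """Rank candidates by score and flag the high-confidence ones.

    candidates      -- list of dicts with "id", "name", "type", and "score"
    match_threshold -- hypothetical cutoff at or above which "match" is true
    """
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    return {
        "result": [
            {
                "id": c["id"],
                "name": c["name"],
                "type": [c["type"]],  # the response carries types as a list
                "score": c["score"],
                "match": c["score"] >= match_threshold,
            }
            for c in ranked
        ]
    }
```

A query with no candidates simply yields {"result": []}, as q2 does in the example.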
POST Ex. 2: types from dataset col.
queries={
   "q0":{"query":"Coca-Cola Enterprises",
   "properties":[{"pid":"contributionType","v":"Corporation"}]},
   "q1":{"query":"Coca-Cola Enterprises Inc", "properties":
   [{"pid":"contributionType","v":"Corporation"}]},
   "q2":{"query":"Coca-Coca Enterprises,Inc." ,"properties":
   [{"pid":"contributionType","v":"Corporation"}]},
   "q3":{"query":"Coca-Cola Company", "properties":
   [{"pid":"contributionType","v":"Corporation"}]}
}
POST Ex. 2 Results
{"q1": {"result": [
     {"score": 4, "type": ["organization"],
      "id": "be61489cc7524b80b7672c9db1eb1aad",
      "match": true, "name": "Coca-Cola Co"}, ...
     {"score": 4, "type": ["organization"],
      "id": "809977921c834a93a2a5ff27364f614f",
      "match": true, "name": "Coca-Cola Bottlers Assn"},
     {"score": 2, "type": ["organization"],
      "id": "ec4fa3ee098b4a64ae5da8d61f2034c9",
      "match": false, "name": "Philadelphia Coca-Cola Bottling"},
     {"score": 2, "type": ["organization"],
      "id": "ef9539d369994c15a9653ec218c29d17",
      "match": false, "name": "Florida Coca-Cola Bottling Co"}
]},
"q0": {"result": [
     {"score": 4, "type": ["organization"], "id": ... }]} }
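In Ex. 2 the entity type is not sent directly; it arrives as a property taken from a dataset column. One way a service might turn that into a type is sketched below. The mapping table and the function name are hypothetical; only the contributionType pid and the "Corporation" value come from the example.

```python
# Hypothetical mapping from dataset column values to entity types.
TYPE_BY_CONTRIBUTION_TYPE = {
    "corporation": "organization",
    "individual": "individual",
}


def entity_type_for(query):
    """Return the entity type to search for, or None if none can be inferred."""
    # An explicit type on the query (as in Ex. 1) takes precedence.
    if "type" in query:
        return query["type"]
    # Otherwise look for a contributionType property (as in Ex. 2).
    for prop in query.get("properties", []):
        if prop.get("pid") == "contributionType":
            value = str(prop.get("v", "")).strip().lower()
            return TYPE_BY_CONTRIBUTION_TYPE.get(value)
    return None
```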
Woes
● Documentation
● Freebase-centric
● Service metadata is ill-defined and poorly described
● Not RESTful
● Different formats to support for single and
  multiple requests
● Very few (if any) extant examples!
● Bad error handling
    ○ Tip: after adding a non-functional dev RS API, delete
      ~/.local/share/google/refine to fully refresh
    ○ Watching Refine's log and having good logging in
      your service are essential!
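The "different formats" woe can be contained by normalizing both request shapes up front. The sketch below assumes that a single request arrives as a query parameter (either a bare string or a JSON object) and a batch arrives as a queries parameter; that reading of the protocol, and both helper names, are assumptions rather than a definitive account.

```python
import json


def normalize_request(form):
    """Return (queries, is_single) so the matching code sees only one shape."""
    if "queries" in form:
        # Multiple mode: a JSON object keyed q0, q1, ...
        return json.loads(form["queries"]), False
    raw = form["query"]
    if raw.lstrip().startswith("{"):
        # Single mode, JSON object form.
        query = json.loads(raw)
    else:
        # Single mode, bare string form.
        query = {"query": raw}
    return {"q0": query}, True


def format_response(results, is_single):
    """Unwrap the lone result for single mode; pass batches through unchanged."""
    return results["q0"] if is_single else results
```

With this in place, the matching logic only ever handles the batch shape, and the single-request case is unwrapped again just before the response is serialized.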
Back in Refine...
After reconciling:
● verification
● entity preview
   ○ Could be an additional Preview API, but we fudged
     it.
● export
   ○ Can break out relevant values from reconciliation
     results using Refine's JS-y language
      ■ cell.recon.match.name (arbitrary info from our
        service!)
      ■ cell.recon.match.id
   ○ Don't use standard export, or you'll only get URLs in
     the column
Demo
Future
● Adding support for API keys
● Opening up to public (??)
● Extracting reusable components
  ○ Query parsing
  ○ Match/results formatting
● Establish conventions
  ○ schema for contextual data (e.g. party, district, state
    for politicians), for more flexible and better matching
Questions?
Contact & Code
arowland@sunlightfoundation.com
@arowla

http://www.sunlightfoundation.com
http://www.influenceexplorer.com

http://www.github.com/sunlightlabs/datacommons
http://www.github.com/sunlightlabs/name-cleaver
