OMG! MY METADATA IS AS
  FRESH AS THE BACKSTREET
 BOYS: HOW GOOGLE REFINE
 CAN UPDATE, CLEAN UP AND
LINK YOUR METADATA TO THE
             WIDER WORLD
                 SARAH BETH WEEKS

   LIBRARY TECHNOLOGY CONFERENCE 2013

                   WEEKSS@STOLAF.EDU
                       @RASCALWHALE
SAMPLE PROJECT: NORDIC AMERICAN
                IMPRINTS

Situation: Wanted to match publishers of our books against a
list of important Nordic American Publishers (compiled by Penny
Huffman) to find materials for our special collections.
Problem: Publication info is hard to compare when it is not
controlled.
ANSWER: GOOGLE REFINE!

Google Refine can “match and
 merge” messy data filled with:
 Random, leading or trailing spaces
 stray punctuation
 typos
 odd capitalization
  and more!
1. CREATE YOUR PROJECT USING ANY
        SPREADSHEET
2. USE “COMMON TRANSFORMS” TO FIX
“WHITESPACE” PROBLEMS IN A SINGLE CLICK
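For readers who want to see what the whitespace transforms amount to, here is a minimal Python sketch. It assumes the two transforms in question are "trim leading and trailing whitespace" and "collapse consecutive whitespace". The function name and the sample publisher strings are made up for illustration and are not taken from the project data.

```python
import re

def common_whitespace_transforms(value: str) -> str:
    """Trim both ends, then collapse internal runs of whitespace to one space."""
    trimmed = value.strip()
    return re.sub(r"\s+", " ", trimmed)

# Hypothetical sample values: same publisher, different spacing.
messy = ["  Augsburg   Publishing House ", "Augsburg Publishing House"]
cleaned = [common_whitespace_transforms(v) for v in messy]
print(cleaned)
```

After the transform both strings are identical, so they count as one value instead of two.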
3. CLEAN UP STRAY CHARACTERS ([].?:) USING
   “TRANSFORM” AND REGULAR EXPRESSIONS
(OR JUST USE EXCEL FIND AND REPLACE FOR THIS)
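A rough Python equivalent of the regular expression step, assuming the stray characters are the five shown on the slide (square brackets, period, question mark, colon) and that the outer parentheses on the slide are just wrapping the list. The function name and sample value are illustrative only.

```python
import re

# Character class matching the stray characters named above: [ ] . ? :
STRAY_CHARACTERS = re.compile(r"[\[\].?:]")

def strip_stray_characters(value: str) -> str:
    """Delete every stray character; everything else is left alone."""
    return STRAY_CHARACTERS.sub("", value)

result = strip_stray_characters("[Minneapolis?] :")
print(repr(result))
```

Note that the result still has a trailing space where the colon used to be, which is one reason to run the whitespace transforms again afterwards. Removing every period also strips abbreviations such as "Co.", so it is worth checking that this is acceptable for your data.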
4. REPEAT COMMON TRANSFORMS
5. CLUSTER AND EDIT
(THIS IS WHERE THE MAGIC HAPPENS)
FUNCTION 1: FINGERPRINT
    (MOST RELIABLE)
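The idea behind fingerprint clustering can be sketched in a few lines of Python. This is a simplified approximation of key-collision clustering, not Google Refine's own code: each value is reduced to a normalized key, and values that share a key land in the same cluster. The function names and sample values are assumptions made for illustration.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Build a normalized key: ASCII-fold, lowercase, drop punctuation, sort unique tokens."""
    # Fold accented letters to ASCII where a decomposition exists.
    # Letters without one (for example the Nordic o-slash) are simply dropped here.
    folded = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    no_punct = folded.strip().lower().translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(no_punct.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values by fingerprint and keep only groups with more than one member."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical sample values.
sample = [
    "Augsburg Publishing House",
    "augsburg publishing house.",
    "House, Augsburg Publishing",
    "Lutheran Free Church",
]
clusters = cluster(sample)
print(clusters)
```

Because the key ignores case, punctuation and word order, the three Augsburg variants collapse into one cluster while the unrelated value stays out.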
NGRAM METHOD
 (STILL RELIABLE: MORE MATCHES BUT LESS
RELIABILITY AS YOU DECREASE NGRAM SIZE)
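A small Python sketch of an n-gram fingerprint may help show why a smaller n-gram size finds more matches. It is a simplification (it keeps only ASCII letters and digits), and the function name and test strings are illustrative assumptions.

```python
import re

def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Key made of the sorted, de-duplicated character n-grams of the value."""
    # Lowercase and drop everything except ASCII letters and digits (a simplification).
    compact = re.sub(r"[^a-z0-9]", "", value.lower())
    grams = {compact[i:i + n] for i in range(len(compact) - n + 1)}
    return "".join(sorted(grams))

print(ngram_fingerprint("Paris", 2))   # arispari
print(ngram_fingerprint("Paris", 1))   # aiprs

# A transposition typo: same letters, different order.
same_at_1 = ngram_fingerprint("Augsburg", 1) == ngram_fingerprint("Augsbrug", 1)
same_at_2 = ngram_fingerprint("Augsburg", 2) == ngram_fingerprint("Augsbrug", 2)
print(same_at_1, same_at_2)
```

With a size of 1 the key is just the set of letters, so the typo matches, but so would any anagram. With a size of 2 the two spellings produce different keys.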
PHONETIC MATCHING
(ESPECIALLY USEFUL WHEN DEALING WITH
          TRANSLATED TEXT)
(MORE FALSE MATCHES TO WATCH FOR
    WITH PHONETIC FUNCTIONS)
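To illustrate how a phonetic key both helps and misleads, here is a sketch using American Soundex. Soundex is a stand-in chosen because it is short and well known; it is not necessarily one of the phonetic functions Google Refine offers, and the names below are made up for the example.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [ch for ch in word.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # vowels reset the previous code to ""
    return (result + "000")[:4]

for name in ("Anderson", "Andersen", "Andrews"):
    print(name, soundex(name))
```

"Anderson" and "Andersen" share a key, which is the kind of match you want. "Andrews" gets the same key too, which is the kind of false match to watch for.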
NEAREST NEIGHBOR (PPM) MATCHING
(SLOWER AND MORE FALSE MATCHES BUT
 CATCHES WHAT OTHER METHODS MISS)
(SET RADIUS HIGHER, BLOCK CHARACTERS
  LOWER TO GENERATE MORE MATCHES)
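The effect of the radius and block character settings can be sketched in Python. This example swaps in Levenshtein edit distance for PPM because it is much shorter to write down, and it is only an approximation of how blocking works: two values are compared only if they share at least one substring of the block size. All names and sample values are assumptions for illustration.

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Classic edit distance using two rows of the dynamic programming table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def blocks(value: str, block_chars: int) -> set:
    """All substrings of length block_chars, used to decide which pairs get compared."""
    text = value.lower()
    return {text[i:i + block_chars] for i in range(len(text) - block_chars + 1)}

def near_pairs(values, radius=1, block_chars=6):
    """Pairs that share a block and are within the given edit distance."""
    pairs = []
    for a, b in combinations(values, 2):
        if blocks(a, block_chars) & blocks(b, block_chars):
            if levenshtein(a.lower(), b.lower()) <= radius:
                pairs.append((a, b))
    return pairs

values = ["Augsburg", "Augsbrug", "Ausburg"]  # hypothetical typos
print(near_pairs(values, radius=1, block_chars=6))
print(near_pairs(values, radius=1, block_chars=3))
print(near_pairs(values, radius=2, block_chars=3))
```

With large blocks and a small radius nothing is paired. Lowering the block size lets "Augsburg" and "Ausburg" be compared, and raising the radius then also admits the transposed spelling.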
AFTER USING OTHER METHODS, RUN
THROUGH FINGERPRINT AND NGRAM AGAIN
BE AWARE THAT THINGS THAT WEREN’T
 CLUSTERED WON’T HAVE BEEN FIXED
6. USE THE TEXT FACET TO SEE ALL
         UNIQUE VALUES
YOU CAN SCROLL THROUGH THE LIST TO
     SPOT CHECK FOR PROBLEMS
CLICK EDIT TO TYPE NEW TEXT FOR ALL
       CELLS WITH THIS VALUE
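In spirit, a text facet is a count of each distinct value in a column, and the edit link is a bulk replace of one value. A minimal Python sketch, with made-up function names and sample cities:

```python
from collections import Counter

def text_facet(column):
    """Distinct values with their counts, most frequent first."""
    return Counter(column).most_common()

def edit_all(column, old, new):
    """Replace every cell equal to `old` with `new`, leaving other cells untouched."""
    return [new if value == old else value for value in column]

# Hypothetical city column.
cities = ["Minneapolis", "Minneapolis", "Mpls", "Decorah"]
facet = text_facet(cities)
print(facet)

fixed = edit_all(cities, "Mpls", "Minneapolis")
print(text_facet(fixed))
```

Scanning the facet shows "Mpls" as its own value. After the edit the column has two unique values instead of three.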
OTHER CLEAN-UP WE DID:
     PUBLISHERS
OTHER CLEAN-UP WE DID:
      GIFT NOTES
ALSO WORKS FOR NUMBERS/DATES
END RESULT?

 Using Google Refine we were able to reduce the
  3230 unique values for city (260|a) to just 1153. For
  publishers (260|b) we went from 11342 unique
  names for publishers to approximately 6500.
 This project helped to identify over 2,000 potential
  candidates for our Nordic American Imprints
  collection. (These are still being evaluated.)
 The controlled publishers, cities of publication and
  dates will be added to a local 9xx field for faceting in
  our future special collections discovery tool. Users will
  be able to browse our Nordic American Imprints
  collection by publisher, city or state.
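To put the counts above in proportion, the reduction can be worked out as a percentage. The short Python sketch below uses only the figures quoted above. The publisher result is approximate because the final count is given as "approximately 6500", and the function name is just for illustration.

```python
def percent_reduction(before: int, after: int) -> float:
    """Share of unique values eliminated, as a percentage rounded to one decimal."""
    return round(100 * (before - after) / before, 1)

city = percent_reduction(3230, 1153)        # 260|a
publisher = percent_reduction(11342, 6500)  # 260|b, approximate
print(city, publisher)
```

That works out to roughly 64 percent fewer unique city values and roughly 43 percent fewer unique publisher names.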
BUT WAIT! THERE’S MORE!!
     LINKED DATA!!!
FREEBASE IS THE DEFAULT SERVICE
(WIKIPEDIA-ESQUE DATA OWNED BY GOOGLE)
CHOOSE THE RIGHT “TYPE” AND MOST
   CELLS WILL BE AUTO-MATCHED
FOR THE REST CLICK THE OPTIONS TO
     SEE WHAT EACH REPRESENTS
 Then click “Match All Identical Cells” (or double checkmarks)
  to link all cells with this text to this Freebase topic
OR “SEARCH FOR MATCH” TO BRING UP
 AN AUTO-FILL LIST TO CHOOSE FROM
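Behind the scenes, reconciliation means sending each cell value, plus the chosen type, to a service and getting ranked candidates back. The Python sketch below only builds a batch request body in roughly the shape used by reconciliation services (a "queries" form parameter holding JSON). It does not contact any service, and the type identifier and function name are hypothetical placeholders.

```python
import json
from urllib.parse import parse_qs, urlencode

def build_reconciliation_request(names, type_id, limit=3):
    """Form-encoded body with one query per name, keyed q0, q1, and so on."""
    queries = {
        f"q{i}": {"query": name, "type": type_id, "limit": limit}
        for i, name in enumerate(names)
    }
    return urlencode({"queries": json.dumps(queries)})

# "/hypothetical/publisher" is a placeholder, not a real type identifier.
body = build_reconciliation_request(
    ["Augsburg Publishing House", "Lutheran Free Church"],
    "/hypothetical/publisher",
)

# Decode it again to show what the service would receive.
decoded = json.loads(parse_qs(body)["queries"][0])
print(decoded["q0"])
```

Choosing the right type narrows the candidates, which is why most cells can be matched automatically once the type is set.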
EVEN COOLER: NOW YOU CAN BRING
    DATA IN FROM FREEBASE!
CHOOSE WHAT INFO YOU WANT TO ADD
THIS NEW DATA IS NOW ADDED TO YOUR
           SPREADSHEET
TO SEE WHAT COLUMNS (DATA) YOU CAN
        ADD FROM FREEBASE:
Browse the properties at: http://schemas.freebaseapps.com/
MATCH LOCAL SUBJECT HEADING TO LC
    (FREEYOURMETADATA.ORG)
SPARQL ENDPOINTS

 Install the RDF Extension for Google Refine
  http://refine.deri.ie/
 SPARQL Endpoints
 http://labs.mondeca.com/sparqlEndpointsStatus/index.html
 CKAN Data Hub: http://datahub.io/dataset/
ADD SPARQL-BASED RECONCILIATION
            SERVICE
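As a rough idea of what a SPARQL-based lookup involves, the Python sketch below builds a query that finds resources whose label matches a cell value, ignoring case. This is an illustrative assumption about the kind of query involved, not the query the RDF Extension actually generates, and the function name is made up. It only builds the query text and does not contact an endpoint.

```python
def label_lookup_query(label: str, limit: int = 5) -> str:
    """SPARQL query for resources whose rdfs:label equals `label`, case-insensitively."""
    # Escape backslashes and double quotes so the value is a valid string literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?entity ?label WHERE {\n"
        "  ?entity rdfs:label ?label .\n"
        f'  FILTER (LCASE(STR(?label)) = LCASE("{escaped}"))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )

query = label_lookup_query("Norwegian Americans")
print(query)
```

An exact label match like this is strict. A real reconciliation service would normally also score partial matches.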
THANK YOU!

Questions?

Link to a public version of this presentation
 at my (personal) blog:
     gardenandalibrary.blogspot.com
I’m also happy to take questions by e-mail:
              weekss@stolaf.edu
