OMG! My metadata is as fresh as the Backstreet Boys: How Google Refine can update, clean up and link your metadata to the wider world

A tutorial on using Open Refine based on a sample project of standardizing the names of cities of publication.

  1. OMG! MY METADATA IS AS FRESH AS THE BACKSTREET BOYS: HOW GOOGLE REFINE CAN UPDATE, CLEAN UP AND LINK YOUR METADATA TO THE WIDER WORLD. SARAH BETH WEEKS, LIBRARY TECHNOLOGY CONFERENCE 2013, WEEKSS@STOLAF.EDU, @RASCALWHALE
  2. SAMPLE PROJECT: NORDIC AMERICAN IMPRINTS. Situation: Wanted to match publishers of our books against a list of important Nordic American publishers (compiled by Penny Huffman) to find materials for our special collections. Problem: Hard to compare when publication info is not controlled.
  3. ANSWER: GOOGLE REFINE! Google Refine can “match and merge” messy data filled with: random, leading or trailing spaces; stray punctuation; typos; odd capitalization; and more!
  4. STEP 1: CREATE YOUR PROJECT USING ANY SPREADSHEET
  5. STEP 2: USE “COMMON TRANSFORMS” TO FIX “WHITESPACE” PROBLEMS IN A SINGLE CLICK
  6. STEP 3: CLEAN UP STRAY CHARACTERS ([].?:) USING “TRANSFORM” AND REGULAR EXPRESSIONS (OR JUST USE EXCEL FIND AND REPLACE FOR THIS)
  7. STEP 4: REPEAT COMMON TRANSFORMS
  8. STEP 5: CLUSTER AND EDIT
  9. (THIS IS WHERE THE MAGIC HAPPENS)
  10. FUNCTION 1: FINGERPRINT (MOST RELIABLE)
  11. NGRAM METHOD (STILL RELIABLE: MORE MATCHES BUT LESS RELIABILITY AS YOU DECREASE NGRAM SIZE)
  12. PHONETIC MATCHING (ESPECIALLY USEFUL WHEN DEALING WITH TRANSLATED TEXT)
  13. (MORE FALSE MATCHES TO WATCH FOR WITH PHONETIC FUNCTIONS)
  14. NEAREST NEIGHBOR (PPM) MATCHING (SLOWER AND MORE FALSE MATCHES BUT CATCHES WHAT OTHER METHODS MISS)
  15. (SET RADIUS HIGHER, BLOCK CHARACTERS LOWER TO GENERATE MORE MATCHES)
  16. AFTER USING OTHER METHODS, RUN THROUGH FINGERPRINT AND NGRAM AGAIN
  17. BE AWARE THAT THINGS THAT WEREN’T CLUSTERED WON’T HAVE BEEN FIXED
  18. STEP 6: USE THE TEXT FACET TO SEE ALL UNIQUE VALUES
  19. YOU CAN SCROLL THROUGH THE LIST TO SPOT CHECK FOR PROBLEMS
  20. CLICK EDIT TO TYPE NEW TEXT FOR ALL CELLS WITH THIS VALUE
  21. OTHER CLEAN-UP WE DID: PUBLISHERS
  22. OTHER CLEAN-UP WE DID: GIFT NOTES
  23. ALSO WORKS FOR NUMBERS/DATES
  24. END RESULT? Using Google Refine we were able to reduce the 3230 unique values for city (260|a) to just 1153. For publishers (260|b) we went from 11342 unique names to approximately 6500. This project helped to identify over 2,000 potential candidates for our Nordic American Imprints collection. (These are still being evaluated.) The controlled publishers, cities of publication and dates will be added to a local 9xx field for faceting in our future special collections discovery tool. Users will be able to browse our Nordic American Imprints collection by publisher, city or state.
  25. BUT WAIT! THERE’S MORE!! LINKED DATA!!!
  26. FREEBASE IS THE DEFAULT SERVICE (WIKIPEDIA-ESQUE DATA OWNED BY GOOGLE)
  27. CHOOSE THE RIGHT “TYPE” AND MOST CELLS WILL BE AUTO-MATCHED
  28. FOR THE REST, CLICK THE OPTIONS TO SEE WHAT EACH REPRESENTS. Then click “Match All Identical Cells” (or double checkmarks) to link all cells with this text to this Freebase topic.
  29. OR “SEARCH FOR MATCH” TO BRING UP AN AUTO-FILL LIST TO CHOOSE FROM
  30. EVEN COOLER: NOW YOU CAN BRING DATA IN FROM FREEBASE!
  31. CHOOSE WHAT INFO YOU WANT TO ADD
  32. THIS NEW DATA IS NOW ADDED TO YOUR SPREADSHEET
  33. TO SEE WHAT COLUMNS (DATA) YOU CAN ADD FROM FREEBASE: Browse the properties at: http://schemas.freebaseapps.com/
  34. MATCH LOCAL SUBJECT HEADING TO LC (FREEYOURMETADATA.ORG)
  35. SPARQL ENDPOINTS. Install the RDF Extension for Google Refine: http://refine.deri.ie/ SPARQL Endpoints: http://labs.mondeca.com/sparqlEndpointsStatus/index.html CKAN Data Hub: http://datahub.io/dataset/
  36. ADD SPARQL-BASED RECONCILIATION SERVICE
  37. THANK YOU! Questions? Link to a public version of this presentation at my (personal) blog: gardenandalibrary.blogspot.com. I’m also happy to take questions by e-mail: weekss@stolaf.edu
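
For readers who want to see what the whitespace clean-up, stray-character removal and fingerprint clustering described in the slides amount to, here is a minimal Python sketch. It approximates the idea behind fingerprint keying (trim, lowercase, strip punctuation, sort and de-duplicate the words) and is not Refine's own code; the function names `fingerprint` and `cluster` and the sample city strings are made up for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key so that near-identical strings collide."""
    # Trim leading/trailing whitespace and lowercase.
    text = value.strip().lower()
    # Fold accented characters down to plain ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop stray punctuation such as [ ] . ? : and commas.
    text = re.sub(r"[^\w\s]", "", text)
    # Split on whitespace, de-duplicate and sort the tokens.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep groups with 2+ variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical messy city-of-publication values.
cities = ["Minneapolis, Minn.", " minneapolis minn", "[Minneapolis] Minn.?", "Chicago"]
for group in cluster(cities):
    print(group)
```

Values that end up alone in their group (here, "Chicago") are left untouched, which mirrors the warning in the slides that anything not clustered has not been fixed.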

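
The slide on the n-gram method notes that smaller n-gram sizes produce more matches at the cost of reliability. The sketch below illustrates why, under the assumption that the key is built from the sorted set of character n-grams of the cleaned string; `ngram_fingerprint` is a hypothetical name and the misspelling "Chicgao" is an invented example.

```python
import re
import unicodedata


def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Key a string by its sorted, de-duplicated character n-grams."""
    # Lowercase and fold accented characters to ASCII.
    text = unicodedata.normalize("NFKD", value.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove everything except letters and digits (spaces included).
    text = re.sub(r"[^a-z0-9]", "", text)
    # Collect every run of n consecutive characters.
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))


# With n=1 a transposed-letter typo gets the same key as the correct spelling.
print(ngram_fingerprint("Chicago", 1) == ngram_fingerprint("Chicgao", 1))
# With n=2 the keys differ, so the typo is no longer matched.
print(ngram_fingerprint("Chicago", 2) == ngram_fingerprint("Chicgao", 2))
```

A size of 1 ignores letter order entirely, so it catches transpositions but will also merge unrelated anagrams; larger sizes are stricter.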