OMG! My metadata is as fresh as the Backstreet Boys: How Google Refine can update, clean up and link your metadata to the wider world
A tutorial on using Open Refine based on a sample project of standardizing the names of cities of publication.

Published in: Technology

Transcript

  • 1. OMG! MY METADATA IS AS FRESH AS THE BACKSTREET BOYS: HOW GOOGLE REFINE CAN UPDATE, CLEAN UP AND LINK YOUR METADATA TO THE WIDER WORLD. SARAH BETH WEEKS, LIBRARY TECHNOLOGY CONFERENCE 2013. WEEKSS@STOLAF.EDU, @RASCALWHALE
  • 2. SAMPLE PROJECT: NORDIC AMERICAN IMPRINTS. Situation: Wanted to match publishers of our books against a list of important Nordic American publishers (compiled by Penny Huffman) to find materials for our special collections. Problem: Hard to compare when publication info is not controlled.
  • 3. ANSWER: GOOGLE REFINE! Google Refine can “match and merge” messy data filled with: random, leading, or trailing spaces; stray punctuation; typos; odd capitalization; and more!
  • 4. CREATE YOUR PROJECT USING ANY SPREADSHEET
  • 5. USE “COMMON TRANSFORMS” TO FIX “WHITESPACE” PROBLEMS IN A SINGLE CLICK
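For readers who want to see what the whitespace transforms on slide 5 amount to, here is a minimal Python sketch. It is not Refine's own code; the function names and the sample city strings are made up for illustration.

```python
import re

def trim_whitespace(value: str) -> str:
    """Remove leading and trailing whitespace."""
    return value.strip()

def collapse_whitespace(value: str) -> str:
    """Replace each run of whitespace characters with a single space."""
    return re.sub(r"\s+", " ", value)

# Illustrative values only, not taken from the project data.
cities = ["  Minneapolis ", "St.  Paul", "Decorah\t"]
cleaned = [collapse_whitespace(trim_whitespace(c)) for c in cities]
print(cleaned)
```

Trimming first and collapsing second means a tab or double space inside a name ends up as one ordinary space.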
  • 6. 3. CLEAN UP STRAY CHARACTERS ([].?:) USING “TRANSFORM” AND REGULAR EXPRESSIONS (OR JUST USE EXCEL FIND AND REPLACE FOR THIS)
  • 7. 4. REPEAT COMMON TRANSFORMS
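Steps 3 and 4 (slides 6 and 7) can be sketched in Python as one pass: a regular expression removes the stray characters named on the slide, then the whitespace clean-up runs again. This is an illustration in Python rather than the expression typed into Refine, and the sample strings are invented.

```python
import re

# The characters listed on slide 6: [ ] . ? :
STRAY_CHARACTERS = re.compile(r"[\[\].?:]")

def remove_stray_characters(value: str) -> str:
    """Delete stray punctuation, then repeat the whitespace clean-up."""
    without_punctuation = STRAY_CHARACTERS.sub("", value)
    return re.sub(r"\s+", " ", without_punctuation).strip()

# Illustrative values only.
samples = ["[Minneapolis]", "Chicago :", "Decorah, Iowa?"]
print([remove_stray_characters(s) for s in samples])
```

Note that removing every period also shortens abbreviations ("St. Paul" becomes "St Paul"), so it is worth checking the result before moving on.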
  • 8. 5. CLUSTER AND EDIT
  • 9. (THIS IS WHERE THE MAGIC HAPPENS)
  • 10. FUNCTION 1: FINGERPRINT (MOST RELIABLE)
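A rough Python sketch of the idea behind fingerprint clustering follows: each value is reduced to a normalized key, and values that share a key land in the same cluster. The steps are modeled on the commonly described fingerprint method (trim, lowercase, strip accents and punctuation, sort and de-duplicate the words); Refine's own implementation may differ in details, and the sample values are invented.

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(value: str) -> str:
    """Build a key that ignores case, accents, punctuation and word order."""
    text = value.strip().lower()
    # Decompose accented letters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation, keeping letters, digits and whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Sort the unique words so word order no longer matters.
    return " ".join(sorted(set(text.split())))

def cluster(values):
    """Group values by key and keep only groups with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Illustrative values only.
print(cluster(["Minneapolis, Minn.", "minneapolis minn", "Chicago"]))
```

Because the key only discards noise (case, punctuation, word order), two values rarely share a key by accident, which fits the slide's "most reliable" label.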
  • 11. NGRAM METHOD (STILL RELIABLE: MORE MATCHES BUT LESS RELIABILITY AS YOU DECREASE NGRAM SIZE)
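The trade-off on slide 11 can be seen in a small Python sketch of an n-gram key. It is a simplified stand-in for Refine's n-gram fingerprint (no accent handling, for instance), and the misspelling is invented for the example.

```python
import re

def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Key made of the sorted, unique n-character chunks of the value."""
    # Lowercase and drop everything that is not a letter or digit.
    text = re.sub(r"[\W_]+", "", value.lower())
    if len(text) <= n:
        return text
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))

# With n=2 a transposed pair of letters changes the key...
print(ngram_fingerprint("Chicago", 2) == ngram_fingerprint("Chicaog", 2))
# ...with n=1 only the set of letters matters, so the typo matches.
print(ngram_fingerprint("Chicago", 1) == ngram_fingerprint("Chicaog", 1))
```

With n=1 any two values built from the same letters collide, which is why smaller sizes catch more typos but also produce more false matches.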
  • 12. PHONETIC MATCHING (ESPECIALLY USEFUL WHEN DEALING WITH TRANSLATED TEXT)
  • 13. (MORE FALSE MATCHES TO WATCH FOR WITH PHONETIC FUNCTIONS)
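To make slides 12 and 13 concrete, here is a Python sketch of a phonetic key. It uses classic Soundex purely because it is short; that is a swapped-in stand-in, not one of the phonetic algorithms Refine itself offers, and the names are invented examples.

```python
# Soundex digit for each consonant group; vowels, h, w and y get no digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(word: str) -> str:
    """Return a four-character Soundex key such as 'O425'."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two equal codes
        code = CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets this, so a repeated code counts again
    return (letters[0].upper() + "".join(digits) + "000")[:4]

print(soundex("Olsen"), soundex("Olson"))    # same key: a useful match
print(soundex("Robert"), soundex("Rupert"))  # same key: a false match to watch for
```

The second pair shows why phonetic clusters need a careful look before merging.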
  • 14. NEAREST NEIGHBOR (PPM) MATCHING (SLOWER AND MORE FALSE MATCHES BUT CATCHES WHAT OTHER METHODS MISS)
  • 15. (SET RADIUS HIGHER, BLOCK CHARACTERS LOWER TO GENERATE MORE MATCHES)
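The effect of the radius and block-character settings on slides 14 and 15 can be sketched in Python. Two assumptions to flag loudly: the sketch uses Levenshtein edit distance instead of PPM, and the blocking step (only compare values that share a chunk of `block_chars` characters) is a simplified guess at how the blocking behaves, not Refine's code. The default numbers and sample values are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b))) # substitution
        previous = current
    return previous[-1]

def near_pairs(values, radius=1, block_chars=6):
    """Pairs of values that share a block and lie within the radius."""
    blocks = defaultdict(set)
    for value in values:
        text = value.lower()
        for i in range(max(len(text) - block_chars + 1, 1)):
            blocks[text[i:i + block_chars]].add(value)
    pairs = set()
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            if levenshtein(a.lower(), b.lower()) <= radius:
                pairs.add((a, b))
    return sorted(pairs)

# Illustrative values only.
values = ["Minneapolis", "Mineapolis", "Chicago"]
print(near_pairs(values, radius=1, block_chars=6))
print(near_pairs(values, radius=1, block_chars=9))
```

With a block size of 6 the two spellings share a chunk and get compared; with 9 they never meet. A larger radius likewise accepts pairs that are further apart, so both knobs trade speed and precision for more matches.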
  • 16. AFTER USING OTHER METHODS, RUN THROUGH FINGERPRINT AND NGRAM AGAIN
  • 17. BE AWARE THAT THINGS THAT WEREN’T CLUSTERED WON’T HAVE BEEN FIXED
  • 18. 6. USE THE TEXT FACET TO SEE ALL UNIQUE VALUES
  • 19. YOU CAN SCROLL THROUGH THE LIST TO SPOT CHECK FOR PROBLEMS
  • 20. CLICK EDIT TO TYPE NEW TEXT FOR ALL CELLS WITH THIS VALUE
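What the text facet and its Edit link do (slides 18 to 20) can be mimicked in a few lines of Python: count each unique value, then rewrite every cell that holds a given value. The function names and sample data are hypothetical.

```python
from collections import Counter

def text_facet(values):
    """Unique values with their counts, sorted by name like a facet list."""
    return sorted(Counter(values).items())

def edit_value(values, old, new):
    """Replace every cell equal to `old` with `new`."""
    return [new if value == old else value for value in values]

# Illustrative values only.
cities = ["Chicago", "Chicago", "Chicgo", "Decorah"]
print(text_facet(cities))
fixed = edit_value(cities, "Chicgo", "Chicago")
print(text_facet(fixed))
```

Scanning the sorted list is how leftovers that no clustering method caught tend to show up.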
  • 21. OTHER CLEAN-UP WE DID: PUBLISHERS
  • 22. OTHER CLEAN-UP WE DID: GIFT NOTES
  • 23. ALSO WORKS FOR NUMBERS/DATES
  • 24. END RESULT? Using Google Refine we were able to reduce the 3230 unique values for city (260|a) to just 1153. For publishers (260|b) we went from 11342 unique names for publishers to approximately 6500. This project helped to identify over 2,000 potential candidates for our Nordic American Imprints collection. (These are still being evaluated.) The controlled publishers, cities of publication and dates will be added to a local 9xx field for faceting in our future special collections discovery tool. Users will be able to browse our Nordic American Imprints collection by publisher, city or state.
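As a quick check on the size of the clean-up reported on slide 24, the reductions work out as follows. The helper name is made up, and the publisher figure is only as precise as the slide's "approximately 6500".

```python
def percent_reduction(before: int, after: int) -> float:
    """Share of unique values eliminated, as a percentage."""
    return round(100 * (before - after) / before, 1)

city_reduction = percent_reduction(3230, 1153)        # 260|a, cities
publisher_reduction = percent_reduction(11342, 6500)  # 260|b, approximate
print(city_reduction, publisher_reduction)
```

So roughly two thirds of the city variants and a bit over 40% of the publisher variants were merged away.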
  • 25. BUT WAIT! THERE’S MORE!! LINKED DATA!!!
  • 26. FREEBASE IS THE DEFAULT SERVICE (WIKIPEDIA-ESQUE DATA OWNED BY GOOGLE)
  • 27. CHOOSE THE RIGHT “TYPE” AND MOST CELLS WILL BE AUTO-MATCHED
  • 28. FOR THE REST, CLICK THE OPTIONS TO SEE WHAT EACH REPRESENTS. Then click “Match All Identical Cells” (or the double checkmarks) to link all cells with this text to this Freebase topic.
  • 29. OR “SEARCH FOR MATCH” TO BRING UP AN AUTO-FILL LIST TO CHOOSE FROM
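The reconciliation flow on slides 27 to 29 can be pictured with a toy Python model. Everything here is hypothetical: the candidate table stands in for a reconciliation service, the topic IDs are invented, and "auto-match when exactly one candidate has the chosen type" is a deliberate simplification of how a real service scores matches.

```python
# Hypothetical stand-in for a reconciliation service: text -> (topic ID, type).
CANDIDATES = {
    "Minneapolis": [("city/minneapolis-mn", "city")],
    "Decorah": [("city/decorah-ia", "city")],
    "Northfield": [("city/northfield-mn", "city"), ("city/northfield-il", "city")],
}

def reconcile(cells, wanted_type):
    """Auto-match unambiguous values; return the rest for manual review."""
    matched, needs_review = {}, {}
    for text in sorted(set(cells)):
        options = [topic for topic, kind in CANDIDATES.get(text, [])
                   if kind == wanted_type]
        if len(options) == 1:
            matched[text] = options[0]
        else:
            needs_review[text] = options
    return matched, needs_review

def match_all_identical(cells, matched, text, chosen_id):
    """Link every cell holding `text` to the topic picked by the user."""
    decided = {**matched, text: chosen_id}
    return [decided.get(cell) for cell in cells]

cells = ["Minneapolis", "Northfield", "Northfield", "Decorah"]
matched, needs_review = reconcile(cells, "city")
print(needs_review)
print(match_all_identical(cells, matched, "Northfield", "city/northfield-mn"))
```

One manual choice for "Northfield" is applied to both cells with that text, which is the point of the double checkmarks.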
  • 30. EVEN COOLER: NOW YOU CAN BRING DATA IN FROM FREEBASE!
  • 31. CHOOSE WHAT INFO YOU WANT TO ADD
  • 32. THIS NEW DATA IS NOW ADDED TO YOUR SPREADSHEET
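Pulling new columns in from a linked source (slides 30 to 32) is essentially a lookup by topic ID. The Python sketch below assumes a hypothetical property table and invented IDs; a real service would supply both.

```python
# Hypothetical property table keyed by topic ID.
PROPERTIES = {
    "city/decorah-ia": {"state": "Iowa"},
    "city/minneapolis-mn": {"state": "Minnesota"},
}

def add_column(rows, prop):
    """Return copies of the rows with one extra column looked up by topic ID."""
    return [
        {**row, prop: PROPERTIES.get(row.get("topic_id"), {}).get(prop)}
        for row in rows
    ]

# Illustrative rows; the last one was never matched to a topic.
rows = [
    {"city": "Decorah", "topic_id": "city/decorah-ia"},
    {"city": "Minneapolis", "topic_id": "city/minneapolis-mn"},
    {"city": "Unknown", "topic_id": None},
]
extended = add_column(rows, "state")
print(extended)
```

Rows without a match simply get an empty cell in the new column.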
  • 33. TO SEE WHAT COLUMNS (DATA) YOU CAN ADD FROM FREEBASE: Browse the properties at: http://schemas.freebaseapps.com/
  • 34. MATCH LOCAL SUBJECT HEADING TO LC (FREEYOURMETADATA.ORG)
  • 35. SPARQL ENDPOINTS. Install the RDF Extension for Google Refine: http://refine.deri.ie/; SPARQL Endpoints: http://labs.mondeca.com/sparqlEndpointsStatus/index.html; CKAN Data Hub: http://datahub.io/dataset/
  • 36. ADD SPARQL-BASED RECONCILIATION SERVICE
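To give a feel for what a SPARQL-based reconciliation service asks an endpoint (slide 36), here is a Python sketch that builds a simple label-lookup query. This is an assumption about the general shape of such a query, not the query the RDF Extension actually sends; the function name is hypothetical and nothing is sent over the network.

```python
def label_query(label: str, limit: int = 5) -> str:
    """Build a SPARQL query finding entities whose rdfs:label equals `label`."""
    # Escape backslashes and double quotes so the label is a valid string literal.
    escaped = label.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?entity ?label WHERE {\n"
        "  ?entity rdfs:label ?label .\n"
        f'  FILTER (LCASE(STR(?label)) = LCASE("{escaped}"))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )

query = label_query("Decorah")
print(query)
```

The case-insensitive comparison mirrors the kind of forgiving match that makes reconciliation useful on messy catalog data.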
  • 37. THANK YOU! Questions? Link to a public version of this presentation at my (personal) blog: gardenandalibrary.blogspot.com. I’m also happy to take questions by e-mail: weekss@stolaf.edu