OMG! My metadata is as fresh as the Backstreet Boys: How Google Refine can update, clean up and link your metadata to the wider world
 


A tutorial on using Open Refine based on a sample project of standardizing the names of cities of publication.

    Presentation Transcript

    • OMG! MY METADATA IS AS FRESH AS THE BACKSTREET BOYS: HOW GOOGLE REFINE CAN UPDATE, CLEAN UP AND LINK YOUR METADATA TO THE WIDER WORLD. SARAH BETH WEEKS, LIBRARY TECHNOLOGY CONFERENCE 2013, WEEKSS@STOLAF.EDU, @RASCALWHALE
    • SAMPLE PROJECT: NORDIC AMERICAN IMPRINTS. Situation: We wanted to match the publishers of our books against a list of important Nordic American publishers (compiled by Penny Huffman) to find materials for our special collections. Problem: It is hard to compare values when publication info is not controlled.
    • ANSWER: GOOGLE REFINE! Google Refine can “match and merge” messy data filled with random leading or trailing spaces, stray punctuation, typos, odd capitalization and more!
    • 1. CREATE YOUR PROJECT USING ANY SPREADSHEET
    • 2. USE “COMMON TRANSFORMS” TO FIX “WHITESPACE” PROBLEMS IN A SINGLE CLICK
    • 3. CLEAN UP STRAY CHARACTERS ([].?:) USING “TRANSFORM” AND REGULAR EXPRESSIONS (OR JUST USE EXCEL FIND AND REPLACE FOR THIS)
    • 4. REPEAT COMMON TRANSFORMS
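For readers who want to see what steps 2 to 4 amount to outside the tool, here is a minimal Python sketch. It is not Refine's own code: the function name `clean_value` and the sample strings are made up for illustration, and the character set is the one listed on the slide for step 3.

```python
import re

# Stray characters named on the slide: [ ] . ? :
STRAY_CHARS = re.compile(r"[\[\]\.\?:]")
# Any run of whitespace (spaces, tabs, newlines)
MULTI_SPACE = re.compile(r"\s+")


def clean_value(value: str) -> str:
    """Strip stray punctuation, then collapse and trim whitespace."""
    without_punctuation = STRAY_CHARS.sub("", value)
    # Removing punctuation can leave new leading/trailing spaces behind,
    # which is why the whitespace fix runs after the character clean-up.
    return MULTI_SPACE.sub(" ", without_punctuation).strip()


print(clean_value("  [Minneapolis, Minn.] :  "))
print(clean_value("Decorah,  Iowa?"))
```

Running the whitespace fix after the punctuation fix mirrors the reason for repeating the common transforms in step 4.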
    • 5. CLUSTER AND EDIT
    • (THIS IS WHERE THE MAGIC HAPPENS)
    • FUNCTION 1: FINGERPRINT (MOST RELIABLE)
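A rough idea of how fingerprint clustering groups values, as a Python sketch. This approximates the general key-collision idea (normalize each value to a key, then group values that share a key) and is an assumption about the approach rather than Refine's actual implementation; `fingerprint`, `cluster` and the sample city names are illustrative.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key: trim, lowercase, fold accents,
    drop ASCII punctuation, then sort and de-duplicate the words."""
    text = value.strip().lower()
    # Fold accented letters to plain ASCII where a decomposition exists
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values, key_func=fingerprint):
    """Group distinct values that share a key; keep groups of 2 or more."""
    groups = defaultdict(set)
    for value in values:
        groups[key_func(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


cities = ["Decorah, Iowa", "Iowa Decorah", "decorah iowa.", "Chicago"]
print(cluster(cities))
```

Because the key ignores case, punctuation and word order, only values that differ in those ways collide, which is consistent with the slide calling this the most reliable method.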
    • NGRAM METHOD (STILL RELIABLE: MORE MATCHES BUT LESS RELIABILITY AS YOU DECREASE NGRAM SIZE)
    • PHONETIC MATCHING (ESPECIALLY USEFUL WHEN DEALING WITH TRANSLATED TEXT)
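The trade-off on this slide can be seen in a small Python sketch of an n-gram key. It is an approximation of the idea, not Refine's code; `ngram_fingerprint` and the misspelling "Chicgao" are invented for the example.

```python
import re


def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Key made of the sorted, de-duplicated character n-grams of a value,
    after lowercasing and removing punctuation and whitespace."""
    text = re.sub(r"[\W_]+", "", value.lower())
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))


# With n=2, a transposed pair of letters produces a different key...
print(ngram_fingerprint("Chicago", 2) == ngram_fingerprint("Chicgao", 2))
# ...with n=1, only the set of letters matters, so the typo is matched.
print(ngram_fingerprint("Chicago", 1) == ngram_fingerprint("Chicgao", 1))
```

A size of 1 matches the typo, but it would also match any other string built from the same letters, which is the loss of reliability the slide warns about.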
    • (MORE FALSE MATCHES TO WATCH FOR WITH PHONETIC FUNCTIONS)
    • NEAREST NEIGHBOR (PPM) MATCHING (SLOWER AND MORE FALSE MATCHES BUT CATCHES WHAT OTHER METHODS MISS)
    • (SET RADIUS HIGHER, BLOCK CHARACTERS LOWER TO GENERATE MORE MATCHES)
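To show how the radius and block-character settings interact, here is a Python sketch of nearest-neighbor matching. Note the substitution: it uses Levenshtein edit distance instead of the PPM distance named on the slide, because edit distance is easy to show in a few lines. The blocking rule (only compare values that share a substring of `block_chars` characters) and all function names are assumptions made for illustration.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def blocks(value: str, size: int) -> set:
    """All lowercase substrings of the given length."""
    text = value.lower()
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def near_pairs(values, radius=1, block_chars=6):
    """Pairs that share a block and lie within `radius` edits of each other."""
    pairs = []
    for a, b in combinations(sorted(set(values)), 2):
        if blocks(a, block_chars) & blocks(b, block_chars):
            if levenshtein(a.lower(), b.lower()) <= radius:
                pairs.append((a, b))
    return pairs


print(near_pairs(["Decorah", "Dekorah"], radius=1, block_chars=6))
print(near_pairs(["Decorah", "Dekorah"], radius=1, block_chars=3))
```

In this sketch the pair is skipped with 6 block characters because the two spellings share no 6-character run, and found with 3. Lowering the block size means more pairs get compared, which is also why the method is slower.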
    • AFTER USING OTHER METHODS, RUN THROUGH FINGERPRINT AND NGRAM AGAIN
    • BE AWARE THAT THINGS THAT WEREN’T CLUSTERED WON’T HAVE BEEN FIXED
    • 6. USE THE TEXT FACET TO SEE ALL UNIQUE VALUES
    • YOU CAN SCROLL THROUGH THE LIST TO SPOT CHECK FOR PROBLEMS
    • CLICK EDIT TO TYPE NEW TEXT FOR ALL CELLS WITH THIS VALUE
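The text facet and its edit action boil down to counting unique values and replacing one value everywhere. A minimal Python sketch, with hypothetical helper names `text_facet` and `edit_value` and made-up sample data:

```python
from collections import Counter


def text_facet(column):
    """Unique values with their counts, sorted by value for spot checking."""
    return sorted(Counter(column).items())


def edit_value(column, old, new):
    """Replace every cell equal to `old` with `new`, as the facet edit does."""
    return [new if cell == old else cell for cell in column]


cities = ["Decorah", "Chicago", "Decorah", "Chigago"]
for value, count in text_facet(cities):
    print(value, count)

cities = edit_value(cities, "Chigago", "Chicago")
print(text_facet(cities))
```

Sorting by value puts near-duplicates next to each other, which is what makes scrolling the list an effective spot check for anything clustering missed.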
    • OTHER CLEAN-UP WE DID: PUBLISHERS
    • OTHER CLEAN-UP WE DID: GIFT NOTES
    • ALSO WORKS FOR NUMBERS/DATES
    • END RESULT? Using Google Refine we were able to reduce the 3,230 unique values for city (260|a) to just 1,153. For publishers (260|b) we went from 11,342 unique names to approximately 6,500. This project helped to identify over 2,000 potential candidates for our Nordic American Imprints collection (these are still being evaluated). The controlled publishers, cities of publication and dates will be added to a local 9xx field for faceting in our future special collections discovery tool. Users will be able to browse our Nordic American Imprints collection by publisher, city or state.
    • BUT WAIT! THERE’S MORE!! LINKED DATA!!!
    • FREEBASE IS THE DEFAULT SERVICE (WIKIPEDIA-ESQUE DATA OWNED BY GOOGLE)
    • CHOOSE THE RIGHT “TYPE” AND MOST CELLS WILL BE AUTO-MATCHED
    • FOR THE REST, CLICK THE OPTIONS TO SEE WHAT EACH REPRESENTS. Then click “Match All Identical Cells” (or the double checkmarks) to link all cells with this text to this Freebase topic.
    • OR “SEARCH FOR MATCH” TO BRING UP AN AUTO-FILL LIST TO CHOOSE FROM
    • EVEN COOLER: NOW YOU CAN BRING DATA IN FROM FREEBASE!
    • CHOOSE WHAT INFO YOU WANT TO ADD
    • THIS NEW DATA IS NOW ADDED TO YOUR SPREADSHEET
    • TO SEE WHAT COLUMNS (DATA) YOU CAN ADD FROM FREEBASE: Browse the properties at: http://schemas.freebaseapps.com/
    • MATCH LOCAL SUBJECT HEADING TO LC (FREEYOURMETADATA.ORG)
    • SPARQL ENDPOINTS. Install the RDF Extension for Google Refine: http://refine.deri.ie/ SPARQL Endpoints: http://labs.mondeca.com/sparqlEndpointsStatus/index.html CKAN Data Hub: http://datahub.io/dataset/
    • ADD SPARQL-BASED RECONCILIATION SERVICE
    • THANK YOU!Questions?Link to a public version of this presentation at my (personal) blog: gardenandalibrary.blogspot.comI’m also happy to take questions by e- mail weekss@stolaf.edu