OMG! My metadata is as fresh as the Backstreet Boys: How Google Refine can update, clean up and link your metadata to the wider world


Sarah Weeks

A tutorial on using OpenRefine (formerly Google Refine), based on a sample project that standardized the names of cities of publication.




1. OMG! MY METADATA IS AS FRESH AS THE BACKSTREET BOYS: HOW GOOGLE REFINE CAN UPDATE, CLEAN UP AND LINK YOUR METADATA TO THE WIDER WORLD
SARAH BETH WEEKS
LIBRARY TECHNOLOGY CONFERENCE 2013
WEEKSS@STOLAF.EDU
@RASCALWHALE

2. SAMPLE PROJECT: NORDIC AMERICAN IMPRINTS
Situation: We wanted to match the publishers of our books against a list of important Nordic American publishers (compiled by Penny Huffman) to find materials for our special collections.
Problem: Publication information is hard to compare when it is not controlled.

3. ANSWER: GOOGLE REFINE!
Google Refine can “match and merge” messy data filled with:
• random leading or trailing spaces
• stray punctuation
• typos
• odd capitalization
• and more!

4. CREATE YOUR PROJECT USING ANY SPREADSHEET

5. USE “COMMON TRANSFORMS” TO FIX “WHITESPACE” PROBLEMS IN A SINGLE CLICK
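
As a rough illustration of what the whitespace transforms on this slide do, here is a minimal Python sketch. It assumes the two relevant transforms are trimming leading/trailing whitespace and collapsing consecutive whitespace; the function name and the sample city strings are made up for the example and are not taken from the project data.

```python
import re


def trim_and_collapse(value: str) -> str:
    """Squeeze runs of whitespace to one space, then strip both ends."""
    return re.sub(r"\s+", " ", value).strip()


# Hypothetical sample values, not from the actual catalog export.
cities = ["  Minneapolis ", "Minneapolis", "Decorah,   Iowa"]
cleaned = [trim_and_collapse(c) for c in cities]
print(cleaned)
```

After this step the first two values are identical, so they count as one unique value instead of two.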

6. STEP 3: CLEAN UP STRAY CHARACTERS ([].?:) USING “TRANSFORM” AND REGULAR EXPRESSIONS (OR JUST USE EXCEL FIND AND REPLACE FOR THIS)
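
The stray-character cleanup can be sketched with a regular expression outside of Refine as well. The Python below is an illustration only: it removes exactly the characters named on the slide ([ ] . ? :), and the function name and sample strings are hypothetical.

```python
import re

# Character class matching the stray characters listed on the slide: [ ] . ? :
STRAY = re.compile(r"[\[\].?:]")


def strip_stray(value: str) -> str:
    """Delete the stray characters, then trim any whitespace left behind."""
    return STRAY.sub("", value).strip()


print(strip_stray("[Chicago?] :"))
print(strip_stray("Minneapolis, Minn. :"))
```

Note that removing every period also shortens abbreviations ("Minn." becomes "Minn"), which is usually harmless for matching but worth knowing before you run it on a whole column.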

7. STEP 4: REPEAT COMMON TRANSFORMS

8. STEP 5: CLUSTER AND EDIT

9. (THIS IS WHERE THE MAGIC HAPPENS)

10. FUNCTION 1: FINGERPRINT (MOST RELIABLE)
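
To give a feel for why fingerprint clustering is so reliable, here is a minimal Python sketch of a fingerprint-style key: trim, lowercase, fold accents, drop punctuation, then sort and de-duplicate the words. It approximates the idea rather than reproducing Refine's own code, and the publisher strings are hypothetical examples.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build an order- and punctuation-insensitive key for a string."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Drop the combining marks that NFKD splits off accented letters.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    kept = []
    for ch in text:
        if ch.isspace():
            kept.append(" ")
        elif ch.isprintable() and ch not in string.punctuation:
            # Removes ASCII punctuation and control characters only.
            kept.append(ch)
    return " ".join(sorted(set("".join(kept).split())))


def cluster(values):
    """Group values by key; keep groups holding more than one spelling."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(set(g)) > 1]


# Hypothetical sample values.
names = ["Augsburg Publishing House.", "publishing house, Augsburg", "Decorah-Posten"]
print(cluster(names))
```

Two values only cluster when their keys are exactly equal, which is why this method produces few false matches.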

11. NGRAM METHOD (STILL RELIABLE: MORE MATCHES BUT LESS RELIABILITY AS YOU DECREASE NGRAM SIZE)
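
The trade-off on this slide (smaller n-gram size, more matches, less reliability) can be seen in a small Python sketch of an n-gram key. This is an approximation of the idea under the assumption that the key is the sorted set of character n-grams after removing spaces and punctuation; the function name and sample misspelling are hypothetical.

```python
import re


def ngram_fingerprint(value: str, n: int = 2) -> str:
    """Key made of the sorted, de-duplicated character n-grams of a string."""
    text = re.sub(r"[\W_]+", "", value.lower())  # drop spaces and punctuation
    grams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(grams))


# A transposed-letter typo matches at n=1 but not at n=2.
print(ngram_fingerprint("Decorah", 1) == ngram_fingerprint("Decroah", 1))
print(ngram_fingerprint("Decorah", 2) == ngram_fingerprint("Decroah", 2))
```

At n=1 the key is just the set of letters used, so any anagram matches; that is where the extra (and sometimes wrong) matches come from.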

12. PHONETIC MATCHING (ESPECIALLY USEFUL WHEN DEALING WITH TRANSLATED TEXT)

13. (MORE FALSE MATCHES TO WATCH FOR WITH PHONETIC FUNCTIONS)
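
To show why phonetic keys catch spelling variants but also produce false matches, here is a short Python sketch using classic Soundex. Soundex is swapped in only because it is compact; it is not claimed to be one of the phonetic functions Refine offers. The surnames are hypothetical examples.

```python
# Letter-to-digit table for classic American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the four-character Soundex code for a word."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (encoded + "000")[:4]


print(soundex("Olson"), soundex("Olsen"))    # a useful match
print(soundex("Larson"), soundex("Larkin"))  # a false match to watch for
```

The second pair gets the same code even though the names are unrelated, which is the kind of cluster you need to reject by hand.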

14. NEAREST NEIGHBOR (PPM) MATCHING (SLOWER AND MORE FALSE MATCHES BUT CATCHES WHAT OTHER METHODS MISS)

15. (SET RADIUS HIGHER, BLOCK CHARACTERS LOWER TO GENERATE MORE MATCHES)
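
The effect of the radius and block-character settings can be sketched in Python. This example swaps in Levenshtein edit distance for PPM because it is easier to show in a few lines, and it assumes "block characters" means two values are only compared if they share a substring of that length. All names and sample values are hypothetical.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def blocks(value: str, block_chars: int) -> set:
    """All substrings of length block_chars, lowercased."""
    text = value.lower()
    return {text[i:i + block_chars] for i in range(len(text) - block_chars + 1)}


def near_pairs(values, radius=1, block_chars=6):
    """Pairs that share a block and lie within the given edit distance.

    For brevity this checks every pair; a real implementation would index
    the blocks so that most pairs are never compared at all.
    """
    pairs = []
    for a, b in combinations(sorted(set(values)), 2):
        if blocks(a, block_chars) & blocks(b, block_chars):
            if levenshtein(a.lower(), b.lower()) <= radius:
                pairs.append((a, b))
    return pairs


print(near_pairs(["Chicago", "Chcago"], radius=1, block_chars=6))
print(near_pairs(["Chicago", "Chcago"], radius=1, block_chars=3))
```

With six block characters the short misspelling is never compared; lowering the block size to three lets the pair through, at the cost of more comparisons and more candidates to review.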

16. AFTER USING OTHER METHODS, RUN THROUGH FINGERPRINT AND NGRAM AGAIN

17. BE AWARE THAT THINGS THAT WEREN’T CLUSTERED WON’T HAVE BEEN FIXED

18. STEP 6: USE THE TEXT FACET TO SEE ALL UNIQUE VALUES

19. YOU CAN SCROLL THROUGH THE LIST TO SPOT CHECK FOR PROBLEMS

20. CLICK EDIT TO TYPE NEW TEXT FOR ALL CELLS WITH THIS VALUE
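
Conceptually, the text facet is a count of each distinct value, and the edit link is a bulk replace of one value with another. A minimal Python sketch of both, using hypothetical function names and sample data:

```python
from collections import Counter


def text_facet(values):
    """List (value, count) pairs, most frequent first, then alphabetical."""
    return sorted(Counter(values).items(), key=lambda item: (-item[1], item[0]))


def edit_value(values, old, new):
    """Replace every cell equal to `old` with `new`."""
    return [new if v == old else v for v in values]


# Hypothetical column of cities after clustering.
column = ["Chicago", "Chicago", "Chicago, Ill", "Decorah"]
print(text_facet(column))

column = edit_value(column, "Chicago, Ill", "Chicago")
print(text_facet(column))
```

Leftover variants that no clustering method caught show up as low-count rows in the facet, which is what makes scrolling the list a useful spot check.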

21. OTHER CLEAN-UP WE DID: PUBLISHERS

22. OTHER CLEAN-UP WE DID: GIFT NOTES

23. ALSO WORKS FOR NUMBERS/DATES

24. END RESULT?
• Using Google Refine, we reduced the 3,230 unique values for city (260|a) to just 1,153. For publishers (260|b), we went from 11,342 unique names to approximately 6,500.
• This project helped identify over 2,000 potential candidates for our Nordic American Imprints collection. (These are still being evaluated.)
• The controlled publishers, cities of publication and dates will be added to a local 9xx field for faceting in our future special collections discovery tool. Users will be able to browse our Nordic American Imprints collection by publisher, city or state.
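
To put the counts on this slide in proportion, the reduction can be worked out as a percentage. The Python below only does arithmetic on the figures quoted above; the publisher result is approximate because the slide gives the final count as "approximately 6500".

```python
def percent_reduction(before: int, after: int) -> float:
    """Share of unique values eliminated, as a percentage to one decimal."""
    return round(100 * (before - after) / before, 1)


city = percent_reduction(3230, 1153)        # city of publication (260|a)
publisher = percent_reduction(11342, 6500)  # publisher (260|b), approximate

print(f"City values reduced by {city}%")
print(f"Publisher values reduced by about {publisher}%")
```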

25. BUT WAIT! THERE’S MORE!! LINKED DATA!!!

26. FREEBASE IS THE DEFAULT SERVICE (WIKIPEDIA-ESQUE DATA OWNED BY GOOGLE)

27. CHOOSE THE RIGHT “TYPE” AND MOST CELLS WILL BE AUTO-MATCHED

28. FOR THE REST CLICK THE OPTIONS TO SEE WHAT EACH REPRESENTS
• Then click “Match All Identical Cells” (or the double checkmarks) to link all cells with this text to this Freebase topic

29. OR “SEARCH FOR MATCH” TO BRING UP AN AUTO-FILL LIST TO CHOOSE FROM

30. EVEN COOLER: NOW YOU CAN BRING DATA IN FROM FREEBASE!

31. CHOOSE WHAT INFO YOU WANT TO ADD

32. THIS NEW DATA IS NOW ADDED TO YOUR SPREADSHEET

33. TO SEE WHAT COLUMNS (DATA) YOU CAN ADD FROM FREEBASE:
Browse the properties at: http://schemas.freebaseapps.com/

34. MATCH LOCAL SUBJECT HEADING TO LC (FREEYOURMETADATA.ORG)

35. SPARQL ENDPOINTS
• Install the RDF Extension for Google Refine: http://refine.deri.ie/
• SPARQL Endpoints: http://labs.mondeca.com/sparqlEndpointsStatus/index.html
• CKAN Data Hub: http://datahub.io/dataset/
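
For readers who have not used a SPARQL endpoint before, this Python sketch shows what a query request looks like under the standard SPARQL protocol: the query text travels in a `query` URL parameter. The endpoint address is a placeholder (example.org), not a real service, and the query is a generic "first five triples" example. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Prepare (but do not send) an HTTP GET request for a SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for results in the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Placeholder endpoint and a generic query, both for illustration only.
ENDPOINT = "https://example.org/sparql"
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"

request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would run the query against a live endpoint chosen from one of the lists above.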

36. ADD SPARQL-BASED RECONCILIATION SERVICE

37. THANK YOU!
Questions?
Link to a public version of this presentation at my (personal) blog: gardenandalibrary.blogspot.com
I’m also happy to take questions by e-mail: weekss@stolaf.edu
