Topics:
* Merging records
* Building clustering tools
* Moving MARC data in and out of OpenRefine
* Integrations (OCLC and Alma)
Recording on Youtube: https://youtu.be/2pPru42ShqY
3. Topics
This may be a bit of an eclectic session, as I'm interested
in covering a range of topics that have come up during
this period of remote work.
Merging records
Integrations (OCLC and Alma)
Building clustering tools
Moving MARC data in and out of OpenRefine
4. Merge Records Tool
Allows users to merge MARC data from two files
Allows users to merge unique data, selected data, and all data.
5. Merge Records Tool
The Merge Records tool functions in two modes:
1. Merging data across two files
◦ Use Case: Merging one or more fields from one file into another based on a match point
2. Merging duplicate data into a single record set
◦ Use Case: The source file contains multiple duplicates or versions of records. The goal is to merge all like records into one using a match point.
6. Merging Records
Merging data across two files
◦ This mode is activated when the Source File and the Merge File are different.
7. Merging Records
Merging data: consolidating data in the same file
◦ This mode is activated when the Source File and the Merge File are the same.
8. Merging Records
How merging occurs
◦ Matches occur by setting the Identifier
◦ The Identifier is either a field or a field + subfield
◦ Merges can use multiple fields, delimited by a pipe. For example: 001|901$a
◦ What if there is no common match point?
◦ You can use the MARC21 Option
◦ The Unicode Encoded option changes how field data is evaluated. By default, values are evaluated as binary data, but with this option selected, data is evaluated as characters. This is important when working with multibyte languages.
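The match-point logic above can be sketched in a few lines. This is a hypothetical illustration of the idea, not MarcEdit's actual code: records are modeled as simple dicts of tag to values, where the real tool works on MARC records and identifiers like 001 or 001|901$a.

```python
# Sketch of merging selected fields from a "merge" file into a "source"
# file using a shared identifier (match point). Records are modeled as
# dicts of MARC tag -> list of values; this is illustrative only.

def build_index(records, id_field):
    """Index records by their match-point value (e.g., the 001 field)."""
    index = {}
    for rec in records:
        for value in rec.get(id_field, []):
            index[value] = rec
    return index

def merge_files(source, merge, id_field, copy_fields):
    """Copy copy_fields from matching merge records into source records."""
    index = build_index(merge, id_field)
    for rec in source:
        for match_value in rec.get(id_field, []):
            other = index.get(match_value)
            if other:
                for tag in copy_fields:
                    # "unique data" mode: only add values not already present
                    existing = set(rec.get(tag, []))
                    for v in other.get(tag, []):
                        if v not in existing:
                            rec.setdefault(tag, []).append(v)
    return source
```

For example, merging a 650 field into a record matched on 001 would add the subject heading only if the source record does not already carry it, mirroring the "unique data" merge option described on slide 4.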
10. How MarcEdit Works with Alma
• https://developers.exlibrisgroup.com/alma/apis/bibs
• Because the API is rate-limited (i.e., you can only process so many transactions concurrently, and all
Alma operations use the API), MarcEdit limits API processing to a single thread. This takes a little
longer, but it eliminates the possibility that using MarcEdit to automate workflows will bring down
your system because the tool is communicating with it too quickly.
MarcEdit works through the following API endpoints. Through this API, MarcEdit can:
• Edit holdings data (and holdings records)
• Create and update bibliographic data
• Extract records
◦ Though discovery should be done via Z39.50 or SRU (SRU is preferred)
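The single-threaded approach described above can be sketched as a sequential loop with a pause between calls. This is an illustration of the idea, not MarcEdit's implementation; the endpoint path in the comment follows the Alma Bibs API documentation linked above, and the function names here are my own.

```python
import time

# Illustrative sketch: issue API requests one at a time, with an optional
# pause, so a batch job cannot flood a rate-limited API.

def process_serially(mms_ids, fetch, min_interval=0.5):
    """Call fetch(mms_id) for each id sequentially, pausing between calls."""
    results = []
    for i, mms_id in enumerate(mms_ids):
        if i and min_interval:
            time.sleep(min_interval)  # stay well under the rate limit
        results.append(fetch(mms_id))
    return results

# In real use, fetch would wrap an HTTP GET against the Alma Bibs API,
# e.g. /almaws/v1/bibs/{mms_id} with an API key (see the docs URL above).
```

The trade-off is exactly the one the slide names: a serial loop is slower than concurrent requests, but it bounds the load the tool can place on the system.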
11. Working with OCLC Connexion
https://youtu.be/a7Cen0gxFCw?list=PLrHRsJ91nVFScJLS91SWR5awtFfpewMWg
12. Call Number Classifications
Integrates with the OCLC Classify service
◦ Provides LC or Dewey numbers
◦ Processes data based on OCLC Number, ISBN, ISSN, or Author/Title pair
◦ Generates call numbers based on the most widely used call number
17. Cluster Format Support
MARC support
• The tool provides direct manipulation of MARC data, with clustering options to edit at the field or subfield level, and clustering criteria set at the field/subfield level
• MARC data processing supports any character encoding
Delimited Data Support
• New as of 3/2/2018 (on Windows/Linux; Mac coming)
• Allows for indexing by column
• The initial implementation limits editing to files with 100 or fewer columns
• Delimited formats supported:
• Tab delimited
• Comma delimited
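"Indexing by column" can be pictured as grouping rows by the value in a chosen column, which is the basis for clustering edits on delimited data. This is a minimal sketch of that idea using Python's standard csv module, not MarcEdit's code; the function name is my own.

```python
import csv
import io

# Sketch: group rows of a comma- or tab-delimited file by the value in
# one column, so all rows sharing a value can be edited together.

def index_by_column(text, column, delimiter=","):
    """Return {column value: [rows]} for the given delimited text."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    groups = {}
    for row in reader:
        groups.setdefault(row[column], []).append(row)
    return groups
```

Passing `delimiter="\t"` covers the tab-delimited case listed above.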
18. Why not just use OpenRefine?
If you know how to use OpenRefine, please use it
• OpenRefine provides a much wider and richer set of functionality, and it is designed specifically to help users deal with data issues found within largely unstructured data
But MARC is a pain
• While OpenRefine has some MARC importing tools, the tooling specific to MARC isn't particularly good or intuitive
• OpenRefine has some practical data (file size) limits when run on a desktop/client
MarcEdit's clustering support was built with catalogers in mind
• To make working with library formats easy
• To minimize the need to migrate data between different formats (and risk losing information in the conversion process)
19. Clustering in MarcEdit 7
MarcEdit's built-in clustering tools support native grouping and batch editing, and they work well on files of a million records or fewer (they can work on larger sets, but the larger the file, the longer the cluster operation takes).
20. Clustering Options
Clustering Algorithms
• Levenshtein Distance
• This algorithm is best for people, places, and subjects
• It builds clusters based on the number of position/character differences between words or phrases
• It is generally faster
• Composite Coefficient
• This algorithm is best for highly variable data where a great deal of fuzziness is desired
Token Types
• Default – normalized data tokens
• "Reese, Terry" and "Terry Reese" become reeseterry and terryreese
• Works best for spelling errors and form changes
• Fingerprint tokens
• "Reese, Terry" and "Terry Reese" both become reese terry
• Works best when you are unsure of the quality of your data
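The two token types described above, plus a Levenshtein distance for comparison, can be sketched as follows. This is my reading of the slide's examples, not MarcEdit's code; the function names are my own.

```python
# Sketch of the two token types and the Levenshtein distance measure.

def normalized_token(value):
    """Default mode: lowercase and strip punctuation/whitespace in place.
    'Reese, Terry' -> 'reeseterry'; 'Terry Reese' -> 'terryreese'."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

def fingerprint_token(value):
    """Fingerprint mode: split into words, normalize, sort, and rejoin,
    so 'Reese, Terry' and 'Terry Reese' both become 'reese terry'."""
    words = {normalized_token(w) for w in value.split()}
    return " ".join(sorted(w for w in words if w))

def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[len(b)]
```

Note how the default mode keeps "Reese, Terry" and "Terry Reese" in separate clusters (useful when word order is meaningful, as with inverted headings), while the fingerprint mode collapses them into one, which matches the slide's advice to use fingerprinting when you are unsure of the data's quality.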
21. Clustering Changes
Clustered changes are queued and stacked; changes are applied once all edits have been set.
Clustered changes can be made by group, across groups, or to selected items within a group.