Topics:
* Merging records
* Building clustering tools
* Moving MARC data in and out of OpenRefine
* Integrations (OCLC and Alma)
Recording on Youtube: https://youtu.be/2pPru42ShqY
3. Topics
This may be a bit of an eclectic session, as I'm interested
in covering a range of topics that have come up during
this period of remote work.
Merging records
Integrations (OCLC and Alma)
Building clustering tools
Moving MARC data in and out of OpenRefine
4. Merge Records Tool
Allows users to merge MARC data from two files
Allows users to merge unique data, selected data, and all data.
5. Merge Records Tool
The Merge Records tool functions in two modes:
1. Merging data across two files
◦ Use Case: Merging one or more fields from one file into another based on a match point
2. Merging duplicate data into a single record set
◦ Use Case: The source file contains multiple duplicates or versions of records. The goal is to merge all like records into one using a match point.
6. Merging Records
Merging data across two files
◦ This mode is activated when the Source File and the Merge File are different.
7. Merging Records
Merging data: consolidating data in the same file
◦ This mode is activated when the Source File and the Merge File are the same.
8. Merging Records
How merging occurs
◦ Matches occur by setting the Identifier
◦ The Identifier is either a field or a field + subfield
◦ Merges can use multiple fields, delimited by a pipe. For example: 001|901$a
◦ What if there is no common match point?
◦ You can use the MARC21 Option
◦ The Unicode Encoded option changes how field data is evaluated. By default, values are evaluated as binary data, but with this option selected, data is evaluated as characters. This is important when working with multibyte languages.
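The match-point logic above can be sketched in a few lines. This is a hypothetical illustration of the idea, not MarcEdit's actual code: records are modeled as simple dicts of tag to values, where the real tool works on MARC records and identifiers like 001 or 001|901$a.

```python
# Sketch of merging selected fields from a "merge" file into a "source"
# file using a shared identifier (match point). Records are modeled as
# dicts of MARC tag -> list of values; this is illustrative only.

def build_index(records, id_field):
    """Index records by their match-point value (e.g., the 001 field)."""
    index = {}
    for rec in records:
        for value in rec.get(id_field, []):
            index[value] = rec
    return index

def merge_files(source, merge, id_field, copy_fields):
    """Copy copy_fields from matching merge records into source records."""
    index = build_index(merge, id_field)
    for rec in source:
        for match_value in rec.get(id_field, []):
            other = index.get(match_value)
            if other:
                for tag in copy_fields:
                    # "unique data" mode: only add values not already present
                    existing = set(rec.get(tag, []))
                    for v in other.get(tag, []):
                        if v not in existing:
                            rec.setdefault(tag, []).append(v)
    return source
```

For example, merging a 650 field into a record matched on 001 would add the subject heading only if the source record does not already carry it, mirroring the "unique data" merge option described on slide 4.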
10. How MarcEdit Works with Alma
• https://developers.exlibrisgroup.com/alma/apis/bibs
• Because the API is rate-limited (i.e., you can only process so many transactions concurrently, and all
Alma operations use the API), MarcEdit limits API processing to a single thread. This takes a little
longer, but it eliminates the possibility that using MarcEdit to automate workflows will bring down
your system because the tool is communicating with it too quickly.
MarcEdit works through the following API endpoints. Through this API, MarcEdit can:
• Edit holdings data (and holdings records)
• Create and update bibliographic data
• Extract records
◦ Though discovery should be done via Z39.50 or SRU (SRU is preferred)
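The single-threaded approach described above can be sketched as a sequential loop with a pause between calls. This is an illustration of the idea, not MarcEdit's implementation; the endpoint path in the comment follows the Alma Bibs API documentation linked above, and the function names here are my own.

```python
import time

# Illustrative sketch: issue API requests one at a time, with an optional
# pause, so a batch job cannot flood a rate-limited API.

def process_serially(mms_ids, fetch, min_interval=0.5):
    """Call fetch(mms_id) for each id sequentially, pausing between calls."""
    results = []
    for i, mms_id in enumerate(mms_ids):
        if i and min_interval:
            time.sleep(min_interval)  # stay well under the rate limit
        results.append(fetch(mms_id))
    return results

# In real use, fetch would wrap an HTTP GET against the Alma Bibs API,
# e.g. /almaws/v1/bibs/{mms_id} with an API key (see the docs URL above).
```

The trade-off is exactly the one the slide names: a serial loop is slower than concurrent requests, but it bounds the load the tool can place on the system.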
11. Working with OCLC Connexion
https://youtu.be/a7Cen0gxFCw?list=PLrHRsJ91nVFScJLS91SWR5awtFfpewMWg
12. Call Number Classifications
Integrates with the OCLC Classify service
◦ Provides LC or Dewey numbers
◦ Processes data based on OCLC Number, ISBN, ISSN, or Author/Title pair
◦ Generates call numbers based on the most widely used call number
17. Cluster Format Support
MARC support
• The tool provides direct manipulation of MARC data, with clustering options to edit at the field or subfield level, and clustering criteria set at the field/subfield level
• MARC data processing supports any character encoding
Delimited Data Support
• New as of 3/2/2018 (on Windows/Linux; Mac coming)
• Allows for indexing by column
• The initial implementation limits editing to files with 100 or fewer columns
• Delimited formats supported:
• Tab delimited
• Comma delimited
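"Indexing by column" can be pictured as grouping rows by the value in a chosen column, which is the basis for clustering edits on delimited data. This is a minimal sketch of that idea using Python's standard csv module, not MarcEdit's code; the function name is my own.

```python
import csv
import io

# Sketch: group rows of a comma- or tab-delimited file by the value in
# one column, so all rows sharing a value can be edited together.

def index_by_column(text, column, delimiter=","):
    """Return {column value: [rows]} for the given delimited text."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    groups = {}
    for row in reader:
        groups.setdefault(row[column], []).append(row)
    return groups
```

Passing `delimiter="\t"` covers the tab-delimited case listed above.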
18. Why not just use OpenRefine?
If you know how to use OpenRefine, please use it
• OpenRefine provides a much wider and richer set of functionality, and it is designed specifically to help users deal with data issues found within largely unstructured data
But MARC is a pain
• While OpenRefine has some MARC importing tools, the tooling specific to MARC isn't particularly good or intuitive
• OpenRefine has some practical data (file size) limits when run on a desktop/client
MarcEdit's clustering support was built with catalogers in mind
• To make working with library formats easy
• To minimize the need to migrate data between different formats (and risk losing information in the conversion process)
19. Clustering in MarcEdit 7
MarcEdit's built-in clustering tools support native grouping and batch editing, and they work well on files of a million records or fewer (they can work on larger sets, but the larger the file, the longer the cluster operation takes).
20. Clustering Options
Clustering Algorithms
• Levenshtein Distance
• This algorithm is best for people, places, and subjects
• It builds clusters based on the number of position/character differences between words or phrases
• It is generally faster
• Composite Coefficient
• This algorithm is best for highly variable data where a great deal of fuzziness is desired
Token Types
• Default – normalized data tokens
• "Reese, Terry" and "Terry Reese" become reeseterry and terryreese
• Works best for spelling errors and form changes
• Fingerprint tokens
• "Reese, Terry" and "Terry Reese" both become reese terry
• Works best when you are unsure of the quality of your data
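The two token types described above, plus a Levenshtein distance for comparison, can be sketched as follows. This is my reading of the slide's examples, not MarcEdit's code; the function names are my own.

```python
# Sketch of the two token types and the Levenshtein distance measure.

def normalized_token(value):
    """Default mode: lowercase and strip punctuation/whitespace in place.
    'Reese, Terry' -> 'reeseterry'; 'Terry Reese' -> 'terryreese'."""
    return "".join(ch for ch in value.lower() if ch.isalnum())

def fingerprint_token(value):
    """Fingerprint mode: split into words, normalize, sort, and rejoin,
    so 'Reese, Terry' and 'Terry Reese' both become 'reese terry'."""
    words = {normalized_token(w) for w in value.split()}
    return " ".join(sorted(w for w in words if w))

def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[len(b)]
```

Note how the default mode keeps "Reese, Terry" and "Terry Reese" in separate clusters (useful when word order is meaningful, as with inverted headings), while the fingerprint mode collapses them into one, which matches the slide's advice to use fingerprinting when you are unsure of the data's quality.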
21. Clustering Changes
Clustered changes are queued and stacked; changes are applied once all edits have been set.
Clustered changes can be made by group, across groups, or to selected items within a group.