Slides from the NASIG 2018 Preconference

The Now and
Future MarcEdit: A
Day-long Workshop
TERRY REESE
THE OHIO STATE
UNIVERSITY
REESET@GMAIL.COM

Workshop format
Let’s find out what you want to learn
Let’s talk about changes to MarcEdit 7
Oh, remember that we all learn differently -- so please follow along, participate, and engage in a way
that makes the most sense for you
◦ I have examples, and I’m inviting folks to practice with me

Schedule
8-10: Content
10-10:15: Break
10:15-noon Content
Noon – 1: Lunch
1-2:30: Content
2:30-2:45: Break
2:45 – 4:15: Content
4:15-5:00: Questions

Questions from
Attendees
Topics folks were interested in learning about:
◦ Fonts
◦ Unicode Processing
◦ RDA Processing
◦ Automation (more info about tasks)
◦ XML Processing
◦ Regular Expressions
◦ New MarcEdit 7 Features
◦ Merging Data from one record to another
Any other topics folks might be interested in?

Important notes
Installation notes
• Download from: http://marcedit.reeset.net/downloads/
System Requirements MarcEdit 7.x
• Any version of Windows that supports .NET 4.6.1+
• Fully supported on Linux (using Mono)
• Mac Version (native)
Upgrade/Support
• No regular update cycle -- but generally, at least one update per month
• I answer nearly every question I get about MarcEdit.
• MarcEdit Listserv (http://metis3.gmu.edu/cgi-bin/wa?A0=MARCEDIT-L) is available for questions.

THE ROADMAP
Some important changes coming to MarcEdit 7/MarcEdit Mac
◦ Continued deeper integrations with ILS vendors and OCLC
◦ Expansion of the Clustering toolset to enable more light-weight OpenRefine
functionality, like ingest from external web sources
◦ Continued parity between the Windows/MacOS versions
◦ Continued local help expansion
◦ Continued sharing options within the community (sharing expressions,
expertise)
◦ Expansion of Excel processing/creation
◦ Considering iOS support through the new MacOS bridge (though this
would limit MacOS support to High Sierra and Mojave…so I need to do
more research)

MarcEdit Evolution
MarcEdit 1.0-2.0 Main Window MarcEdit MARC Tools 1.0-2.0
MarcEdit 1.0-2.0 MarcEditor

MarcEdit @ 20
MarcEdit 7.x: Windows MarcEdit 3.x: MacOS

Today
MarcEdit is used almost everywhere
Is available for use on MacOS (10.10+), Linux, and
Windows (Vista+)
• Most Active User Community
• Windows Users: ~20,000
• MacOS Users: ~1,000
• Linux Users: ~150

MarcEdit 7
highlights
Light-weight clustering has
been added directly into
the program
New way to process
XML/JSON data
A new linked data engine,
with support for locally
defined RDF vocabularies in
reconciliation
New task processing
Consolidated Z39.50/SRU
client
Added Editing Functions
•New Add/Delete Field Tools
(deduplication)
•Expanded Regular Expression
options
•Updated OCLC Integrations
Integrated Help

Today’s topics
Quick overview of
MarcEdit 7 Changes
Explore MarcEdit 7’s
new Clustering
Functionality
Working with non-
MARC data using
known and unknown
metadata formats
Explore MarcEdit 7’s
Linked Data Platform
Improved Task
Processing
Integration
opportunities with
• Alma or other ILS Systems
• OCLC
• Connexion

Today’s topics
MARC Specific
Tools
User
Questions (not
answered)
Regular
Expressions

Project Fluffy Muppet -- MarcEdit 7!
MarcEdit 7 was released over the U.S. Thanksgiving Holiday
The release:
1. Has been in development for close to 9 months over about 850 hours of development with ~20 testers
in 7 countries using 4 different MARC flavors providing direct feedback
2. Touched nearly every part of the program – when finished, the release updated a shade under
350,000 lines of code
3. Was tested against almost 20 million records
4. Is the first version of MarcEdit designed with Accessibility in mind
5. Is fast (I’m going to show you a few places where)

Let’s look at
what’s new
•Accessibility
•MarcEdit 7 includes an improved
font/sizing engine for improved
layout on different screen sizes
and resolutions
•All images are tagged with text
and accessible via screen
readers or the operating
system’s accessibility tooling
•Availability of themes, to allow
you to customize windowing and
contrast to ease eye strain
•Keyboard shortcuts (everywhere)
•Sound cues
•Window transparency
User-Centered Design

Let’s look at what’s new
More International
◦ MarcEdit 7 uses an intelligent machine translation service, providing an interface in close to 26
languages at this point
It’s Faster
◦ Lists have been virtualized (lower overhead)
◦ Pages load quicker
◦ Tasks have been super-charged
It’s leaner – in part because Windows XP support is no longer provided

Let’s look at what’s new
Program is easier to manage
◦ The program has 4 installation modes
◦ 32-bit Administrator and non-Administrator installation modes
◦ 64-bit Administrator and non-Administrator installation modes
◦ How do I choose?
◦ Depends on your needs:
◦ http://marcedit.reeset.net/downloads

MarcEdit 7’s new installer agent
Welcome to Project Fluffy Muppet, your friendly (and sometimes helpful) installation agent
◦ Hazel is there to help highlight important options, and make sure you can work with Unicode data by
making sure you have a Unicode font.
◦ Helps users migrate data from previous versions
◦ Set important options
◦ Is Fun!

MarcEdit Language Preferences
MarcEdit allows you to set your preferred font for use with the User
Interface.
◦ *Important Note for Windows 10/Office 2016 Users*
◦ Microsoft no longer provides the Arial Unicode MS font. This is the font MarcEdit
targets by default due to its coverage. As of Aug. 2016, I’m recommending users
download the Noto fonts. These cover almost twice as many characters as the
Microsoft Arial Unicode font, and are free and open. You can read more about this here:
http://marcedit.reeset.net/replacement-unicode-fonts
MarcEdit allows you to set your preferred font size for use within the
program.

Getting Help when you need it
ADDITION OF DYNAMIC HELP TOPICS

THE OMNIBAR – SEARCH FOR HELP

TROUBLESHOOTING ISSUES

AND GETTING HELP

Task Processing changes
Introduction of a Task Broker
Introduction of Task Debugger
New Functions can be added to tasks
Ability to add Comments
Ability to turn off Task Broker

Let’s talk about
task changes
How they worked in MarcEdit 6

Task Changes
HOW THEY WORK IN MARCEDIT 7

What is the Task Broker
Task Broker was developed as a task evaluator
◦ For large tasks, the biggest performance hits were
◦ Tasks running on content when no changes were applicable
◦ Reading and writing files multiple times
Task Broker Enables Editing in Multiple Modes
◦ Mode 1: By File (legacy method)
◦ Mode 2: By Record (best for high volume tasks)
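To make the difference between the two modes concrete, here is a minimal Python sketch. It is not MarcEdit's implementation: the `Task` shape, the `run_by_file` and `run_by_record` functions, and the two sample tasks are all hypothetical, and "records" are simplified to single strings. It only illustrates why one pass per record avoids re-reading the whole file once per task.

```python
from typing import Callable, List, Tuple

Record = str
# Hypothetical task shape: (does this task apply?, how to edit the record)
Task = Tuple[Callable[[Record], bool], Callable[[Record], Record]]


def run_by_file(records: List[Record], tasks: List[Task]) -> Tuple[List[Record], int]:
    """Legacy-style processing: each task makes a full pass over the file."""
    passes = 0
    for applies, edit in tasks:
        records = [edit(r) if applies(r) else r for r in records]
        passes += 1
    return records, passes


def run_by_record(records: List[Record], tasks: List[Task]) -> Tuple[List[Record], int]:
    """Broker-style processing: one pass, every task is evaluated per record."""
    edited = []
    for record in records:
        for applies, edit in tasks:
            if applies(record):
                record = edit(record)
        edited.append(record)
    return edited, 1


# Two made-up tasks, for illustration only.
TASKS: List[Task] = [
    (lambda r: r.startswith("=500") and not r.endswith("."), lambda r: r + "."),
    (lambda r: "video disc" in r, lambda r: r.replace("video disc", "videodisc")),
]
```

Both functions return the same edited data; the second value is the number of passes over the file, which grows with the task count only in the by-file version.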

When you might override the task broker
Tasks with over 70 items, where nearly all 70 items will edit every record – By File will be faster
Tasks whose behavior differs between MarcEdit 6 and 7
◦ When this happens, it sometimes means your task isn’t constructed correctly; MarcEdit 7 is very strict
for the sake of processing and performance
◦ However, By File will put you back into the legacy mode, which should allow most older tasks to work in
cases where there are small incompatibilities
Are there many incompatibilities?
◦ There shouldn’t be. Mostly, these show up when users didn’t enter data and the interface in the older
method would automatically put default information in for the user. That doesn’t happen in the by
record approach.

So what does that mean to me?
In MarcEdit 6, the optimal task size was ~20 operations or lower. Once the operation count
began to get higher, the time that it would take to process data would become exponentially
slower. In MarcEdit 7, that performance line actually goes the other direction. The tool
processes records faster, and handles more records per second, the more task actions
completed.
Real-World Example
A library in Greece has a task list with over 1000 task actions. They would use this task to clean up
large portions of their database in one pass. Generally, this would mean processing ~300,000-
500,000 records at a time. In MarcEdit 6, this process would take as many as 10 hours to
complete. Using the MarcEdit 7 task processing, this process now takes less than 20 minutes.

But is it really faster?
Let’s compare processing using one record, but with a task list that uses north of 100 task
actions in MarcEdit 6 and MarcEdit 7.

Other
comparisons
•Comparing the
Extract Selected
Records Tooling
Virtual
Lists
•Loading large
data files
Loading
Files

MarcEdit 7 continues growing
Near term planned additions
◦ Continuing to wrap up some features in the MarcEdit MacOS 3 update
◦ Currently, the only missing functionality is related to some pre- and post-processing of tasks when using ILS integrations
◦ New XML Profiling Wizard
◦ New plugins for individual record creation templates
◦ Support for HDT and linked data fragments (this is awesome stuff)
◦ Additional clustering algorithms

Addressing Unicode Normalization Fragmentation
Myth: Once our data is in UTF8, most of our encoding problems will be solved.
Reality is much more complicated. UTF8 comes in different normalizations, and those
normalizations impact how a character is represented. This has indexing implications, display
implications, and potentially, editing implications. And these problems can come from
anywhere. From OCLC (where search is normalized, but data isn’t), when editing (Operating
systems default to canonical normalizations), from data downloaded from other sources.

Addressing Unicode Normalization Fragmentation
How do I find these problems and why do I care?
Depending on the ILS system in use, changes in normalization can cause significant issues. For ILS
systems that assume a canonical approach to Unicode normalization, data represented in
decomposed form can impact display, almost always impacts indexing (i.e., the diacritic is lost in
indexing) and will almost always impact the ability to do global edits.
And you likely can’t see this problem until you run into it and look at the data under a hex editor.
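A small Python sketch of what this looks like at the byte level, using only the standard `unicodedata` module. The helper names `utf8_hex` and `same_after_nfc` are made up for this illustration; the point is that two strings can display identically, both be valid UTF-8, and still differ in their bytes.

```python
import unicodedata

composed = "\u00e9"     # "é" as one precomposed code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT


def utf8_hex(text: str) -> str:
    """Show the UTF-8 bytes, roughly what a hex editor would display."""
    return text.encode("utf-8").hex(" ")


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the composed (NFC) form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(utf8_hex(composed))    # c3 a9
print(utf8_hex(decomposed))  # 65 cc 81
print(composed == decomposed)                # False
print(same_after_nfc(composed, decomposed))  # True
```

An index or global edit that compares raw strings treats these as different values, which is why picking one normalization form and applying it consistently matters.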

Addressing
Unicode
Normalization
Fragmentation
How Does MarcEdit 7 Address this?
Through the Preferences:

Clustering in MarcEdit 7
How people clustered MARC data in the past
1. Export the fields considered for investigation into a tab-delimited format
2. Import into OpenRefine
3. Cluster the data
4. Make Edits
5. Export the delimited data out of OpenRefine
6. Develop a process to merge the changed data back into MARC
If you need to have your data start or end up in MARC, working with OpenRefine can be challenging
because there isn’t a natural process to move between these two formats

MarcEdit has
OpenRefine
Options
EXPORT AND IMPORT DATA FROM
MARC INTO FORMATS OPENREFINE
CAN READ AND UNDERSTAND

Cluster Format Support
MARC support
◦ The tool provides direct manipulation of MARC data, with clustering options to edit at the field, or
subfield level, with clustering criteria set at the field/subfield level
◦ MARC data process supports any character encoding
Delimited Data Support
◦ New as of 3/2/2018 (on Windows/Linux; Mac coming)
◦ Allows for indexing by column
◦ Initial implementation limits editing to files with 100 or fewer columns
◦ Delimited formats supported:
◦ Tab Delimited
◦ Comma Delimited

Why not just use OpenRefine?
If you know how to use OpenRefine, please use it
◦ OpenRefine provides a much wider and richer set of functionality, and is designed specifically to help users
deal with data issues found within largely unstructured data
But MARC is a pain
◦ While OpenRefine has some MARC importing tools, the tooling specific to MARC isn’t particularly
good or intuitive
◦ OpenRefine has some practical data (file size) limits when run on a desktop/client
MarcEdit’s clustering support was built with catalogers in mind
◦ To make working with library formats easy
◦ To minimize the need to migrate data between different formats (and risk losing information in the
conversion processes)

Clustering in
MarcEdit 7
MarcEdit’s built-in clustering tools
support native grouping and batch
editing, and work well on file sizes of
a million records and smaller (they can
work on larger sets, but the larger the
file, the longer the cluster operation
takes)

Clustering Options
Clustering Algorithms
◦ Levenshtein Distance
◦ This algorithm is best for people, places, and subjects
◦ This algorithm builds clusters based on the number of positions/character differences between words or phrases
◦ This algorithm is generally faster
◦ Composite Coefficient
◦ This algorithm is best for highly variable data where a great deal of fuzziness is desired.
Token Types
◦ Default – normalized data tokens
◦ Reese, Terry and Terry Reese become reeseterry and terryreese
◦ Works best for spelling errors, form changes
◦ Fingerprint tokens
◦ Reese, Terry and Terry Reese both become reese terry
◦ Works best when you are unsure of the quality of your data
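The token types and the distance measure above can be sketched in a few lines of Python. This is an illustration under assumptions, not MarcEdit's code: the function names are invented, the tokenizers only handle ASCII letters and digits, and the edit-distance routine is the textbook dynamic-programming version.

```python
import re


def normalized_token(value: str) -> str:
    """Default-style token: lowercase, with punctuation and spaces removed."""
    return re.sub(r"[^a-z0-9]", "", value.lower())


def fingerprint_token(value: str) -> str:
    """Fingerprint-style token: lowercase words, de-duplicated and sorted."""
    words = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(sorted(set(words)))


def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


print(normalized_token("Reese, Terry"))   # reeseterry
print(normalized_token("Terry Reese"))    # terryreese
print(fingerprint_token("Reese, Terry"))  # reese terry
print(fingerprint_token("Terry Reese"))   # reese terry
print(levenshtein("reeseterry", "reesetery"))  # 1
```

With normalized tokens, a misspelling such as "Reese, Tery" lands one edit away from "Reese, Terry" and can be clustered by distance; with fingerprint tokens, word order stops mattering, so inverted and direct forms of a name collapse to the same key.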

Clustering
Changes
Clustered changes are queued and
stacked. Changes happen once all
edits have been set.
Clustered changes can be made by
group, across groups, or selected
items within a group

Additional Clustering Enhancements
Clustering support works on non-MARC data as well
◦ Specifically, tab delimited formatted data
◦ Supports the extraction of data based on the clusters selected (rather than changing data)

Delimited Text
Translator
Updated Excel Support Module
◦ Windows version no longer
requires Excel to be installed
◦ Mac and Linux now include Excel
support for the First Time
◦ Removes common errors related
to Excel ODBC components no
longer being available after
updates
◦ Oftentimes will correct/protect
users from Excel’s predictive
typing (no scientific notation when
looking at ISBNs, etc.)

MARCValidator
Enhancements
Integration with the HexEditor
New Layout Options
◦ By Record
◦ By Error
New Rules file options (addition of
UserDefined Data for output)

Adding UserDefined data to output

Auto Catalog
A new experimental tool (you are the first to see and hear about it) that allows the user to take a
picture of the publication data in a book, or a catalog card, and generate a MARC record. The
tool will:
1. Currently, look up the info in the U.S. Library of Congress, or in WorldCat if you have the
metadata API enabled
2. If it cannot find a record, auto-generate one by OCR’ing your image and developing a
MARC21 record.
3. If you have a collection of catalog cards, process this data in batch

MARC Character Conversions
Supports moving between any known
Windows character set and MARC8.
Can be run from the Breaker/Maker – or as its
own standalone utility

Batch Record
Processor
Allows MarcEdit to process “lots” of
files.
Files can be processed against an
entire folder’s contents or by file type
Can utilize any built-in or derived XML
Function transformation

Merge Records Tool
Allows users to merge MARC data from two
files
Allows users to merge unique data, selected
data and all data.

Export Tab
Delimited
Records
Developed as a simple report
generating tool
Allows users to create spreadsheets
utilizing data found within the MARC
records.

Record Deduplication
MarcEdit provides a simple
dedup tool that can:
◦ Dedup on a defined control field
(any field)
◦ Dedup on a transaction field (or
using an additional transaction
field)
Output
◦ Removes all duplications and
saves the duplications to a file
◦ Prints just unique items within
the file (i.e., those without a
duplicate pair)

RDA Helper
I created the RDA Helper with the help of
PCC members 6 months prior to LC officially
transitioning to RDA
The purpose of the tool was to help facilitate
translation of pre-AACR2 and AACR2 records
to RDA
The tool has been gradually expanded to
handle not only conversions, but
conversions of 880 data pairs as well.

RDA Helper
RDA Helper has a number of Options
◦ You get to determine which fields the Helper processes when looking for abbreviation Expansion
◦ By default, fields checked are: =245$c, =260, 264, =300, =5
◦ Likewise, Abbreviations to expand can be found in the Edit List

RDA Helper
Special Instructions:
• 380 – Because this isn’t a controlled field, MarcEdit makes use of a genre list at
the Library of Congress. This means that these values can be more general
than if done by a cataloger.
• 260/264 – Handles many different forms of the field. When the tool is
set to always generate a symbol, the tool will utilize MARC8 or UTF8
encoding based on the data.
• Qualifying information – moves qualifiers into a $q. Example: 020
$a02312123 (electronic) to 020 $a02312123$qelectronic
• Process the 502 – converts a dissertation note into a delimited format.
Example: 502 $aThesis (M.A.)--University College, London, 1969. to: 502
$gThesis$bM.A.$cUniversity College, London$d1969.
• Generate GMD (works on AACR2 encoded data or RDA Encoded data)
• Abbreviation expansion can be customized (using regular expressions) and
fields where abbreviations are run can be customized.
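The two worked examples above (020 qualifiers and the 502 dissertation note) can be approximated with regular expressions. The Python sketch below is only an illustration: the patterns and the function names `move_qualifier` and `split_dissertation_note` are assumptions written for this example, not the rules the RDA Helper actually uses, and they handle only notes shaped like the examples on the slide.

```python
import re


def move_qualifier(field: str) -> str:
    """Turn '$a... (qualifier)' into '$a...$qqualifier'."""
    return re.sub(r"(\$a[^$(]*?)\s*\(([^)]*)\)", r"\1$q\2", field)


def split_dissertation_note(field: str) -> str:
    """Split 'Type (Degree)--Institution, Year.' into $g, $b, $c, $d."""
    pattern = r"\$a(.+?) \((.+?)\)--(.+), (\d{4})\.?$"
    return re.sub(pattern, r"$g\1$b\2$c\3$d\4.", field)


print(move_qualifier(r"=020  \\$a02312123 (electronic)"))
# =020  \\$a02312123$qelectronic

print(split_dissertation_note(
    r"=502  \\$aThesis (M.A.)--University College, London, 1969."))
# =502  \\$gThesis$bM.A.$cUniversity College, London$d1969.
```

Fields that do not fit the expected shape pass through unchanged, which is usually the safer default for batch edits.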

Don’t forget
MARC SQL Explorer
Merge Records
Dedup Records

MarcEdit:
crosswalking
design
MarcEdit model:
◦ So long as a schema has been mapped to MARCXML, any
metadata combination could be utilized. This means that no
more than two transformations will ever take place. Example:
MODS → MARCXML → EAD

MarcEdit
Crosswalking
model
[Diagram: MARC21XML as the hub format, connected to EAD, FGDC, MODS, MARC, and Dublin Core]

MarcEdit: Crosswalks
for everyone
What’s MarcEdit doing?
◦ Facilitates the crosswalk by:
1. Performing character translations
(MARC8-UTF8)
2. Facilitating interaction between binary
and XML formats.

XML Function Wizard
The wizard was created to help fill a gap – to enable metadata
crosswalking when a user doesn’t have a lot of expertise
building XSLT or XQuery transformations

OAI Harvesting
MarcEdit’s OAI Harvester can run in two modes
◦ User Initiated
◦ Scheduled
Let’s look at both!

OAI Harvesting – User Initiated
Harvesting supports the following verbs
◦ GetRecord
◦ ListRecords
◦ resumptionToken (strictly a request argument for paging through ListRecords results, rather than a verb)
Any metadataPrefix can be accommodated, but by
default, the tool has XSLT crosswalks for:
◦ MARCXML
◦ OAIMARC
◦ Dublin Core
◦ MODS
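For readers who have not seen OAI-PMH requests before, the sketch below builds the request URLs for the cases listed above using Python's standard `urllib.parse`. The base URL and record identifier are placeholders, and `oai_url` is a helper written for this example; it only constructs URLs and does not contact a server or reflect how MarcEdit's harvester is written.

```python
from urllib.parse import urlencode

BASE_URL = "https://example.org/oai"  # placeholder repository endpoint


def oai_url(base_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL from a verb and its arguments."""
    params = {"verb": verb}
    params.update(arguments)
    return base_url + "?" + urlencode(params)


# Fetch a single record in a given metadata format.
print(oai_url(BASE_URL, "GetRecord",
              identifier="oai:example.org:1234", metadataPrefix="oai_dc"))

# Start a harvest of all records in a given metadata format.
print(oai_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))

# Continue a harvest: the token from the previous response is sent on its
# own with the same verb, without repeating metadataPrefix.
print(oai_url(BASE_URL, "ListRecords", resumptionToken="abc123"))
```

A harvester repeats the last form until the server returns a response with an empty resumptionToken.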

OAI Harvesting -- Scheduled
Using scheduler on Windows, or cron on Linux,
or whatever the equivalent is on MacOS, you
can create Harvesting Jobs and schedule them
for regular harvest

Changes made to
1. Replace All
2. Add/Delete Fields
3. Task Management Changes
4. OAI Harvesting Changes

Replace All
Changes
Highlights:
◦ Conditionals can now be run in
multi-line mode
◦ NOT has been added as a
conditional
◦ Addition of an Exact Word Match

Add/Delete
Field Highlights
New layout putting options in groups
relative to their function
NOT functionality added to add field
Expansion of Remove Duplication data
or data by position
◦ Ability to protect field preference
on duplication

Task
Management
New to the Editor
◦ Ability to add comments
◦ Batch move items in the list
◦ Batch Delete task actions
◦ Batch copy task actions
◦ Batch paste task actions
◦ Print tasks
◦ Debug Tasks
◦ Dynamically generate new task
lists

OAI Harvester
Changes
Highlights
◦ New batch download mode
◦ Ability to change the user-agent
(to work around BePress filtering)
◦ Unicode Normalization inclusion
◦ Updated resumptionToken
processing (to handle encoding changes
related to DSpace)

Working with
Linked Data
In MarcEdit

Objects not strings
Probably the biggest reason people talk about linked data is the notion of moving from strings to
objects
Strings

Objects not strings
Probably the biggest reason people talk about linked data is the notion of moving from strings to
objects
Objects

Objects Not Strings
URIs provide actionable data
◦ Controlled terms can be updated without user intervention (generally)
◦ And URIs can provide access to more information
◦ i.e., a URI to VIAF provides access not just to author information, but to all their related works and collaborators as well.

So why aren’t
we doing this
already?
1
Changing Strings to Objects is
hard and expensive
2
We have some folks, like
OCLC, that could be in a
position to help us, but our
current systems are not set up
to use (and in some cases,
store) the data.
3
Many of our controlled
vocabularies are not designed
to support reconciliation
work
•And those that are aren’t
production ready
•Or – are proprietary

So what can we do right now?
A lot –
◦ Many of the large national services are making resources and infrastructure available to enable libraries
to begin doing this work
◦ OCLC has been largely supportive, and provides their own tools that output linked data content
◦ We can start lobbying our systems to not just store the data, but make use of the information when
provided
◦ We can start the reconciliation process (because this process takes time)

MarcEdit and Linked Data
MarcEdit 6 and 7 include a linked data platform -- this is an integration platform that enables
MarcEdit to work with various linked data services, and provides a way to build new services
around this functionality
◦ Designed to support RDF, JSON-LD, SPARQL – and a wide range of library specific services currently
providing one off access to controlled data
◦ The framework has been utilized in MarcEdit for the development of a toolset called MARCNext

MARCNext
These are Experimental services that allow catalogers to play with their data and visualize it
through the BibFrame lens – as well as begin the process of turning strings to objects.

Linked Data Tool
Linked data tool enables reconciliation services
Works from a rules file, which enables users to
customize the output provided
◦ MarcEdit 7 provides a rules file optimized for
MARC21, but I have rules files being tested for
a number of MARC formats (including
UNIMARC)
Currently supports the insertion of $0 and $1 into
bibliographic and authority data
Includes support from ~25 remote linked data
endpoints
Can use local rdf files as locally mounted SPARQL
stores
Allows for targeted, or automatic processing

Managing Linked Data Sources
All data is managed via a Rules File
◦ Which can be edited via the GUI
◦ OR as an XML file

Does this currently scale?
I get asked this question, because the Library of Congress actively throttles data requests made
against their service. So too do many other service providers. They have to; it’s a method of self
preservation. When I test-reconciled Ohio State’s entire database (~6 million records), I
estimated that I would end up making, on the low end, 48,000,000 requests, just to the Library of
Congress. Over a very short period, that’s a lot of requests, and can overwhelm their services.
However…
◦ I work closely with many of the large data providers, and they give MarcEdit some leeway because:
◦ MarcEdit follows some established patterns…in LC’s case, they can provide an HTTP status code that lets the application know that
their service is under load, and MarcEdit will start slowing down requests for a specified time period.
◦ MarcEdit does its own internal caching – this way an item is only retrieved once per reconciliation session. Using this method, I can
likely cut the number of requests to a service like LC by a third or more. In fact, the more data that’s processed, the faster it
goes and the fewer requests it has to make to the source vocabulary
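The two courtesies described above, per-session caching and slowing down when the service signals load, can be sketched as follows. Everything here is an assumption made for illustration: the `ThrottledLookup` class, the `fetch` callable returning a `(status_code, body)` pair, and the choice of 429 and 503 as the "under load" codes (the slide does not say which status code LC returns). It is not MarcEdit's implementation.

```python
import time
from typing import Callable, Dict, Tuple

BUSY_STATUS_CODES = (429, 503)  # assumed "service under load" signals


class ThrottledLookup:
    """Cache each term once per session and back off when the service is busy."""

    def __init__(self, fetch: Callable[[str], Tuple[int, str]],
                 base_delay: float = 1.0, max_attempts: int = 5,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.fetch = fetch            # hypothetical: term -> (status, body)
        self.base_delay = base_delay
        self.max_attempts = max_attempts
        self.sleep = sleep
        self.cache: Dict[str, str] = {}
        self.delay = 0.0
        self.requests_made = 0

    def lookup(self, term: str) -> str:
        if term in self.cache:        # already resolved this session
            return self.cache[term]
        for _ in range(self.max_attempts):
            if self.delay:
                self.sleep(self.delay)
            status, body = self.fetch(term)
            self.requests_made += 1
            if status in BUSY_STATUS_CODES:
                # Double the wait each time the service reports load.
                self.delay = max(self.base_delay, self.delay * 2)
                continue
            self.delay = 0.0
            self.cache[term] = body
            return body
        raise RuntimeError(f"service still busy after {self.max_attempts} attempts")
```

Because headings repeat heavily across a catalog, the cache hit rate climbs as more records are processed, which matches the observation that larger runs make proportionally fewer requests.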

We can
build new
services
USING LINKED DATA TOOLS FOR
HEADING VALIDATION

Validate Headings
How it works
◦ Working directly with the U.S. Library of Congress – MarcEdit queries the NACO and SACO headings
directly
◦ Returning information about URIs and variants/changes
◦ MarcEdit then generates a report, automatically corrects headings (when possible) and can generate
brief authority records or downloads the existing authority record

SPARQL Functionality
SPARQL Searches can:
◦ be run against remote servers
◦ Return multiple formats
◦ Save Results
◦ Be run against local repositories

Local SPARQL Query
Using LC Children’s Subject Headings
PREFIX ns0:<http://www.loc.gov/mads/rdf/v1#>
Select * where {
?name ns0:authoritativeLabel ?term .
Filter (lcase(str(?term)) = "dinosaurs")
}

Questions
Again – I would like to hear from you.
I’ve been working with members of the
PCC task force looking at how we embed
linked data into MARC records (and
outside of MARC records), and I’ve been
actively building these tools into
MarcEdit (both for research and
production).
How would you like to work with linked
data record sets in your library?
What could MarcEdit do to make this
easier for you?

ILS Integration
ILS Integration currently supports direct integration with
Koha, Alma, and a local option.
Are other integrations possible?
◦ http://blog.reeset.net/archives/2133

Let’s talk about ALMA
Integration

How MarcEdit Works with Alma
MarcEdit works through the following API endpoints:
◦ https://developers.exlibrisgroup.com/alma/apis/bibs
◦ Because the API is rate limited (i.e., you can only process so many transactions concurrently through the
API, and all Alma operations use the API), MarcEdit limits API processes to a single thread. It takes a
little longer, but eliminates the possibility that using MarcEdit to automate workflows will bring down
your system because the tool is trying to communicate with the system too quickly.
Through this API, MarcEdit can:
◦ Edit holdings data (and Holdings Records)
◦ Create and Update bibliographic data
◦ Extract Records
◦ Though discovery should be done via Z39.50 or SRU (which is preferred)

Working with OCLC Connexion
https://youtu.be/a7Cen0gxFCw?list=PLrHRsJ91nVFScJLS91SWR5awtFfpewMWg

Call Number
Classifications
Integrates with OCLC Classify Service
◦ Provides LC or Dewey Numbers
◦ Processes data based on OCLC
Number, ISBN, ISSN, Author/Title
Pair
◦ Generates Call Numbers based on
the most widely used call number

Working with
OCLC’s
Metadata API
MARCEDIT CAN WORK DIRECTLY
WITH WORLDCAT VIA THE
METADATA API.

MarcEdit: Batch WorldCat Holdings Management

MarcEdit: Batch Bibliographic Record Upload

More
Information
OCLC’s Developer Network:
◦ http://oclc.org/developer/
OCLC Metadata API Documentation:
◦ http://oclc.org/developer/services/worldcat-metadata-api
Notes on MarcEdit Integration:
◦ http://blog.reeset.net/archives/1245
C# OCLC API Library
◦ https://github.com/reeset/oclc_api

MarcEdit Regular
Expression Primer

MarcEdit Regular Expression Support
Functions that presently support regular expressions
◦ Delete Field
◦ Edit Field
◦ Copy Field
◦ Swap Field
◦ Build New Field
◦ Validation
◦ Extract/Delete Selected Records

Expression Scope
Deciding which function to use depends on the scope of data needing to be evaluated
◦ Add/Delete Field – Regular Expressions have access to the entire field (from the “=” to the end of line
(eol))
◦ Edit Subfield – Regular Expressions have access to the subfield code, to the end of the subfield
◦ Edit Field – Regular Expression has access to all subfield data, but *not* indicator data
◦ Edit Indicators – only access to indicator data
◦ Copy Field – Regular Expression has access to indicator data + all subfield data
◦ Replace Function – Regular Expression has access to all record data

Microsoft’s Regular Expression language
Concepts:
◦ Character escapes
◦ Anchors
◦ Character classes
◦ Grouping
◦ Quantifiers
◦ Substitutions
Let’s open Regular Expression Language - Quick Reference.html or
https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx

How we use Regular Expressions in
MarcEdit
Your most important parts of the regular expression language are:
1. Character escapes: \d \w \r \n \$
2. Character Classes [] & [^]
3. Grouping Elements ()
4. Anchors: ^$
5. Quantifiers: *?+{#}
6. Substitutions: $#

Examples
Looking at regex_example.mrk using the replace function:
◦ Add a period to the 500 if it is missing
◦ Update the 300 to reflect electronic information
◦ Split the 856 into two fields, breaking on the $u.

Examples 1
◦ Add a period to the 500 if it is missing
◦ Find What: (=500  ..)(.*[^\W.]$)
◦ Replace With: $1$2.
Explanation:
◦ (=500  ..)
◦ Searches for the 500 field. We leave two blanks because there are always 2 blank characters as part of the mnemonic format. The
two periods stand for any character. If you want to search for exact indicators, place those values rather than the
periods.
◦ (.*[^\W.]$)
◦ Take any characters, and match on a field where the last character in the field isn’t a period.
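To try this expression outside MarcEdit, here is a rough Python equivalent. Python's `re` module is a different engine from the Microsoft one MarcEdit uses, so two things are adapted and should be read as assumptions: the replacement `$1$2.` is written `\1\2.`, and `re.MULTILINE` stands in for MarcEdit evaluating one field at a time. The pattern assumes the mnemonic format has two spaces after the tag and that `\W` carries its usual backslash.

```python
import re

# Tag, two spaces, two indicator characters; then a field whose last
# character is a word character (so not a period or other punctuation).
ADD_PERIOD = re.compile(r"(=500  ..)(.*[^\W.]$)", re.MULTILINE)


def add_final_period(mrk_text: str) -> str:
    """Append a period to 500 fields that end in a letter, digit, or underscore."""
    return ADD_PERIOD.sub(r"\1\2.", mrk_text)


sample = "\n".join([
    r"=245  10$aExample title",
    r"=500  \\$aIncludes index",
    r"=500  \\$aAlready punctuated.",
])
print(add_final_period(sample))
```

Only the second line changes: the 245 is not a 500, and the last 500 already ends in a period. Note that a 500 ending in other punctuation, such as a closing parenthesis, is also left alone by this pattern.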

Examples 2
Add online resource information to the 300 field
Example:
◦ Change: 300 $a 32 p.
◦ To: 300 $a1 online resource (32 p.)
Explanation:
◦ (=300  ..)
◦ Searches for the 300 field. We leave two blanks because there are always 2 blank characters as part of the mnemonic format. The
two periods stand for any character. If you want to search for exact indicators, place those values rather than the
periods.
◦ (?<one>\$a)([^$]*)
◦ Capture the $a and then all data in the subfield until you get to the next subfield (if there is one)

Example 3
Split the 856 into two fields, breaking on the $u.
◦ Find What: (=856.{4})(\$u.*[^$])(\$u.*)
◦ (=856.{4})
◦ Matches the 856 field
◦ (\$u.*[^$])
◦ Match $u, but stop at the end of the subfield
◦ (\$u.*)
◦ Match remainder of field
◦ Replace With: $1$2\n=856  41$3
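A rough Python rendering of the 856 split, for experimenting outside MarcEdit. As assumptions: the `$u` delimiters are escaped as `\$u`, the replacement uses Python's `\1`-style references, and the newline in the replacement is written `\n`. The sample URLs are placeholders.

```python
import re

# Group 1: tag plus four characters (two spaces, two indicators).
# Group 2: the first $u subfield. Group 3: the second $u and anything after it.
SPLIT_856 = re.compile(r"(=856.{4})(\$u.*[^$])(\$u.*)")


def split_856(field: str) -> str:
    """Move the second $u into its own 856 field on a new line."""
    return SPLIT_856.sub(r"\1\2\n=856  41\3", field)


print(split_856(r"=856  41$uhttp://a.example$uhttp://b.example"))
# =856  41$uhttp://a.example
# =856  41$uhttp://b.example
```

The `.*` in the second group is greedy, so with three or more `$u` subfields the split happens at the last one; running the replacement repeatedly would be one way to peel them off one at a time.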

lcase/ucase
MarcEdit’s regular expression engine includes two extension functions for dealing with case
switching of characters.
◦ lcase & ucase
◦ Usage: (=450.{4})(\$a.)(.*)
◦ $1$2lcase($3)
◦ Example: Find the 500 with all upper case characters and convert the case of all values but the first
letter in the sentence to lower case.

Example (lcase)
Find the 500 with all upper case characters and convert the case of all values but the first letter
in the sentence to lower case.
◦ Find What: (=500.{4})(\$a.)([A-Z .]*)
◦ Replace With: $1$2lcase($3)
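`lcase` and `ucase` are MarcEdit extensions, so other regex engines have no direct equivalent inside the replacement string. In Python, the same effect can be sketched by passing a function as the replacement. The pattern below assumes the `$a` is escaped as `\$a`, and the function name is made up for this example.

```python
import re

# Group 1: tag plus four characters. Group 2: "$a" and the first letter.
# Group 3: the rest of the text, limited to capitals, spaces, and periods.
UPPER_NOTE = re.compile(r"(=500.{4})(\$a.)([A-Z .]*)")


def lowercase_after_first_letter(field: str) -> str:
    """Keep the first letter of the note as-is and lowercase the remainder."""
    return UPPER_NOTE.sub(
        lambda m: m.group(1) + m.group(2) + m.group(3).lower(),
        field,
    )


print(lowercase_after_first_letter(r"=500  \\$aTHIS IS A NOTE."))
# =500  \\$aThis is a note.
```

As with the MarcEdit expression, proper nouns inside the note are lowercased too, so the output usually needs a quick review.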

Multi-Field Replacements
By default, MarcEdit handles one field at a time when doing regular expressions.
◦ However, when you need to do evaluations against multiple fields, you can by adding /m to the end of
your replacement in the Replace Function in the MarcEditor
◦ This is a special function added to the MarcEdit regular expression engine

Example
Using regex_example.mrk
Changing video disc to Blu-ray in the 300 if the 538 is marked as Blu-ray

Placeholder
Are there specific editing tasks that folks are interested in?
We can talk about these now
