Large Scale Data
Clean-ups & Challenges
for the Library
Ksenija Mincic-Obradovic
Asia Pacific Metadata Advisory Board Meeting
3-4 August 2014
Pattaya, Thailand
“Data cleaning is considered as a main
challenge in the era of big data, due to the
increasing volume, velocity and variety of
data.”
(Tang, 2014)
Data cleaning process:
• Identifying data errors
• Repairing data errors
• Preventing data errors
In LIBRARIES,
we might have to clean up data to:
• Remove ceased e-titles
• Update changed URLs
• Enable DDA/PDA purchasing
• Perform gap analysis
• Enable system migrations
• Enable system integrations
• Improve display in ILS
Main types of mistakes in e-book
records
• MARC21 errors
– E.g.: coding, wrong indicators, wrong characters…
– Consequence: wrong indexing, records rejected…
• Wrong identifiers
– 001, 010, 020/022, 035, 856 $z
– Consequence: wrong matching, duplicates…
• Mistakes in description fields
– E.g.: wrong title, wrong author…
– Consequence: bad display, faceting doesn’t work…
• Lack of URLs
– Consequence: e-book cannot be accessed
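The four types of mistakes above can be screened for automatically. The following is a minimal Python sketch, not the behaviour of any particular tool: it assumes each record has already been parsed into a simplified list of fields (dicts with hypothetical keys "tag", "ind" and "sub"), and it treats only digits and blanks as valid indicator characters.

```python
import re

def isbn_is_valid(raw):
    """True if the first token of raw is an ISBN-10 or ISBN-13 with a correct check digit."""
    parts = raw.split()
    s = parts[0].replace("-", "").upper() if parts else ""
    if re.fullmatch(r"\d{9}[\dX]", s):
        # ISBN-10: weights 10..1, total must be divisible by 11 ("X" counts as 10)
        total = sum((10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(s))
        return total % 11 == 0
    if re.fullmatch(r"\d{13}", s):
        # ISBN-13: alternating weights 1 and 3, total must be divisible by 10
        total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(s))
        return total % 10 == 0
    return False

def find_problems(record):
    """Return a list of problem labels for one simplified record (assumed layout)."""
    problems = []
    for field in record:
        tag, ind, sub = field["tag"], field.get("ind", "  "), field.get("sub", {})
        # Control fields (001-009) carry no indicators, so only check 010 and above
        if tag >= "010" and not re.fullmatch(r"[0-9 ]{2}", ind):
            problems.append(f"{tag}: invalid indicators {ind!r}")
        if tag == "020" and "a" in sub and not isbn_is_valid(sub["a"]):
            problems.append(f"020: invalid ISBN {sub['a']!r}")
    if not any(f["tag"] == "245" and f.get("sub", {}).get("a", "").strip() for f in record):
        problems.append("245: missing title")
    if not any(f["tag"] == "856" and f.get("sub", {}).get("u", "").strip() for f in record):
        problems.append("856: missing URL")
    return problems
```

A record that returns an empty list has passed these particular checks only; it may still contain errors that a full MARC21 validator would catch.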
Example 1:
Fixing MARC21 errors in
vendor/publisher files with
e-book records
• Use programmes such as MARC Report and
MarcEdit to identify errors
• Use MARC Global and MarcEdit to fix data
• Load the file into the local catalogue
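The fix step amounts to applying one change to every record in a file. As an illustration only, here is a minimal Python sketch of such a batch edit, using a changed URL prefix as the example. It does not use or reproduce any of the tools named above; the record layout (lists of dicts with "tag" and "sub" keys) and the host names in the usage note are assumptions.

```python
def rewrite_urls(records, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix at the start of every 856 $u.

    Edits the records in place and returns the number of URLs changed,
    so the count can be compared with what was expected before loading.
    """
    changed = 0
    for record in records:
        for field in record:
            if field["tag"] != "856":
                continue
            url = field.get("sub", {}).get("u", "")
            if url.startswith(old_prefix):
                field["sub"]["u"] = new_prefix + url[len(old_prefix):]
                changed += 1
    return changed
```

For example, rewrite_urls(records, "http://old.example.com/", "https://new.example.com/") would update only the links that start with the old (hypothetical) host and leave all others untouched.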
Example 2:
Updating the NUC
(National Union Catalogue)
• New Zealand national level project
• Started in 2008
• Automated reporting of changes in library
holdings (additions and deletions) to
the NUC
• Using OSMOSIS, a software tool developed by
TMQ (Fla)
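Reporting additions and deletions comes down to comparing two snapshots of the library's holdings. The Python sketch below shows that comparison in its simplest form; it is not how OSMOSIS works internally, and the function name and the use of record identifiers as plain strings are assumptions made for illustration.

```python
def holdings_changes(previous_ids, current_ids):
    """Compare two snapshots of record identifiers.

    Returns the identifiers to report as additions (new since the last
    snapshot) and as deletions (no longer held), each in sorted order.
    """
    previous, current = set(previous_ids), set(current_ids)
    return {
        "additions": sorted(current - previous),
        "deletions": sorted(previous - current),
    }
```

Identifiers present in both snapshots appear in neither list, so only the changes need to be sent.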
Identifiers for
Matching Bibliographic Data
• 001 - Control Number
• 010 - Library of Congress Control Number
• 020/022 - ISBN/ISSN
• 035 - System Control Number
• 856 $z - Public Note (e.g. "e-book SpringerLink")
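Matching on identifiers only works if the same identifier is written the same way in both records. As a hedged Python sketch, the code below normalises ISBNs from 020 $a to 13 digits and treats two records as a match when they share a normalised ISBN or an 035 $a value. The record layout and function names are assumptions, and 001, 010, 022 and 856 $z are left out for brevity.

```python
import re

def to_isbn13(raw):
    """Normalise the first token of raw to a 13-digit ISBN string, or return None."""
    parts = raw.split()
    s = parts[0].replace("-", "").upper() if parts else ""
    if re.fullmatch(r"\d{13}", s):
        return s
    if re.fullmatch(r"\d{9}[\dX]", s):
        # Drop the ISBN-10 check digit, add the 978 prefix, recompute the check digit
        core = "978" + s[:9]
        total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(core))
        return core + str((10 - total % 10) % 10)
    return None

def match_keys(record):
    """Collect (type, value) match keys from 020 $a and 035 $a of a simplified record."""
    keys = set()
    for field in record:
        value = field.get("sub", {}).get("a", "")
        if field["tag"] == "020" and to_isbn13(value):
            keys.add(("isbn", to_isbn13(value)))
        elif field["tag"] == "035" and value.strip():
            keys.add(("035", value.strip()))
    return keys

def records_match(a, b):
    """True if the two records share at least one match key."""
    return bool(match_keys(a) & match_keys(b))
```

This sketch does not verify check digits, so a wrong but well-formed ISBN would still produce a key; that is exactly how wrong identifiers lead to wrong matches and duplicates.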
OSMOSIS Report (11/2014
[Two figures, each labelled "020 (ISBN)"]
Recommendations
• Check and clean data in vendor files before
loading to your catalogue.
• Follow national and international standards in
all aspects.
• Perform regular database maintenance.
• Encourage cooperation between libraries and
vendors/publishers.
References
Beall, J. (2005). 10 ways to improve data quality: with a
coordinated effort, your library can make significant progress in
cleaning up its online catalog. American Libraries, 36(3), 36+.
Retrieved from
http://go.galegroup.com.ezproxy.auckland.ac.nz/ps/i.do?id=GALE%7CA139719467&v=2.1&u=learn&it=r&p=AONE&sw=w&asid=8bc9b1a0d979542543f18fc581b25da2
Rahm, E. (2004). Data Cleaning: Problems and Current
Approaches. In Galindo, F., Takizawa, Makoto, & Traunmüller,
R. (2004). Database and expert systems applications: 15th
International Conference, DEXA 2004, Zaragoza, Spain, August
30-September 3, 2004: Proceedings (Lecture notes in computer
science; 3180). Berlin; New York: Springer.
Tang, N. (2014). Big Data Cleaning. In Chen, L. (2014).
Web technologies and applications: 16th Asia-Pacific Web
Conference, APWeb 2014, Changsha, China, September 5-7,
2014. Proceedings (Lecture notes in computer science; 8709).
Image credits
• http://www.bluewolfconsulting.co.uk/blog/data-doesn-t-have-be-dirty-four-letter-word
• https://www.flickr.com/photos/epublicist/8718123610
• http://www.dreamstime.com/
Thank you
ขอบคุณ
Ksenija Mincic-Obradovic
k.obradovic@auckland.ac.nz
