Data Clean-up: Is There A Better Way?

Moving data about library resources among systems often engenders data cleanup processes. What is the best way to clean up data? Which tools and skills for non-programmers can help? See how University of California, Riverside Libraries tackle this issue, then share tips and techniques in an open forum.

  • Creator Camille Pissarro Title Peasant Woman Digging Date 1882 Material oil on canvas Measurements 65 x 54 cm Description Inscription: signed and dated lower left: C. Pissarro. 82 Repository Private Collection ARTstor Collection Art, Archaeology and Architecture (Erich Lessing Culture and Fine Arts Archives) ID Number 40-06-12/42 Source Image and original data provided by Erich Lessing Culture and Fine Arts Archives/ART RESOURCE, N.Y. http://www.artres.com/c/htm/Home.aspx http://www.artres.com/c/htm/TreePfLight.aspx?ID=LES Rights Photo Credit: Erich Lessing/ART RESOURCE, N.Y.
  • Creator Author: Abu'l Izz Isma'il al-Jazari; Copyist: Farkh ibn `Abd al-Latif Culture Islamic Title Book of the Knowledge of Ingenious Mechanical Devices by al-Jazari Work Type Illustrated Manuscript, Folio Period Mamluk period (1250-1517) Date A.H. 715/ 1315-16 Location Made in: Syria Material Ink, colors, and gold on paper Measurements H. 11 13/16 in. x W. 7 3/4 in. (30 cm x19.7 cm) Credit Line The Metropolitan Museum of Art, Bequest of Cora Timken Burnett, 1956 (57.51.23) Image Copyright Notice Image © The Metropolitan Museum of Art Repository The Metropolitan Museum of Art http://www.metmuseum.org ARTstor Collection Metropolitan Museum of Art - Images for Academic Publishing ID Number 8718 Source Data From: The Metropolitan Museum of Art Rights This image was provided by The Metropolitan Museum of Art. Contact information: Image Library, The Metropolitan Museum of Art, 1000 Fifth Avenue, New York, NY 10028, (212) 396-5050 (fax), Scholars.License@MetMuseum.org Image © The Metropolitan Museum of Art
  • Creator John Vink Title CAMBODIA. Pourk (Siem Reap). 8/02/2005: Chantiers Ecoles runs skills development programs aimed at un-or less-educated villagers in Siem Reap province. A silkfarm trains villagers, here seen sorting cocoons, in all the different techniques in order to reestablish an age-old tradition in high quality silk weaving. Date 2005 Subject Pourk Cambodia ARTstor Collection Magnum Photos ID Number PAR287761.jpg Source Image and original data provided by Magnum Photos
  • Creator Miguel Rio Branco Title BRAZIL. 1985. Serra Pelada (Hill of Gold) gold mine, proved one of the richest deposits of alluvial gold ever found and it is considered to be one of the largest in the world. It is located 270 miles south of Belem on the Amazon delta. Serra Pelada mine is controlled by the State which distributes the "barrancos", 6 meters square soil each one, to the various owners "garimperos" (gold diggers or fortune hunters) according to their seniority. The "garimperos" are only allowed to dig vertically into the earth in order to avoid encroaching onto the other "barranco". Whenever gold is found is putted into sacks which are supervised at the edge of the "barranco". Each worker, "mudhog", is allowed to chose a sack as a premium for his work. Then the sacks are taken to a sifting and sorting area belonging to the owner. There are over 50,000 "garimperos" with their workers who enter to the gold mine every day. The mine is only open the dry season from September through January bescause of the nature of the landscape machinery. Date 1985 ARTstor Collection Magnum Photos ID Number PAR52903.jpg Source Image and original data provided by Magnum Photos http://www.magnumphotos.com/ Rights ©Miguel Rio Branco / Magnum Photos
  • Creator John Murphy American, 1888-1968, North American; American Title Men Digging Work Type Prints Date unknown Material wood engraving Measurements sheet: 12 3/16 x 14 11/16 in. (31 x 37.3 cm) Description Full View Repository Davis Museum and Cultural Center, Wellesley College Wellesley, Massachusetts, USA Museum purchase 1993.4 http://www.davismuseum.wellesley.edu/ ARTstor Collection Davis Museum and Cultural Center Collection (Wellesley College) Formerly in The AMICO Library ID Number DMCC.1993.4 Source Data From: Davis Museum and Cultural Center, Wellesly College Rights This image was provided by Davis Museum and Cultural Center, Wellesley College. Contact information: Jim Olson, Coordinator of Technology, Davis Museum and Cultural Center, Wellesley College, 106 Central Street, Wellesley, MA 02481, (781) 283-3234 (ph), (781) 283-2064 (fax), jolson@wellesley.edu.
  • Title Book of Hours. Use of Tournai. Folio #: fol. 007r Work Type Manuscript Date c. 1440 Material parchment School Flemish Description Calendar. March: peasant digging with mattock. Miniatures attributed to the Master of Guillebert of Metz. Repository Bodleian Library, University of Oxford http://www.bodley.ox.ac.uk/ Accession Number Shelfmark: MS. Rawl. liturg. e. 14 ARTstor Collection Manuscripts and Early Printed Books (Bodleian Library, Oxford University) ID Number Rawl.liturg.e.14_roll314.1_frame3 Source Image and original data provided by the Bodleian Library, University of Oxford. Rights Copyright Bodleian Library, University of Oxford.
  • ASCII includes definitions for 128 characters: 33 are non-printing control characters (now mostly obsolete) that affect how text and space is processed; 94 are printable characters, and the space is considered an invisible graphic. The most commonly used character encoding on the World Wide Web was US-ASCII until 2008, when it was surpassed by UTF-8. http://en.wikipedia.org/wiki/ASCII 12/22/2009
  • Creator unknown Title Sorting mail. Work Type glass negatives Date n.d. (circa 1885-1900) Material black and white photograph Measurements [no print] Description Inscription: [See individual photos for captions.] Related Item http://nrs.harvard.edu/urn-3:RAD.SCHL:sch00140 Subject United States Postal Service Animals Horse-drawn vehicles Horses Postal service Wagons ARTstor Collection The Schlesinger History of Women in America Collection Source Photograph Number: MC212-D-99 Arthur and Elizabeth Schlesinger Library on the History of Women in America (Radcliffe Institute for Advanced Study, Harvard University) Data From: The Schlesinger History of Women in America Collection Folder Number: D Folder Title: Marian Clark Nichols, 1873-1953: Civil service reform activities: Lantern slides [and glass plate negatives] re: civil service reform, including Thomas Nast cartoons, state and federal statisticas and legislation, and pictures of workers Collection Number: MC212 Collection Name: Emerson, Eugenie Homer, 1854-1940 Collection Title: Papers, 1806-1953 (inclusive) Rights This image has been made available by the Schlesinger Library, Radcliffe Institute, Harvard University solely for noncommercial educational and scholarly purposes. Your use of this image is restricted to those permitted uses specified in the ARTstor Digital Library Terms and Conditions of Use (http://www.artstor.org/info/about/terms_conditions.jsp). To request permission for any other use, please contact the Schlesinger Library at slref@radcliffe.edu. Download Size 1024,1024
  • Title Bayeux Tapestry; scene 35,a: Duke William Orders Ships to Be Built; detail of chopping trees Work Type textile Date c. 1070-80 Material wool embroidery on linen Measurements 53 cm x 69 m Style Period Romanesque Repository Centre Guillaume le Conquérant, Bayeux, France ARTstor Collection Art, Archaeology and Architecture (Erich Lessing Culture and Fine Arts Archives) ID Number 31-01-01/23 Source Image and original data provided by Erich Lessing Culture and Fine Arts Archives/ART RESOURCE, N.Y. http://www.artres.com/c/htm/Home.aspx http://www.artres.com/c/htm/TreePfLight.aspx?ID=LES Rights Photo Credit: Erich Lessing/ART RESOURCE, N.Y. Please note that if this image is under copyright, you may need to contact one or more copyright owners for any use that is not permitted under the ARTstor Terms and Conditions of Use or not otherwise permitted by law. While ARTstor tries to update contact information, it cannot guarantee that such information is always accurate. Determining whether those permissions are necessary, and obtaining such permissions, is your sole responsibility. Download Size 1024,1024
  • Creator Arnold, Eve Title Hsishuang Panna Weeding Date 1979 Location China Subject China Agriculture Farmers Mountains Photography--20th C. A.D documentary books farms ARTstor Collection ARTstor Slide Gallery Source Data from: University of California, San Diego Download Size 1024,1024
  • Culture Etruscan Title Shoes (Sandals) Work Type woodwork Date 6th century BCE Material wood, bronze Description from Bisenzio Olmo Bello Tomb XVIII Repository Museo nazionale di Villa Giulia ARTstor Collection Italian and other European Art (Scala Archives) Source Image and original data provided by SCALA, Florence/ART RESOURCE, N.Y. http://www.artres.com/c/htm/Home.aspx http://www.scalarchives.com Rights (c) 2006, SCALA, Florence / ART RESOURCE, N.Y.
  • Culture German (Nuremberg) Title Tournament Book Date late 16th century Material Pen and colored wash on paper Credit Line The Metropolitan Museum of Art, Rogers Fund, 1922 (22.229) Image Copyright Notice Image © The Metropolitan Museum of Art Repository The Metropolitan Museum of Art http://www.metmuseum.org ARTstor Collection Metropolitan Museum of Art - Images for Academic Publishing ID Number 3389 Source Data From: The Metropolitan Museum of Art Rights This image was provided by The Metropolitan Museum of Art. Contact information: Image Library, The Metropolitan Museum of Art, 1000 Fifth Avenue, New York, NY 10028, (212) 396-5050 (fax), Scholars.License@MetMuseum.org Image © The Metropolitan Museum of Art
  • Creator Anonymous Artists Title Digging for Coal Upon Seeing a Swallow Guarantees Freedom from Fever and Headaches for a Year Series Title BUCH DER TUGEND | Book of Virtue Date 1486 Technique woodcut Description SUPERSTITION ARTstor Collection The Illustrated Bartsch ID Number 8586.1486/154 SCHRAMM, 23.639 Source The Illustrated Bartsch. Vol. 85, German Book Illustration before 1500: Anonymous Artists, 1484-1486 Retrospective conversion of The Illustrated Bartsch (Abaris Books) by ARTstor Inc. and authorized contractors Download Size 1024,1024

Presentation Transcript

  • Data Clean-Up: Is there a Better Way? Margaret Hogarth ER&L, 2/2/2010 © Camille Pissarro, 1882, “Peasant Woman Digging,” ARTstor 40-06-12/42
  • To Be Covered:
    • Commonalities
    • Issues
    • Deficits
    • Excel
    • Access
    • MarcEdit
    • Global Update
    • Data Quality Policy
    • Suggestions?
    © The Metropolitan Museum of Art, “Book of the Knowledge of Ingenious Mechanical Devices by al-Jazari,” ARTstor
  • Commonalities
    • Data sources
    • Tools
    • Data clean-up
    • Capture and use issues
    • Needed technical skills
    • bonobokids.com
  • General Approach
    • Keep original data original; copy to a new worksheet
    • Save frequently
    • Use meaningful file/worksheet names
    • Folder of “final” documents
    © John Vink, 2005, "CAMBODIA," ARTstor PAR287761
  • Issues
    • Import issues
    • System limitations
    • Dirty data
    • Non-standardized data
    • Standardized but variable data
    • Application-related issues
    • Others?
    © Miguel Rio Branco, 1985, "BRAZIL," ARTstor PAR52903
  • Deficits
    • Time
    • Staff
    • Budget
    • Skills
    • Development
    • Confidence
    • Big-picture view
    • Systems
    © John Murphy, “Men Digging," ARTstor DMCC.1993.4
  • System Limitations
    • Field character limits
    • Report field limits
  • Import => Excel, Zeros
    • Problem:
    • Loss of trailing zero
    • Loss of leading zeros
    • Solution:
    • Import > Delimited > Tab > Text format
    • Example:
    • 1944826 =>1944-8260
    • 14826 => 0001-4826
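The same idea, reading every column as text so that zeros survive, can be sketched outside Excel. This is a minimal illustration using Python's standard csv module; the sample data and the column names `title` and `issn` are hypothetical, not taken from the presentation.

```python
import csv
import io

# Hypothetical tab-delimited export; ISSNs are stored without hyphens.
SAMPLE = "title\tissn\nJournal A\t19448260\nJournal B\t00014826\n"


def read_rows_as_text(handle):
    """Read tab-delimited rows, keeping every value as a string."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = read_rows_as_text(io.StringIO(SAMPLE))
issns = [row["issn"] for row in rows]

# Converting to a number is what drops the leading zeros.
as_number = int(issns[1])
```

Because the csv module never converts values to numbers on its own, the identifiers stay intact unless the script converts them explicitly.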
  • Add Hyphens to ISSNs - 1
    • Sort by ISSN to make sure leading zeros are intact.
    • In a new column type the formula =MID(A1,1,4)&"-"&MID(A1,5,4)
    • Syntax: =MID(text,start_num,num_chars)
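For comparison, the MID formula above can be sketched as a small Python function. The function name `hyphenate_issn` is hypothetical, and the length check is an added assumption so that values with lost zeros are flagged rather than silently mis-hyphenated.

```python
def hyphenate_issn(raw):
    """Insert a hyphen after the fourth character of an 8-character ISSN."""
    value = raw.strip()
    if len(value) != 8:
        # A shorter value usually means zeros were dropped on import.
        raise ValueError(f"expected 8 characters, got {len(value)}: {raw!r}")
    # Same slicing as MID(A1,1,4) and MID(A1,5,4) in the Excel formula.
    return value[:4] + "-" + value[4:]
```

For example, `hyphenate_issn("00014826")` gives `0001-4826`.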
  • Add Hyphens to ISSNs - 2
    • Or use Cell Formatting
    • Select range
    • Format Cells (CTRL+1) > Number tab > Custom > 0000-0000 > OK
  • Import => Excel, Numbers
    • Problem:
    • ID # garbled
    • Solution:
    • Choose Number format
    • Remove decimal places
  • Import => Excel, Commas
    • Problem:
    • Solution:
    Comma
  • Restore Dropped Leading Zeros
    • Change to TEXT format
    • Sort by ISSN
    • Use =CONCATENATE("000",A1) to add zeros (adjust the number of zeros to the number of digits missing)
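A scripted equivalent pads each value to a fixed width instead of adding a fixed number of zeros. This is a sketch only: the function name is hypothetical, and it assumes the missing digits were leading zeros. It cannot detect a lost trailing zero, so such values still need manual review.

```python
def restore_leading_zeros(value, width=8):
    """Left-pad an identifier with zeros up to the given width."""
    # str() lets the function accept values already converted to numbers.
    return str(value).strip().zfill(width)
```

Values that are already eight characters long pass through unchanged.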
  • Excel: Remove Quotes
    • Select column
    • Find & Select > Replace > Find what: [space]” > Replace with: [leave blank]
  • Remove Non-Printable Characters
    • Use TRIM (removes ASCII value 32 = space character, except single spaces between words)
    • Use CLEAN (removes ASCII codes 0-31; it does not remove Unicode 127, 129, 141, 143, 144, 157)
    • Use SUBSTITUTE for higher codes
    Book of Hours, c. 1440 “Use of Tournai,” © ARTstor, Rawl.liturg. e.14_roll314.1_frame3
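The three functions on this slide can be approximated in one pass. The sketch below is an illustration rather than a reproduction of Excel's behavior in every edge case; the function name `clean_text` is hypothetical, and treating codes 127, 129, 141, 143, 144 and 157 as removable is an assumption based on the list above.

```python
import re

# Higher codes listed on the slide, handled in Excel with SUBSTITUTE.
EXTRA_NONPRINTING = {127, 129, 141, 143, 144, 157}


def clean_text(value):
    """Drop non-printing characters, then trim and collapse spaces."""
    # Like CLEAN: drop codes 0-31. Like SUBSTITUTE: drop the extra codes.
    kept = "".join(
        ch for ch in value
        if ord(ch) > 31 and ord(ch) not in EXTRA_NONPRINTING
    )
    # Like TRIM: collapse runs of spaces to one and strip both ends.
    return re.sub(" +", " ", kept).strip(" ")
```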
  • ASCII Characters
    • http://en.wikipedia.org/wiki/ASCII 12/22/2009
  • Access: Subscript Out of Range - 1
    • Try These Steps:
    • Check for spaces in column headings
    • Use TRIM (removes ASCII value 32 = space character)
    • Use CLEAN for ASCII code 0-31
    • Delete empty right columns
    • Remove empty “used” cells:
    • Find end of "used cells": CTRL+SHIFT+END
    • Select all empty “used” cells > Edit > Clear > All or Edit > Delete. Save the file.
  • Access: Subscript Out of Range - 2
    • Copy and paste cells into a new workbook. Save. Import into Access.
    • Or save the file as CSV and import it into Access; the data error will then be visible.
    Unknown, circa 1885-1900, “Sorting Mail,” © ARTstor MC212-D-99
  • Access: Type Conversion Failure
    • Make sure data types in fields match data types in columns.
    • Data like ISBNs are text but can be “read” like numbers.
    • Add top row with correct data/type: XXX for ISBN
  • Access: Remove Quotes
    • Search for records with quotation marks: Criteria: LIKE "*" & Chr(34) & "*"
    • Replace([SomeField],Chr(34),"") will replace a quotation mark (") with a zero-length string
    © Erich Lessing, Bayeux Tapestry, c. 1070-80, ARTstor 31-01-01/23
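The find-and-replace pair above has a direct counterpart in Python, shown here as a minimal sketch. The function names are hypothetical; `chr(34)` is the same double-quote character that `Chr(34)` produces in Access.

```python
QUOTE = chr(34)  # the double quotation mark


def has_quote(value):
    """Mirror the LIKE "*" & Chr(34) & "*" criterion."""
    return QUOTE in value


def strip_quotes(value):
    """Mirror Replace([SomeField],Chr(34),"")."""
    return value.replace(QUOTE, "")
```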
  • Access: ISSN Issues
    • Find too-short ISSNs: Len([FieldName])<n [9 is good here]
    • Find ISSNs without hyphens: SELECT table.field, table.field FROM table WHERE (((table.field) Not Like "*-*"));
    © Eve Arnold, 1979, “Hsishuang Panna Weeding,” ARTstor
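Both checks can be combined into one report per value. The following is a sketch under the slide's assumption that a well-formed, hyphenated ISSN is nine characters long; the function name and the problem labels are hypothetical.

```python
def issn_problems(issn, expected_length=9):
    """Return a list of problems found in one ISSN value."""
    problems = []
    # Same test as Len([FieldName]) < 9 in Access.
    if len(issn) < expected_length:
        problems.append("too short")
    # Same test as Not Like "*-*" in Access.
    if "-" not in issn:
        problems.append("no hyphen")
    return problems
```

An empty list means the value passed both checks; it does not confirm that the ISSN itself is valid.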
  • Access & ARL Stats - 1
    • =Sum([YTD Total]) Sum of article downloads in COUNTER Journal 1 report.
    • =[Jan-09]+[Feb-09]+[Mar-09]+[Apr-09]+[May-09]+[Jun-09] Sum of Jan-Jun 2009 COUNTER J1.
    • =[Jul-09]+[Aug-09]+[Sep-09]+[Oct-09]+[Nov-09]+[Dec-09] Sum of Jul-Dec 2009 COUNTER J1.
  • Access & ARL Stats - 2
    • RowCount:Count(*) Number of titles in a set.
    • =[Cost]/[YTD Total] Annual cost-per-use.
    • Access Expressions: http://office.microsoft.com/en-us/access/HA011814491033.aspx
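The same calculations can be sketched for a single title's usage row. The month labels follow the field names in the expressions above, while the function names and the dictionary layout are hypothetical. Cost-per-use is taken here as cost divided by total uses, which is an assumption about the intended calculation.

```python
MONTHS_H1 = ["Jan-09", "Feb-09", "Mar-09", "Apr-09", "May-09", "Jun-09"]
MONTHS_H2 = ["Jul-09", "Aug-09", "Sep-09", "Oct-09", "Nov-09", "Dec-09"]


def half_year_totals(row):
    """Sum the monthly download counts for each half of 2009."""
    first = sum(row[month] for month in MONTHS_H1)
    second = sum(row[month] for month in MONTHS_H2)
    return first, second


def cost_per_use(cost, total_uses):
    """Divide cost by uses; return None when there was no use."""
    if total_uses == 0:
        return None
    return cost / total_uses
```

Returning None for zero use avoids a division error and leaves it to the analyst to decide how unused titles should be reported.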
  • Access or Excel?
    • Access:
    • Relational
    • Large amount of data
    • Primary key
    • Many people working
    • Long text strings
    • Excel:
    • Non-relational
    • Mostly numeric
    • Calculations/Statistics
    Nelson, Emma. 2010. Using Access or Excel to Manage Your Data. http://office.microsoft.com/en-us/help/HA010429181033.aspx See also: Microsoft. 2010. Examples of Expressions. http://office.microsoft.com/en-us/access/HA011814491033.aspx
  • XML - Excel
    • Excel can interpret XML
    • Data > Get External Data > From XML Data Import
    • Format without affecting source data
    Later Excel: Activate Developer tab through Office logo (upper left)
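As a rough illustration of what an XML import does, the sketch below flattens simple XML records into rows that could then be written to a spreadsheet. The element names `journals`, `journal`, `title` and `issn` are hypothetical, and real exports are often nested more deeply than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat XML export, one <journal> element per record.
SAMPLE_XML = """<journals>
  <journal><title>Journal A</title><issn>0001-4826</issn></journal>
  <journal><title>Journal B</title><issn>1944-8260</issn></journal>
</journals>"""


def xml_to_rows(xml_text, record_tag="journal"):
    """Turn each record element into a dict of child tag -> text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "") for child in record}
        for record in root.iter(record_tag)
    ]
```

The source XML is only read, never modified, which matches the point about formatting without affecting source data.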
  • MarcEdit: XML - 1
    • Convert large XML files to Excel
    • Specify input, output files
    • Choose MARC21XML => MARC
  • MarcEdit: XML - 2
    • Choose display fields, input, output files
    • View, format in Excel
  • MarcEdit: MARC - 1
    • Convert large files to local practices
    • DELETE existing
    • 999 field
    • 910 field(s)…
    • Copy 035 to 001
    • Remove (Sc-P) Prefix from 001
    • ADD
    •   910 |aDEL SCP ; jc ; 2009/7/8
    • Field: 910
    • Field data: $aDEL SCP ; jc ; 2009/7/8
    • 998 |an
    • Field: 998
    • Field data: $an
  • MarcEdit: MARC - 2
  • MarcEdit Information
    • By Terry Reese
    • http://oregonstate.edu/~reeset/marcedit/html/
    • MARCEDIT-L listserv at  https://listserv.gmu.edu/cgi-bin/wa?SUBED1=marcedit-l&A=1   
    • Regular updates
    • Tutorials, templates, scripts
  • ILS: “Global Update”
    • For records within ILS system
    • For universal changes
    • “Check website for coverage.”
  • A Better Way? Macros
    • microsoft.public.excel (General Excel group) http://groups.google.com/groups/dir?sel=33606583&hl=en
    • OzGrid Forum (Excel tips and VBA macros) http://www.ozgrid.com/forum/
    • http://www.lib-stats.org.uk/ (statistics listserv)
    • [Courtesy Tansy Matthews]
  • Data Quality
    • Strategies to Improve Data Quality:
    • Identify problems
    • Treat data as an asset
    • Implement quality systems
    • Principal Activities for Data:
    • Acquire
    • Store
    • Use
  • Poor Quality Data Indicators
    • Uncorrected errors
    • Redundant data/ processes
    • Lack of data for strategizing
    • Frustration with data, data supplier, IT
    (c) 2006, SCALA, “Shoes,” 6th century BCE, ARTstor
  • Treat Data as an Asset
    • Inventory data assets
    • Data = dynamic; process = asset
    • Align responsibilities: acquire, store, use data.
    • Establish customer-supplier relationships for data.
    © The Metropolitan Museum of Art, “Tournament,” late 16th century, ARTstor
  • Apply Quality Principles
    • Create and keep a customer
    • Detect and correct errors
    • Determine root cause of defects
    • Manage the process
    • Communicate results
    • Audit supplier performance
  • Library Data Quality Policy - 1
    • Suppliers/Creators:
    • Understand users, uses, & requirements
    • Ensure requirements are met
    • Manage data creation process
    • Data Processors:
    • Avoid duplication
    • Safeguard data
    • Make data accessible
    • Promote data quality in IT
  • Library Data Quality Policy - 2
    • Users:
    • Define requirements, work with suppliers
    • Provide feedback
    • Interpret data correctly
    • Use data legitimately
    • Protect privacy
    • Logistics:
    • Determine master systems
    • Understand system limitations
    • Accessible storage
    • Match inputs with needs
    • Identify key keepers
  • [email_address] 951-827-2937
    • Digging for Coal, The Illustrated Bartsch, vol 85, 1486, ARTstor 8586.1486/154
    Other Techniques?
  • Bibliography
    • Microsoft, 2009. Top Ten Ways to Clean Your Data. http://office.microsoft.com/en-us/excel/HA102218401033.aspx, accessed 12/18/2009.
    • Use error checking to convert numbers that are stored as text to numbers. http://office.microsoft.com/en-us/excel/HP012167611033.aspx, accessed 12/22/2009.
    • Apply a number format to numbers that are stored as text http://office.microsoft.com/en-us/excel/HP012167611033.aspx
    • Redman, Thomas C. 1995. Improve Data Quality for Competitive Advantage. Sloan Management Review, 36:2, 99-107.
    • Rothschiller, Chad. 2007. Manipulating and Massaging Data in Excel. http://blogs.msdn.com/excel/archive/2007/11/12/manipulating-and-massaging-data-in-excel.aspx, accessed 12/18/2009.
    • Spencer, John. March 6, 2008. Find/replace characters like quotes. http://www.eggheadcafe.com/software/aspnet/31782118/findreplace-characters-l.aspx, accessed 12/22/2009.