Taming the Wilde 
Collaborating with Expertise for 
Faster, Better, Smarter 
Collection Analysis 
Jackie Bronicki, Collections and Online Resources Coordinator 
Cherie Turner, Chemical Sciences Librarian 
Shawn Vaillancourt, Education Librarian 
Frederick Young, Systems Analyst
Outline of Presentation
Research Questions 
Research Question 1: 
What are the best measurements for evaluating 
the current scope of the collection? 
Research Question 2: 
What subject areas are not adequately covered 
in the current collection?
Methodology 
• Influenced by the Cornell 
University Library Print 
Collection usage report 
• No language analysis 
• No patron analysis 
• Limited formats
Results 
• 889,825 total monograph items in final dataset 
• 425,865 titles that have not circulated (48%) 
• 787,590 titles circulated 5 or fewer times (88%) 
• 861,910 titles that have not circulated in the 
last year (97%)
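The slide's rounded percentages follow directly from the item counts; a quick check (Python, one decimal place rather than the slide's whole-number rounding):

```python
# Headline figures from the results slide.
total = 889_825          # monograph items in the final dataset
never = 425_865          # titles that have never circulated
five_or_fewer = 787_590  # titles circulated 5 or fewer times
quiet_year = 861_910     # titles with no circulation in the last year

for label, n in [("never circulated", never),
                 ("5 or fewer checkouts", five_or_fewer),
                 ("no checkouts in last year", quiet_year)]:
    print(f"{label}: {n / total * 100:.1f}%")
```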
[Chart: Distribution by LC Class — monograph item counts (0 to 250,000) across LC classes A through Z]
Comparing Circulation to ILL Usage 
PEU = Percent Usage / Percent of Holdings

For subclass B:
Percent of Holdings(B) = 1.32%
Percent Usage(B) = 1.43%
PEU(B) = 1.43% / 1.32% = 1.08

If PEU > 1: Overused
If PEU < 1: Underused

RBH = Percent of ILL Borrowing / Percent of Holdings

Percent of ILL Borrowing(B) = 0.79%
RBH(B) = 0.79% / 1.32% = 0.60

Mean RBH = 1.54 ± 5.18
If RBH > Mean RBH: Overused
If RBH < Mean RBH: Underused
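The two ratios are simple enough to sketch in Python, using the subclass B figures from this slide (percentages entered directly):

```python
def peu(percent_usage, percent_holdings):
    """Percentage of Expected Use: circulation share relative to holdings share."""
    return percent_usage / percent_holdings

def rbh(percent_ill_borrowing, percent_holdings):
    """Ratio of Borrowing to Holdings: ILL borrowing share relative to holdings share."""
    return percent_ill_borrowing / percent_holdings

# Subclass B: 1.32% of holdings, 1.43% of circulation, 0.79% of ILL borrowing.
peu_b = peu(1.43, 1.32)  # greater than 1, so "overused" by circulation
rbh_b = rbh(0.79, 1.32)  # below the mean RBH of 1.54, so "underused" via ILL
print(round(peu_b, 2), round(rbh_b, 2))
```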
Comparing Circulation to ILL Usage 
LC Subclass | Percent of Holdings | Percent Usage | PEU | Holdings Usage | Percent of ILL Borrowing | RBH | ILL Usage
B  | 1.32% | 1.43% | 1.08 | Overused  | 0.79% | 0.60 | Underused
BC | 0.09% | 0.08% | 0.82 | Underused | 0.05% | 0.51 | Underused
BD | 0.24% | 0.20% | 0.84 | Underused | 0.24% | 1.01 | Underused
BF | 1.22% | 1.78% | 1.46 | Overused  | 2.00% | 1.64 | Overused
BH | 0.07% | 0.09% | 1.29 | Overused  | 0.05% | 0.68 | Underused
BJ | 0.22% | 0.27% | 1.21 | Overused  | 0.18% | 0.79 | Underused
BL | 0.42% | 0.65% | 1.56 | Overused  | 0.69% | 1.65 | Overused
BM | 0.10% | 0.07% | 0.67 | Underused | 0.09% | 0.95 | Underused
BP | 0.13% | 0.26% | 1.95 | Overused  | 0.34% | 2.57 | Overused
BQ | 0.04% | 0.10% | 2.63 | Overused  | 0.32% | 8.05 | Overused
BR | 0.36% | 0.33% | 0.91 | Underused | 0.70% | 1.96 | Overused
BS | 0.22% | 0.16% | 0.73 | Underused | 0.36% | 1.62 | Overused
BT | 0.16% | 0.13% | 0.85 | Underused | 0.40% | 2.53 | Overused
BV | 0.18% | 0.15% | 0.86 | Underused | 0.44% | 2.49 | Overused
BX | 0.52% | 0.29% | 0.56 | Underused | 1.69% | 3.23 | Overused

If PEU > 1: Overused; if PEU < 1: Underused
If RBH > Mean RBH: Overused; if RBH < Mean RBH: Underused
Mean RBH = 1.54 ± 5.18
Comparing Circulation to ILL Usage 
LC Subclass | Holdings Usage | ILL Usage | Action
B  | Overused  | Underused | No Changes
BC | Underused | Underused | Ease Off
BD | Underused | Underused | Ease Off
BF | Overused  | Overused  | Growth Opportunity
BH | Overused  | Underused | No Changes
BJ | Overused  | Underused | No Changes
BL | Overused  | Overused  | Growth Opportunity
BM | Underused | Underused | Ease Off
BP | Overused  | Overused  | Growth Opportunity
BQ | Overused  | Overused  | Growth Opportunity
BR | Underused | Overused  | Change Purchasing
BS | Underused | Overused  | Change Purchasing
BT | Underused | Overused  | Change Purchasing
BV | Underused | Overused  | Change Purchasing
BX | Underused | Overused  | Change Purchasing
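The mapping from the two usage labels to a recommendation can be written as a small decision rule; a sketch using the thresholds from the slides (PEU > 1 and RBH above the collection-wide mean of 1.54):

```python
MEAN_RBH = 1.54  # baseline mean RBH from the study

def action(peu, rbh, mean_rbh=MEAN_RBH):
    """Combine the circulation signal (PEU) and ILL signal (RBH) into an action."""
    circ_over = peu > 1         # collection circulates more than its holdings share
    ill_over = rbh > mean_rbh   # ILL borrowing above the collection-wide mean
    if circ_over and ill_over:
        return "Growth Opportunity"  # demand exceeds both local supply and the ILL norm
    if circ_over:
        return "No Changes"          # collection is meeting demand
    if ill_over:
        return "Change Purchasing"   # users go to ILL instead of our holdings
    return "Ease Off"                # little demand either way

print(action(1.46, 1.64))  # BF
print(action(0.56, 3.23))  # BX
```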
The More Important Question…
Initial Challenges – Research Team 
• Sierra Infrastructure 
– What data existed where? 
– Title vs. Item 
– Call Number 
• Defining Input/Output Variables 
– What we could output (circulation) 
• MARC 
• Scope of Project 
• Building a proper sample
Challenges to Possibilities 
• Understanding the 
question 
• Does the System 
Provide an Answer? 
• What can we do?
Data Mining Challenges – Research Team 
• High Expectations 
• Inconsistency of Data 
– Bad input 
– Batch overlay 
– Doesn’t exist
Data Mining Challenges – Systems Team 
• Scaled Expectations 
• Learning curve 
• Piecing the Data 
Together
Research Questions 
Research Question 1: 
What are the best measurements for evaluating 
the current scope of the collection? 
Research Question 2: 
What subject areas are not adequately covered 
in the current collection?
Initial Output Criteria 
Bibliographic Record 
Call Number 
Subject Headings 
Publication/Copyright Date 
ISBN 
Record Number 
Title 
Item Record 
Copy Number 
Total Number of Checkouts 
Status 
Order Record 
Order Date
Final Output Criteria 
Bibliographic Record 
Item Record 
Call Number 
Total Checkouts 
Last Year Checkouts 
Year to Date Checkouts 
Location 
Call Number 
Publication/Copyright Date 
Record Number 
Title 
Publisher 
Catalog Date 
ISBN
ILL Output Criteria 
• Fields for our analysis 
– Call Number 
– Request Date 
– Filled Date 
– Format 
• Fields for later analysis 
– Lending Library 
– Title 
– Author 
– Publication Date 
– Publisher 
– Language 
– Library Type 
– ISBN 
– OCLC Number
Except… 
Got Data? 
• What was MARC telling us? 
• How were fields used?
Data Cleaning 
• ISBN? 
• Location: 143,823 records deleted 
• Call numbers: 14,894 records deleted
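The two cleanup passes (drop excluded locations, then drop blank or non-LC call numbers) can be sketched as below. The location codes, call numbers, and exclusion list here are hypothetical; the real lists came from local practice:

```python
import re

# Hypothetical item export: (location, call_number) per item record.
items = [
    ("stacks", "QA76.9 .D3"),
    ("govdoc", "Y 4.ED 8:1"),   # SuDoc classification, not LC -> excluded
    ("stacks", ""),             # blank call number -> excluded
    ("thesis", "COLL 2014"),    # college designator, not LC -> excluded
    ("stacks", "PR6045 .I72"),
]

# Pass 1: drop locations outside the LC-classified circulating collection
# (government documents, theses/dissertations, etc.).
EXCLUDED_LOCATIONS = {"govdoc", "thesis"}
kept = [rec for rec in items if rec[0] not in EXCLUDED_LOCATIONS]

# Pass 2: drop records whose call number is blank or not a plausible
# LC call number (one to three letters followed by a digit).
LC_CALL = re.compile(r"^[A-Z]{1,3}\d")
kept = [rec for rec in kept if LC_CALL.match(rec[1])]

print(kept)  # only the LC-classified stacks items remain
```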
Lessons Learned 
• Understanding the infrastructure 
– Order records 
– Bib records 
• MARC 
– Item records 
• Understanding local practice 
• Experts provide guidance and practical 
solutions!
Print and Electronic Serials 
• Challenges 
– Different systems store records 
– Different kinds of usage information available 
– Holdings based analysis 
– Subscription or Subscription + Aggregated 
– Vendor supplied records
Aguilar, W. (1986). The application of relative use and interlibrary demand in 
collection development. Collection Management, 8(1), 15-24. 
Knievel, J. E., Wicht, H., & Connaway, L. S. (2006). Use of circulation statistics 
and interlibrary loan data in collection management. College & Research 
Libraries, 67(1), 35-49. 
Ochola, J. N. (2003). Use of circulation statistics and interlibrary loan data in 
collection management. Collection Management, 27(1), 1-13. 
DOI: 10.1300/J105v27n01_01 
Mills, T. R. (1982). The University of Illinois Film Center Collection Use 
Study. http://files.eric.ed.gov/fulltext/ED227821.pdf 
"Report of the Collection Development Executive Committee Task Force on 
Print Collection Usage." (2012). Cornell University Library. 
http://staffweb.edu/system/files/CollectionUsageTF_ReportFinal11-22-10.pdf


Editor's Notes

  • #2 Today we’ll be talking about a recent research project we undertook at the University of Houston. In mid-2013 a group of liaison librarians were tasked with developing a large-scale collection assessment project. We decided to undertake a “Gap” Analysis, where we benchmarked the collection at a point in time and checked for gaps in coverage per call number range. We describe gap analysis as looking for the holes in our collection. What we quickly learned is we didn’t have the systems expertise to build a dataset that would accurately reflect the current collection and we needed to quickly collaborate with that expertise to be successful in our research.
  • #3 So for purposes of this presentation we have structured the slides a bit differently. We are essentially going to start from the end point which is our results and then focus on how we got there. We think the results are illuminating but the process to get there was even more interesting.
  • #4 We identified two major research questions to guide our research and help us develop the proper methodology.
  • #5 Much of the initial part of our study was based on the methodology Cornell used in their print collection analysis. They had just performed a benchmark of their collection, which we aimed to repeat as closely as possible, using the same assumptions and definitions as the CUL study. We used the LC call number as a proxy for subject and parsed the alphabetical part from the numerical part, so our data can be as broad or as granular as we need it to be. We used circulation data as a proxy for usage, which is fairly standard. We made no assumptions about language usage; as noted earlier, we have not done any language analysis at this point, but the same assumptions will hold when we explore it later. There were obvious differences between Cornell's study and ours: their system allowed them to get more stratified data by year, while we were limited by our systems and proprietary local practices.
  • #7 Distribution of our monograph collection by LC Class. We can also drill down into the subclass for each class but we aren’t showing that here. Notable sections: P – Language and Literature H – Social Sciences Q – Science T – Technology
  • #8 Next I’ll talk a little about how we were able to compare use of our collection to use of our ILL services to get a little more insight into gaps in our collection. The inspiration for this analysis came from a 2005 paper from Mortimore (UNC Greensboro), and the formula to do the analysis was taken from a 2003 paper from Ochola (Baylor University), who in turn derived the component ratios from Mills (1981) and Aguilar (1986).
    Percent of Holdings – number of titles in the subclass / total number of titles
    Percent Usage – number of circulations for titles in the subclass / total number of circulations
    PEU (Percentage of Expected Use; the percentage of the expected use that was actually attained, i.e., the ratio of the actual use to the expected value) – Percent Usage / Percent of Holdings
    Percent of ILL Borrowing – number of completed ILL requests in the subclass / total number of completed ILL requests
    RBH (Ratio of Borrowing to Holdings) – Percent of ILL Borrowing / Percent of Holdings
  • #9 When the PEU, the ratio of the percent of use to the percent of the collection for each subclass, is greater than one, we know that that subclass is getting more use than we should expect based on the size of the collection in that area. We call that “overused.” If the ratio is less than one, then we’re not seeing as much use as we would anticipate, so we call that “underused.”
    Next we’ll take a closer look at our ILL borrowing data. Just as with holdings usage, we can make some generalizations about our collection based on usage of the ILL service. Because we don’t have as much of a baseline for what our ILL borrowing should look like as we do for our purchased collection, we need to set one. In this case, that is the average value of RBH for our collection: Mean RBH = 1.54, SD = 5.18. One point worth mentioning is that our SD is very high, which indicates a lot of variance in our ILL usage across call number ranges.
    Now that we have a baseline, we can compare the RBH for each subclass to the mean to get some estimate of proportional use for each subject. If a subclass’s RBH is higher than the mean, that demonstrates strong use of ILL services, with larger numbers of course meaning greater use. If it is lower than the mean, then we can assume there isn’t terribly strong ILL use for that class or subclass.
  • #10 So, now we know how our users are using our holdings and how they are using ILL for each subclass. We can put those two pieces of information together to provide some recommendations on what we could do to improve the collection. For classes that have strong use of our own collection, and little ILL use, we can assume that our collection is meeting most needs. If users aren’t using our collection, and they aren’t using ILL for a given subclass, there probably just isn’t much demand, so we can slow down on purchasing in that area. If users are using our collection, and they are also using ILL there is clear demand, ideally we might start purchasing more in that area. And if our users aren’t using our collections, but they are using ILL for a subclass, we might be collecting the wrong things for that subject! This analysis does generalize on the use of our collections, but it also provides insight that many selectors may not have from their typical collection development work. Given a recommendation, particularly one like “Change Purchasing” a selector can then look into the more detailed data obtained for this project, figure out what is going wrong in the collection process and use that information to target areas where we have a demonstrated unmet need.
  • #11 The challenge lay not in getting data but how to quickly get the right data that was accurate and could tell the right story. Manual collection was out of the question for a collection this size, hence our need to collaborate with someone that could get us what we need and fast!
  • #12 Systems Team:
    Recent migration to Sierra: adjusting to the new setup (admin/end users). New database structure, with pros and cons: from a closed, proprietary structure to an open, accessible one, which made this project more feasible. Mid-project entry.
    System limitations: limited system reports; circulation history; the repository of system data was not robust enough.
  • #13 Systems Team: providing the requested data on a shared system (3 campuses and 7 libraries), with the project team looking at a subset of those collections; some records are also shared.
    Does the system provide an answer? We reviewed system reports to see if they provided the data required: yes and no. The reports provided most of the data, but no one report provided all the data requested. Also, with some reports the shared records got in the way of looking at specific circulation data.
    Recent migration to Sierra: from a closed, proprietary database structure to an open-source, SQL-based structure. This provided better access to data and allowed us to filter for most of the data the project team needed.
  • #14 Systems Team:
    Scaled expectations: when a project is already underway, it is sometimes difficult to recommend change (“I understand you want X, but this is what the system provides”; “Hmm… you might want to rethink that because…”): order records, subject headings, number of copies. The team was very understanding about limitations and willing to scale back the project.
    Learning curve: MARC data vs. SQL; not a SQL expert; data tables and piecing the data together; providing sample data and getting feedback.
    MARC to flat file: inconsistencies in MARC input; extraneous data (ISBN); pulling together subfield data; eliminating subfield data.
  • #15 Scaled expectations: limits on the amount of data stored; excluding some data. Order records (when an item was added to the system): there is no order record for every item in the system, and no direct correlation between orders and item records. Subject headings are great in MARC, but unwieldy in a flat file. The team was very understanding about limitations and willing to scale back the project.
    Learning curve: not a SQL guru, but an opportunity to learn more about it. Data tables and putting the pieces together; providing sample data and getting feedback.
    MARC to flat file: inconsistencies in MARC input; filtering extraneous data (ISBN); pulling together subfield data.
    My part in the project is done… or so I thought.
  • #16 Our plans for analyzing our data really depended on two major factors, firstly what our original research questions were, and what data we could get from the system in a meaningful way, or our output criteria. While we aren’t going to go into any further detail about our analysis we will take a look at how our research questions shaped our output criteria, which then shaped our analysis. The two major goals discussed in our research questions are benchmarking the collection, and getting some concept of adequate coverage of our collection. Adequate coverage is a complicated concept, and one that we sometimes struggled with, but our overall focus here is that we want to know whether our collection is meeting needs. A part of this is looking at how our current collection is used, and also at trying to determine if there are gaps. We found that interlibrary loan data was really essential to finding gaps. For this analysis we also looked at how the age of our collection impacts use, and in the future we could also approach how language impacts use. Subject librarians may also do analysis from our data on how different publisher’s materials are used in our collection.
  • #17 So, given our ultimate goals and a very basic understanding of how information is stored in the three different types of records for monographs at our library, we came up with this initial list of output criteria that we hoped to use for our analysis. When developing this list of potential output criteria the team was also doing some sample searches and outputting data, at this point unassisted by Frederick, so in some cases we had a reasonably good idea of what we were really getting from our systems, and sometimes not as much.
    Call number – primary subject proxy
    Subject headings – secondary subject proxy
    Publication/copyright date – measure of age
    ISBN – unique identifier
    Record number – unique identifier
    Title – later analysis by subject specialists
    Copy number – shows how many copies of a title we have
    Total number of checkouts – usage
    Status – ensure in circulation
    Order date – measure of when items were added to our collection
    Some items, like the call number and subject headings, we imagined would provide the meaningful subject divider that our research questions are based around. In the case of publication date we hoped to find out how the age of our collection varies by subject, as well as how that might impact use. The total number of checkouts is, obviously, our primary measure of usage for our collections. The remaining fields were included primarily in the hope that they might be useful for the data cleanup stage (for example, ISBN and record number could be used to help match items between different datasets if needed; status ensures that we are only looking at items in our circulating collection) or for deeper analysis (for example, title, which we hoped would feed future analysis by the subject librarians).
  • #18 So after some exploration into the data and a lot of help from Frederick we had a new set of output criteria.
    Call number – subject proxy; not always available in the bib record, sometimes in the item record, so pull both
    Publication/copyright date – age
    Record number – unique identifier
    Publisher – allows subject librarians the option to analyze by publisher
    Catalog date – for items newer than the last system change in 1994, shows when collected
    ISBN – unique identifier
    Total checkouts – total all-time usage
    Last year checkouts – ? Most recent calendar year checkouts?
    YTD checkouts – non-calendar-year checkouts (ex. January 31, 2013 to January 31, 2014)
    Location – location within the libraries
    Call number, where available, does provide a meaningful tool for subject analysis; however, we found that getting a complete list of call numbers from only the bibliographic record was impossible. We also had to use the item record’s call number, and found that in most cases a record for a print monograph would have either a bib call number or an item call number. Subject headings, also initially considered for defining the collection by subject, were never pulled from Sierra; it was decided that the analysis of these headings would be too complicated, and not worth the time! As discussed when we spoke about data cleanup, ISBN turned out to be very inconsistent, and not at all helpful for matching, so it was removed. The catalog date was added in the hope that it could serve as a way to tell when we purchased items, but since a new system was implemented in 1994 this was not actually an accurate measure of that for many items. We also added a field for YTD checkouts, which we hoped would give us a measure of how items are used over both the long term and the short term. And finally a location field was added. This field helped us to determine where physical items were located (ex. main library or a branch library) as well as helping us to clean up areas of our collection that did not have valid call numbers, like our theses and dissertations.
  • #19 Because our ILL data was so much more complete, we actually removed many fields because they weren’t necessary rather than because of issues in the data. For the data that we used to compare our ILL requests to usage of our collection we started with a report from Illiad created by our library’s head of ILL. Our final data-set there primarily includes fields that we needed for analysis (pretty much just the call number), and fields that may later be helpful for our subject librarians as they do deeper analysis.
  • #20 Thanks to clarifications from Frederick we were able to confirm some oddities in the data fields:
    bib_recordNUM – we thought we were missing characters, but the last character is a calculated check digit, so it is not needed for identification
    Last_yr_checkout – we thought we’d get some circ data here that we could use to time usage of an item, but Frederick confirmed the category has not been used
    YTD – one exact year of data, counting back from the day the report was run
    Is this really a call number? Some errors with how data may have been overlaid during record loads.
  • #21 Now that we had all of our data and knew what we were actually getting from our data retrieval, we could clean. A lot of our cleaning just involved looking at outliers in the data: things like impossible cataloging or publication dates, or diacritics in titles and publishers, and making sure that the oddities we were seeing didn’t affect the quality of the data in other areas we had exported from (i.e., no data corruption).
    In data cleaning, we had to make decisions to eliminate data from our initial set. Location involved removing quite a number of records, mostly because of government documents using SuDoc classifications instead of LC, which means they wouldn’t fit into our subject analysis. We also removed our dissertations and theses because they used college designators rather than traditional call numbers. Micro formats were usually not catalogued with a call number, so we ended up removing a lot of these too. Reserves??? We also removed stray items that were from other campuses but had somehow made it into our data.
    Once we had cleared out problem locations we were able to look at the remaining records for call numbers. There were nearly 15,000 records removed here, primarily because the only call number we had for them was either for interfiled items whose locations had already been removed (i.e., SuDoc, reserves, dissertations) or completely blank. In some cases we had two options for call number, having both an overall call number for the bibliographic record and a different call number from the item record. In these cases we opted to use the call number from the item record, as this would be the call number patrons interact with most often.