Crushing, Blending, and Stretching Data

Data Warehousing and Mining Data from Voyager and Other Library and University Systems for Assessment of Library Operations

  1. Crushing, Blending, and Stretching Data: Data Warehousing and Mining Data from Voyager and Other Library and University Systems for Assessment of Library Operations
     ELUNA Conference 2008, Long Beach, CA, Friday, August 1, 2008
     Ray Schwartz, Systems Specialist Librarian, Cheng Library, William Paterson University, Wayne, New Jersey, USA
     schwartzr2 @ wpunj.edu
  2. Outline
     • Why Assessment and Why Now?
     • What is Data Mining and Data Warehousing and Why Do We Do It?
     • Our Context
     • Groups and Services
     • Steps
     • Reporting
  3. Outline
     • What is Data Mining and Data Warehousing?
     • Our Context
     • Groups and Services
     • Steps
     • Reporting
  4. Have We Always Assessed?
     • Anecdotally: yes.
     • Systematically: not usually.
       – Large-scale assessment of manual systems (such as serials check-in, card catalogs, and circulation files) is not practical.
       – Smaller-scale, directed assessment is possible.
  5. What changed since the days of manual systems?
     • For many institutions in the West, the Integrated Library System has been in use for over 20 years.
     • Larger-scale assessment is now possible with the electronic systems.
  6. [no text on slide]
  7. [no text on slide]
  8. What is different now?
     • New services have come into existence.
       – Inside libraries
         • Full-Text Databases
         • Link Resolvers
       – Outside of libraries
         • Google
         • Amazon
  9. [no text on slide]
  10. What is Data Mining and Data Warehousing?
      • Extracting data from legacy systems and other resources;
      • cleaning, scrubbing and preparing data for decision support;
      • maintaining data in appropriate data stores;
      • accessing and analysing data using a variety of end user tools;
      • and mining data for significant relationships.
      • Chaffey, D., Mayer, R., Johnston, K., & Ellis-Chadwick, F. (2002). Internet Marketing: Strategy, Implementation and Practice (2nd ed.). Financial Times/Prentice Hall.
  11. • The primary purpose of these efforts is to provide easy access to specifically prepared data that can be used with decision support applications such as management reports, queries, decision support systems, executive information systems and data mining.
      • Chaffey, D., Mayer, R., Johnston, K., & Ellis-Chadwick, F. (2002). Internet Marketing: Strategy, Implementation and Practice (2nd ed.). Financial Times/Prentice Hall.
  12. Of course there are many ways to measure: Scott Nicholson’s Measurement Model
  13. Measurement Matrix with methodologies
      Perspective: Internal (Library System)
        Topic: Library System (Procedures and Standards)
          • Staff survey and interviews
          • Audits of collections, systems, or staff
        Topic: Use (Recorded interactions with interface & materials)
          • Bibliomining
          • Transaction/Web Log Analysis
          • Observation of User Behavior
      Perspective: External (User)
        Topic: Library System (Aboutness and Usability)
          • Surveys and interviews
          • Talk-alouds and in-process feedback mechanisms
          • Focus groups
        Topic: Use (Knowledge states and User citations to materials)
          • Surveys and interviews
          • Focus groups
          • User Citation tracking
      Nicholson, Scott (2004). A Conceptual framework for the holistic measurement and cumulative evaluation of library services. Journal of Documentation 60(2), pp. 164-181.
  14. Our Context
  15. Our University
      • 9000 undergraduates
      • 1000 graduates (mostly education majors)
      • 400 faculty
      • 800 adjuncts
      • 1000 staff
  16. Our Library
      • 19 librarians and 26 library staff
      • 350,000 volumes
      • 18,000 audiovisual items
      • 22,000 print and electronic periodicals
      • 100 general and subject specific databases
  17. Our Systems circa 2005
      • Voyager ILS – Cheng Server
      • Online Periodical Database (OPD)
      • Clio ILL Software
      • EZProxy Server – Zeus
      • Banner – University ERP
      • University Networked Drive K:
      • University Email Server
      • University Web Server
  18. Vendor Services
      • Serials Solutions
      • OCLC
      • Blackwell
      • Ebsco
      • Marcive
      • Database Vendors
  19. The Question
      Which categories of patrons are accessing which services?
  20. First Step – Patron Statistical Categories
  21. • Voyager Patron Database allows a maximum of 10 statistical categories per patron record.
      • Decide which statistical categories are needed for each patron group defined.
      • Work with your University Information Systems Department to extract the relevant data from the relevant sources.
  22. Groups and Services
      Groups
      • Major
      • Status
        – Undergrad or Grad
        – Faculty, Adjunct Faculty or Staff
      • Department
      • College
      • Degree
      • No. of Credits
      • Year of Study
      • Campus Location
      Services
      • Circulation
        – Books
        – Media
        – Reserve
        – By Fund Code
        – Location
      • ILL / Document Delivery
      • Databases
      • Library Web Pages
        – Subject Area Resource Guides
        – Reference Requests
      • Catalog
      • Other Vendor Services
        – Serials Solutions
  23. History Department - 12 months - Feb. 2008

      PATRON STATUS           BOOK CIRC  MEDIA CIRC  EQUIP CIRC  TOTAL CIRC  MEMBERS  BORROWERS  % BORROWING  CIRC/MEMBER  CIRC/BORROWER
      UNDERGRADUATE STUDENTS      2,715         250         698       3,663      238        186          78%        15.39          19.69
      GRADUATE STUDENTS             419          13          76         508       14         13          93%        36.29          39.08
      ADJUNCT FACULTY               100          65          20         185       32         20          63%         5.78           9.25
      FULL-TIME FACULTY             159         115         194         468       24         23          96%        19.50          20.35
      HISTORY TOTALS              3,393         443         988       4,824      308        242          79%        15.66          19.93
      LIBRARY TOTALS             23,370       8,713      20,703      52,756    7,418      4,981          67%         7.11          10.59

      DEFINITIONS:
      BOOK CIRCULATION = books, book disks, maps, oversize, Curriculum materials, reserve books, NJ History, Leisure Lounge
      MEDIA CIRCULATION = audio & video materials, including media reserves
      EQUIPMENT CIRCULATION = camcorders, overhead & data projectors, laptops, easels, DVD players, etc.
      MEMBER = declared major or department member
      BORROWER = any member who borrowed materials
      LIBRARY TOTAL = declared undergrad & grad majors, adjuncts & full-time faculty borrowers
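The three derived columns in the History Department table can be recomputed from the raw counts. The sketch below is an illustration only: the formulas are inferred from the column headings and the MEMBER/BORROWER definitions, and the function name `circulation_ratios` is hypothetical. They reproduce the undergraduate and graduate rows as published.

```python
def circulation_ratios(total_circ, members, borrowers):
    """Return (% borrowing, circ per member, circ per borrower).

    Assumed formulas (inferred from the table headings):
      % borrowing   = borrowers / members
      circ/member   = total circulation / members
      circ/borrower = total circulation / borrowers
    """
    pct_borrowing = round(100 * borrowers / members)
    circ_per_member = round(total_circ / members, 2)
    circ_per_borrower = round(total_circ / borrowers, 2)
    return pct_borrowing, circ_per_member, circ_per_borrower


# Raw counts taken from the History Department table.
undergrad = circulation_ratios(total_circ=3663, members=238, borrowers=186)
graduate = circulation_ratios(total_circ=508, members=14, borrowers=13)

print(undergrad)  # (78, 15.39, 19.69)
print(graduate)   # (93, 36.29, 39.08)
```

Comparing a department's ratios with the library-wide row (67% borrowing, 7.11 per member) is what makes the per-group figures useful for assessment.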
  24. Problems with Configuration of Services
      • Little to no linkage of data
      • Need to search multiple services to get a complete picture of serial holdings
      • Multiple user IDs for authentication
  25. Systems Chart – ca. 2005 [diagram; only its labels survive]
      Labels: Cheng Server; www.wpunj.edu; Online Periodicals Database; Serials Form; Perl; ColdFusion; ILL Form; Web Server; ER Page; Micro Form; Web Server; Oracle; Voyager; Materials; Zeus; Circulation; Media Scheduling; Off Campus Dbase Hits; Patrons; Patrons; Searches & ILL Form (EZProxy Log); Banner SIS HRS (University ERP System); University Networked Drive K:; University Email Server; Patrons, Materials, ILL (Cliodata); Serials Solutions A to Z; OCLC WorldCat ILL; Other Vendors' Database Services & Usage Reports.
      Legend: Current Relationships; Internal only accessible WPUNJ Server; Externally accessible WPUNJ Server; Non WPUNJ Server.
  26. Retirement of the OPD
      • Serials holdings data was extracted from the OPD and added to the Voyager catalog.
      • From the Voyager catalog, serials holdings data is extracted and added to the Serials Solutions A to Z list.
  27. Retirement of the OPD cont.
      • Authentication of the ILL form is routed through the EZProxy server.
      • A web bug is placed in the microform request page to record each submission in the Voyager server's web logfile.
  28. Systems Chart – ca. 2005 – Retiring the OPD [diagram; only its labels survive]
      Labels: Cheng Server; www.wpunj.edu; Online Periodicals Database; Serials Form; Perl; ColdFusion; ILL Form; Web Server; ER Page; Micro Form; Web Server; Oracle; Voyager; Materials; Zeus; Circulation; Media Scheduling; Off Campus Dbase Hits; Patrons; Patrons; Searches & ILL Form (EZProxy Log); Banner SIS HRS (University ERP System); University Networked Drive K:; University Email Server; Patrons, Materials, ILL; Serials Solutions A to Z; OCLC WorldCat ILL; Other Vendors' Database Services & Usage Reports.
      Legend: Current Relationships; Internal only accessible WPUNJ Server; Externally accessible WPUNJ Server; Non WPUNJ Server.
  29. New Services Added
      • Serials Solutions MARC Record Service
      • Serials Solutions Link Resolver
      • OCLC WorldCat Collection Analysis
  30. Systems Chart – ca. 2005 – New Services Added [diagram; only its labels survive]
      Labels: Cheng Server; www.wpunj.edu; Serials Form; Perl; ColdFusion; ILL Form; Web Server; ER Page; Micro Form; Web Server; Voyager; Zeus; Circulation; Media Scheduling; Off Campus Dbase Hits; Patrons; Searches & ILL Form (EZProxy Log); Banner SIS HRS (University ERP System); University Networked Drive K:; University Email Server; Patrons, Materials, ILL (Cliodata); Serials Solutions A to Z, MARC Records, Link Resolver; OCLC WorldCat, WCA, ILL; Other Vendors' Database Services & Usage Reports.
      Legend: Current Relationships; Internal only accessible WPUNJ Server; Externally accessible WPUNJ Server; Non WPUNJ Server.
  31. Our Systems in 2008
      • Voyager ILS – Cheng Server
      • Shared Application Server
      • Clio ILL Software
      • EZProxy Server – Zeus
      • Banner – University ERP
      • University Networked Drive K:
      • University Email Server
      • University Web Server
  32. Systems Chart – 2008 [diagram; only its labels survive]
      Labels: Cheng Server; Application Server; www.wpunj.edu; Serials Form; Perl; ColdFusion; ILL Form; ColdFusion; Web Server; ER Page; Micro Form; Web Server; Voyager; Web Server; Zeus; Circulation; Media Scheduling; DBMS; Off Campus Dbase Hits; Patrons; Searches & ILL Form (EZProxy Log); OffCampus Dbase Usage by Patron Groups; ILL Patrons/Materials Requested; ILL Patrons/Materials Received; Banner SIS HRS (University ERP System); University Networked Drive K:; University Email Server; Patrons, Materials, ILL (Cliodata); Serials Solutions A to Z, MARC Records, Link Resolver; OCLC WorldCat, WCA, ILL; Other Vendors' Database Services & Usage Reports.
      Legend: Current Relationships; Internal only accessible WPUNJ Server; Externally accessible WPUNJ Server; Non WPUNJ Server.
  33. Second Step – Set Up an Application Server
  34. What is an Application Server?
      • A machine or its software that works in conjunction with a web server to deliver application services such as the dynamic creation of a webpage from content stored in a database. From http://www.webtools.ca.gov/help/Glossary.asp
      • Web Server Software (Apache or IIS)
      • Database Management System, DBMS (MySQL, Oracle, MS SQL Server)
      • Scripting Language (Perl, PHP, ColdFusion, ASP)
  35. Why an Application Server?
      • Relevant data in logfiles needs to be in a database to be analyzed.
      • You need your own DBMS to create new tables and queries.
  36. • Decide how you will use the Application Server.
      • Decide on the best and most plausible configuration.
  37. The Projects
      • Mining EZProxy logfiles and linking to patron statistical categories from the Voyager Patron Database
        – What majors and departments are accessing which database services?
        – What majors and departments are accessing the ILL services?
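The core of both projects is a join between proxy log rows and patron statistical categories, followed by a count per group and service. The following is a minimal Python sketch of that idea, not the scripts used at the library: the function names, the `(user, url)` row shape, and the `patron_groups` lookup are all assumptions made for illustration. It assumes the target resource appears in the `url=` query parameter of the logged EZProxy request.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs


def target_host(logged_url):
    """Return the host of the resource named in the request's 'url=' parameter.

    Falls back to the host of the logged URL itself when no 'url=' is present.
    """
    query = parse_qs(urlsplit(logged_url).query)
    target = query.get("url", [logged_url])[0]
    return urlsplit(target).netloc


def hits_by_group(log_rows, patron_groups):
    """Count hits per (patron group, target host).

    log_rows:      iterable of (user, logged_url) pairs (hypothetical shape)
    patron_groups: dict mapping user -> statistical category, e.g. "M- Nursing"
    """
    counts = Counter()
    for user, url in log_rows:
        group = patron_groups.get(user)
        if group is None:
            continue  # user not found in the patron data; skip the row
        counts[(group, target_host(url))] += 1
    return counts


# Hypothetical users; the URL shapes follow the examples in this deck.
rows = [
    ("user1", "http://ezproxy.wpunj.edu:2048/login?url=http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=c8h"),
    ("user1", "http://ezproxy.wpunj.edu:2048/login?url=http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=c8h"),
    ("user2", "http://ezproxy.wpunj.edu:2048/connect?session=abc&url=http://www.lexis-nexis.com/universe"),
]
groups = {"user1": "M- Nursing", "user2": "M- Business"}

for (group, host), n in hits_by_group(rows, groups).most_common():
    print(group, n, host)
```

The sketch reports full host names; collapsing them to a vendor domain such as ebscohost.com, as in the report slides, would be a further step.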
  38. Systems Chart – 2008 [diagram; only its labels survive]
      Labels: Integrated Library System; Application Server; www.wpunj.edu; Serials Form; Scripting Language; Scripting Language; Scripting Language; ILL Form; Web Server; ER Page; Micro Form; Web Server; Voyager; Web Server; Proxy Server; Circulation; Media Scheduling; DBMS; Off Campus Dbase Hits; Patrons; Searches & ILL Form (EZProxy Log); OffCampus Dbase Usage by Patron Groups; ILL Patrons/Materials Requested; ILL Patrons/Materials Received; Banner SIS HRS (University ERP System); University Networked Drive K:; University Email Server; Patrons, Materials, ILL (Cliodata); Serials Solutions A to Z, MARC Records, Link Resolver; OCLC WorldCat, WCA, ILL; Other Vendors' Database Services & Usage Reports; ILL Collection and Patron Group Analyses; Off Campus Database Hits by Patron Group.
      Legend: Current Relationships; Internal only accessible WPUNJ Server; Externally accessible WPUNJ Server; Non WPUNJ Server.
  39. ILL request form authentications by major – Academic year 07/08

      Article                                Book
      Count  Major                           Count  Major
         62  M- Psychology                      90  M- History
         60  M- Sociology                       28  M- Non-Degree
         42  M- Applied Clinical Psych          25  M- Pub Pol & Intl Affairs
         35  M- Education                       20  M- Spanish
         31  M- History                         18  M- English
         30  M- Spanish                         16  M- Undecided
         29  M- Nursing                         14  M- Art
         19  M- Communication Disorders         14  M- Education
         19  M- Communication                   11  M- Sociology
         14  M- Biotechnology                   10  M- Biology
         14  M- Counseling                       9  M- Music
         14  M- English                          9  M- Special Programs
         12  M- Non-Degree                       8  M- Psychology
         10  M- Community/Sch Health             7  M- Biotechnology
          7  M- Biology                          7  M- Political Science
          7  M- Political Science                6  M- Anthropology
          6  M- Undecided                        6  M- Music - Jazz Studies
          5  M- Comm Media Studies               4  M- Business
          5  M- Reading                          4  M- Communication
          4  M- Business                         4  M- Nursing
  40. Which Databases are accessed by Majors and Departments? (07/29/08)
  41. By Major and Host

      Major                       Count  Host
      M- Nursing                   3377  ebscohost.com
      M- Non-Degree                3010  ebscohost.com
      M- Psychology                2303  ebscohost.com
      M- Counseling                1487  ebscohost.com
      M- Communication             1359  ebscohost.com
      M- Education                 1267  ebscohost.com
      M- Business                  1246  proquest.umi.com
      M- Sociology                 1152  ebscohost.com
      M- Business                  1145  lexis-nexis.com
      M- Undecided                 1100  ebscohost.com
      M- Applied Clinical Psych    1075  ebscohost.com
      M- English                   1034  ebscohost.com
      M- Sociology                  916  csa.com
      M- Business                   794  ebscohost.com
      M- Accounting                 738  lexis-nexis.com
      M- Reading                    683  ebscohost.com
      M- Physical Education         653  ebscohost.com
      M- Special Programs           600  ebscohost.com
      M- Non-Degree                 463  ereserve.wpunj.edu
  42. By Dept and Host

      Department               Count  Host
      S- Information Systems     933  webscript.exe?fs.scr
      S- Psychology Dept.        742  ebscohost.com
      S- Accounting and Law      559  lexis-nexis.com
      S- Political Sci Dept.     308  lexis-nexis.com
      S- Nursing Dept.           204  ebscohost.com
      S- Market & Mgt. Dept.     175  proquest.umi.com
      S- Library                 167  ebscohost.com
      S- Sociology Dept.         151  ebscohost.com
      S- Sociology Dept.         134  csa.com
      S- History Dept.           121  serials.abc-clio.com
      S- Exercise & Mov Sci      110  ebscohost.com
      S- Political Sci Dept.     104  ebscohost.com
      S- Library                 103  ILL_article.cfm
      S- Library                 100  webscript.exe?fs.scr
      S- History Dept.            94  webscript.exe?fs.scr
  43. By Dept and Service

      Department               Count  Service
      S- Information Systems     933  http://www.wpunj.edu/scripts/webscript.exe?fs.scr
      S- Accounting and Law      549  http://www.lexis-nexis.com/universe
      S- Psychology Dept.        364  http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=psych
      S- Nursing Dept.           114  http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=c8h
      S- Sociology Dept.          96  http://www.csa.com/htbin/dbrng.cgi?&db=socioabs-set-c&adv=1
      S- Sociology Dept.          75  http://search.ebscohost.com/login.asp?profile=asp
      S- Philosophy Dept.         74  http://webspirs4.silverplatter.com:8900/c119646?sp.form.first.p=srchmain.htm&sp.dbid.p=S(PHIL
      S- Library                  65  http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=asp
      S- Anthropology Dept.       62  http://www.sciencedirect.com/
      S- History Dept.            61  http://serials.abc-clio.com/active/start?_appname=serials&initialdb=AHL
      S- Psychology Dept.         61  http://search.ebscohost.com/login.asp?profile=psyart
      S- History Dept.            58  http://serials.abc-clio.com/active/start?_appname=serials&initialdb=HA
      S- Psychology Dept.         54  http://search.ebscohost.com/login.asp?profile=psych
      S- Psychology Dept.         42  http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=psyart
      S- English Dept.            42  http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=mzh
  44. Some concerns
      Patron Privacy and Standards
  45. Using Voyager as the model for Patron Privacy
  46. • Active Circ transactions are stored in a table with patron ID and statistical categories.
      • Completed Circ transactions are stored in a table without the patron ID, but still with the patron statistical categories.
      • The Patron Table contains the total counts of transactions for each patron, but no link to which transactions they are.
  47. • EZProxy transactions would be stored in one table with patron statistical categories, but without the user ID.
      • User IDs would be stored in another table with counts for each service divided by academic year.
      • Logs are collected monthly and loaded and deleted monthly.
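The two-table design described above can be sketched with Python's built-in `sqlite3` module. This is an illustration under stated assumptions, not the production schema: the table and column names are hypothetical, SQLite stands in for the MySQL DBMS on the application server, and the academic year is assumed to start on 1 July. The point it shows is that the transaction table never receives the user ID, while the per-user table holds only counts.

```python
import sqlite3


def create_schema(conn):
    """Create the two tables (hypothetical names) that keep IDs and transactions apart."""
    conn.execute(
        "CREATE TABLE proxy_transactions ("
        " stat_category TEXT, service TEXT, hit_date TEXT)"  # no user ID column
    )
    conn.execute(
        "CREATE TABLE user_service_counts ("
        " user_id TEXT, service TEXT, academic_year TEXT, hits INTEGER,"
        " PRIMARY KEY (user_id, service, academic_year))"
    )


def academic_year(iso_date):
    """Label a YYYY-MM-DD date with its academic year (assumed to start 1 July)."""
    year, month = int(iso_date[:4]), int(iso_date[5:7])
    start = year if month >= 7 else year - 1
    return f"{start}/{start + 1}"


def load_transaction(conn, user_id, stat_category, service, iso_date):
    """Store one log entry: the detail row without the user ID, plus a per-user count."""
    conn.execute(
        "INSERT INTO proxy_transactions (stat_category, service, hit_date)"
        " VALUES (?, ?, ?)",
        (stat_category, service, iso_date),
    )
    key = (user_id, service, academic_year(iso_date))
    cur = conn.execute(
        "UPDATE user_service_counts SET hits = hits + 1"
        " WHERE user_id = ? AND service = ? AND academic_year = ?",
        key,
    )
    if cur.rowcount == 0:  # first hit for this user, service, and year
        conn.execute(
            "INSERT INTO user_service_counts VALUES (?, ?, ?, 1)", key
        )


conn = sqlite3.connect(":memory:")
create_schema(conn)
load_transaction(conn, "theuser", "M- History", "ebscohost.com", "2008-01-01")
load_transaction(conn, "theuser", "M- History", "ebscohost.com", "2008-02-15")
print(conn.execute("SELECT * FROM user_service_counts").fetchall())
```

Once a month's log has been loaded this way, the raw logfile can be deleted, which matches the monthly load-and-delete cycle on the slide.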
  48. Example of EZProxy log entry
      • IP address: nj.dhcp.embarqhsd.net
      • (Not used): -
      • User ID: theuser
      • Date/time: 1/1/2008 4:25:15 AM
      • Method: GET
      • Page retrieved: http://ezproxy.wpunj.edu:2048/connect?session=sGHMbeSss121YxZa&url=http://www.wpunj.edu/scripts/webscript.exe?fs.scr
      • Version: HTTP/1.1
      • Response code: 302
      • No. of bytes: 537
      • Referring URL: http://ezproxy.wpunj.edu:2048/login?url=http://www.wpunj.edu/scripts/webscript.exe?fs.scr
      • User agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
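  49. Perl Script for loading ezproxy log into MySQL
      use strict;
      use warnings;

      # Month abbreviation => two-digit number (quoted so the leading zero is kept).
      my %month = (Jan=>'01', Feb=>'02', Mar=>'03', Apr=>'04', May=>'05', Jun=>'06',
                   Jul=>'07', Aug=>'08', Sep=>'09', Oct=>'10', Nov=>'11', Dec=>'12');

      while (<>) {
          # Four leading fields, [dd/Mon/yyyy:hh:mm:ss zone], "method target version",
          # response code, bytes, "referrer", "user agent".
          my $pattern = '^(\S*) (\S*) (\S*) (\S*) ' .
                        '\[(..)/(...)/(....):(..):(..):(..) .....\] ' .
                        '"(\S*) (\S*) (\S*)" ' .
                        '(\d*) (-|\d*) "([^"]*)" "([^"]*)"';
          if (m/$pattern/) {
              my ($tgt, $ref, $agt) = (esc($12), esc($16), esc($17));
              my $byt = $15 eq '-' ? 'NULL' : $15;   # "-" means no byte count was logged
              print "INSERT INTO ezproxylogs VALUES ('$1','$2','$3'," .
                    " TIMESTAMP '$7/$month{$6}/$5 $8:$9:$10','$11','$tgt'," .
                    "'$13',$14,$byt,'$ref','$agt');\n";
          } else {
              print "--Skipped line $.\n";
          }
      }

      # Double any single quotes so the value is safe inside a SQL string literal.
      sub esc {
          my ($p) = @_;
          $p =~ s/'/''/g;
          return $p;
      }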
  50. Created table to assist the linking
      SELECT PATRON_ADDRESS.ADDRESS_TYPE,
             Left([ADDRESS_LINE1], InStr([ADDRESS_LINE1], "@") - 1) AS usr,
             PATRON_ADDRESS.PATRON_ID,
             PATRON_ADDRESS.ADDRESS_STATUS,
             PATRON_ADDRESS.EFFECT_DATE,
             PATRON_ADDRESS.EXPIRE_DATE,
             PATRON_ADDRESS.MODIFY_DATE,
             PATRON_ADDRESS.MODIFY_OPERATOR_ID
      INTO emailprefix
      FROM PATRON_ADDRESS
      WHERE (((PATRON_ADDRESS.ADDRESS_TYPE) = "3"));
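The query above builds a lookup from the part of each patron's email address before the "@" to the patron ID, so that EZProxy user IDs can be matched to Voyager patrons. The same idea in Python, as a hedged sketch: the row dictionaries and function names are hypothetical, the lower-casing is an added assumption, and, unlike the `Left`/`InStr` expression, the sketch skips addresses that contain no "@" instead of failing on them.

```python
def email_prefix(address):
    """Return the part of an email address before '@', or None if there is no '@'."""
    local, sep, _domain = address.partition("@")
    return local if sep and local else None


def build_prefix_index(patron_addresses):
    """Map email prefix -> patron ID for address type "3" rows (the email rows in the query)."""
    index = {}
    for row in patron_addresses:
        if row["address_type"] != "3":
            continue  # only the address type used in the WHERE clause
        prefix = email_prefix(row["address_line1"])
        if prefix is not None:
            index[prefix.lower()] = row["patron_id"]  # lower-casing is an assumption
    return index


# Hypothetical rows shaped like PATRON_ADDRESS records.
addresses = [
    {"address_type": "3", "address_line1": "TheUser@student.example.edu", "patron_id": 101},
    {"address_type": "1", "address_line1": "1 Example Street", "patron_id": 101},
    {"address_type": "3", "address_line1": "no-at-sign", "patron_id": 102},
]
index = build_prefix_index(addresses)
print(index.get("theuser"))
```

A proxy log row whose user ID is found in the index can then be tagged with that patron's statistical categories before the ID is dropped.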
  51. The question of standards
      Need standards to share data for comparative research
  52. Types of Reporting
      • Email Reports
      • Periodic - e.g., Daily Dossiers
      • Event Triggered
      • On Demand
      • Email, web or print
      • Use by Dept/Major
      • Use by Fund Code Purchases
  53. Questions?
      Ray Schwartz, Systems Specialist Librarian, Cheng Library, William Paterson University, Wayne, New Jersey, USA
      schwartzr2 @ wpunj.edu