  • Notes: This presentation looks at the Penn Library Data Farm project, a six-year effort to think about and build systems that support the decision-making and assessment needs of a large research library. There are two parts to the talk. The first describes the current state of affairs with the Data Farm: after several years we’ve learned what works well and which challenges are particularly tenacious. That leads into the second part: how can we apply the lessons learned, and how would we proceed if we were beginning anew? Independently of this outline (where are we and where might we go) there is an overarching theme: the resources needed to build decision-support frameworks, to support assessment and intelligence gathering, if you will, are present in the architecture of our digital systems and in the profusion of data and data structures that make up, surround, and connect to these systems. Data structures, data manipulations (in the engineering sense rather than the statistical), and our efforts to exploit them are the underlying themes of this talk.
  • A report like this one on Penn’s Biology fund shows color codes that represent the grandparent, parent and child funds, the title-level purchases, publisher, splits among other funds, various dollar amounts, and relevant dates. Here the output represents a third data structure in the chain and, I hasten to note, a very processed but still raw data file. If the consumer of this report is interested in producing a time series or something more complicated, that capability is available in Excel, and she’s free to ask a number of different questions… If she discovers something later that she didn’t know she wanted to know at the start, the data structure will, hopefully, be amenable to further manipulation; if not in Excel, then maybe at the Data Farm level…
  • As in this case, where selectors came to us after the report builder was designed and asked to be able to pivot funds by publisher. This required a small alteration to the SQL in the report script and some modifications to the formatting code for Excel.
  • Re-structure it in Data Farm as a simpler table that can reproduce the fund hierarchies resident in Voyager, but in a comprehensive way, unlike Voyager’s own fund reporting clients. The result is
  • I began by saying this talk is about experiments we’ve been doing to repurpose the data that live around us, that Data Farm is in the first place about finding and leveraging the data structures that live around us. As I mentioned, these can live in logs or other highly structured text files, or they can be databases. These data structures should command attention before we consider what statistics the organization needs, at least from the perspective of developing intelligence frameworks. So let me drill in a little closer and look at just one example of how this re-structuring is operationalized in Data Farm, because it raises lots of issues for such projects, issues that are both tactical (that is, how work is accomplished) and strategic (why we do this in the manner we do). Here are the relationships in the Voyager database that are necessary to track fund expenditures for books, serials and e-resources. We can envision the data structure as a diagram or as an SQL query. We use this set of relationships to harvest specific data from Voyager and…
  • Here’s the high-altitude plan of Data Farm. We begin in the environment where service occurs and identify lots of potential data sources. They’re the systems behind our catalogs, circulation services, and fund accounting (systems like Voyager or III); they include our web service logs, link resolvers, and ERMs (ERED is a home-grown ERM and web page delivery application); they include third-party systems that might operate at the consortial level (that’s what Sirsi-Dynix is to us); and they include the human-driven services of research and instruction. In the Data Farm setting, we harvest lots of data from this service sphere, as well as information about people and networks in our community. Processes in the management information sphere (the Data Farm Environment) capture, clean, normalize and anonymize these sources and feed them into a central repository, which for us is a relational database. We then build various kinds of tools on top of the repository to enable staff to acquire and interact with data. The analytical work of staff and the decisions they make feed back into the service environment through planning, collection development, staff deployment and other processes.
  • Just to try to represent the scale of Data Farm…
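The fund-report notes above describe two manipulations: reproducing a grandparent/parent/child fund hierarchy from a simple table, and pivoting fund spending by publisher with a small change to the report SQL. The sketch below is only an illustration of that idea. The `fund` and `purchase` tables, their columns, and the sample rows are hypothetical; they are not the Voyager schema or the actual Data Farm tables, and SQLite stands in for the Oracle repository.

```python
import sqlite3

# Hypothetical, simplified tables; NOT the Voyager or Data Farm schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fund (fund_id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER);
CREATE TABLE purchase (title TEXT, publisher TEXT, fund_id INTEGER, amount REAL);
INSERT INTO fund VALUES
  (1, 'Sciences', NULL),        -- grandparent
  (2, 'Biology', 1),            -- parent
  (3, 'Biology Serials', 2);    -- child
INSERT INTO purchase VALUES
  ('Title A', 'Publisher X', 2, 100.0),
  ('Title B', 'Publisher X', 3, 250.0),
  ('Title C', 'Publisher Y', 3, 75.0);
""")


def spend_by_publisher(conn, root_fund):
    """Total spend per publisher for a fund and all of its descendant funds."""
    sql = """
    WITH RECURSIVE tree(fund_id) AS (
        SELECT fund_id FROM fund WHERE name = ?
        UNION ALL
        SELECT f.fund_id FROM fund f JOIN tree t ON f.parent_id = t.fund_id
    )
    SELECT p.publisher, SUM(p.amount)
    FROM purchase p JOIN tree t ON p.fund_id = t.fund_id
    GROUP BY p.publisher
    ORDER BY p.publisher
    """
    return conn.execute(sql, (root_fund,)).fetchall()


# Rolls the child fund up into its parent before grouping by publisher.
report = spend_by_publisher(conn, "Biology")
```

Keeping the hierarchy as a single self-referencing table means a new pivot (by publisher, vendor, or date) only changes the final `SELECT`, which matches the "small alteration to the SQL" described in the notes.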
  • Transcript

    • 1. Building Frameworks of Organizational Intelligence. Library Assessment Conference, 2008, Seattle. Joe Zucca, Director for Planning and Communication, University of Pennsylvania Libraries
    • 2. Sample Data Structure I: Data Farm Funds Report - Biology
    • 3. Sample Data Structure I: Data Farm Funds Report - Casalini Libri
    • 4. Sample Data Structure II - Data Farm Table - Funds Process
    • 5. Sample Data Structure III - Voyager Funds Schema: invoice_line_item, invoice_status, purchase_order, line_item, invoice_line_item_funds, fund, invoice, vendor, bib_master, bib_text
    • 6. Leveraging data structures: Information Loop. Data Farm Environment: Clean | Anonymize | Normalize; Data Streams; Integrate; Inform; Report Builder; Dashboard; Data Bureau. Service Environment: Catalog; Funds; Circulation; ERED; DYNIX; BorrowDirect; SFX | ERM?; WEB (Apache, ezproxy); People and Network Data; Ref|Instruct
    • 7. Data Farm Oracle Space Overview: 30+ GB, in 162 tables, for collecting and disseminating management info. Diagram labels: Reference Contact; Circulation; Acquisitions; Funds; Holdings; Copier | Printer Use; Gate Swipes; Web Analytics; Image Collection Use; Tech Processing Workflow; Consortia ILL/DocDel; 70-Member ILL Coop.; E-Resource Use; Resolve Resource; Resolve People; Resolve Places; Staff Census; LDAP; Digital Library; Voyager-Supported Services; Building Use; Resolution Services; Administration; Reference & Instruction (dynamic)
    • 8. Data Farm Silos. Service Environment: WEB (Apache, ezproxy); Ref|Instruct; funds; Catalog; Funds; Circulation; ERED; DYNIX; BorrowDirect; SFX | ERM?
    • 9. A different take on a management information framework
    • 10. Sample Data Structure: EZ-Proxy Log
      xxx.xx.xxx.xxx|-|zucca|[26/Jul/2007:15:41:01 -0500]|GET https://proxy.library.upenn.edu:443/login?proxySessionID=10335905&url=http://www.csa.com/htbin/dbrng.cgi?username=upenn3&access=upenn34&cat=psycinfo&adv=1 HTTP/1.1|302|0|http://www.library.upenn.edu/cgi-bin/res/sr.cgi?community=59|Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/418.9.1 (KHTML, like Gecko) Safari/419.3|NGpmb6dT6JXswQH|__utmc=94565761; ezproxy=NGpmb6dT6JXswQH; hp=/; proxySessionID=10335514; __utmc=247612227; __utmz=247612227.1184251774.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none); UPennLibrary=AAAAAUaWP5oAACa4AwOOAg==; sfx_session_id=s6A37A3E0-3B8E-11DC-80E9-85076F88F67F
    • 11. Sample Data Structure: EZ-Proxy Log (same log entry as slide 10). xxx.xx.xxx.xxx = .edu | on-campus | Van Pelt Library | staff office. Timestamp = July 26, 2007, 3:41 p.m. Device = Mac | OS X. Browser = Safari.
    • 12. Sample Data Structure: EZ-Proxy Log (same log entry as slide 10). Referring URL = library web page listing resources on Psychology (community 59) | Olson, Coordinating Bibliographer.
    • 13. Sample Data Structure: EZ-Proxy Log (same log entry as slide 10). zucca [Penn authentication key] = Provost cntr | LIBRARY | staff.
    • 14. Sample Data Structure: EZ-Proxy Log (same log entry as slide 10). proxySessionID 10335905 = PsycInfo (resource identifier 7014).
    • 15. Sample Data Structure: EZ-Proxy Log (same log entry as slide 10). sfx_session_id = Journal of Experimental Child Psychology, “Relations Among Musical Skills, Phonological Processing and Early Reading Ability in Preschool Children.”
    • 16. Sample Data Structure: EZ-Proxy Log (same log entry as slide 10). TRANSLATION:
      • Made an OpenURL connection to an article on reading aptitude published in the Journal of Experimental Child Psychology.
      • Logged into PsycInfo (CSA)
      • Using a staff computer (Mac | OS X | Safari) located in Van Pelt Library
      • A library staff member (Provost’s center affiliate)
      • July 26, 2007 at 3:41 pm
      and in the same session,
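The EZ-Proxy slides walk through one pipe-delimited log entry and show what each field resolves to. A minimal sketch of the first step, splitting the entry into named fields and pulling out the identifiers the slides highlight, might look like the following. The field names in `FIELDS` and the `parse_line` helper are assumptions made for illustration; the slides do not show the proxy's log-format configuration or the Data Farm's own parsing code, which the talk describes as Perl. The sample line is the one from the slides with the line-wrap spaces removed.

```python
from datetime import datetime
from urllib.parse import parse_qs, urlsplit

# Log entry from the slides, rejoined without the slide's line-wrap spaces.
SAMPLE = (
    "xxx.xx.xxx.xxx|-|zucca|[26/Jul/2007:15:41:01 -0500]|"
    "GET https://proxy.library.upenn.edu:443/login?proxySessionID=10335905&url="
    "http://www.csa.com/htbin/dbrng.cgi?username=upenn3&access=upenn34"
    "&cat=psycinfo&adv=1 HTTP/1.1|"
    "302|0|http://www.library.upenn.edu/cgi-bin/res/sr.cgi?community=59|"
    "Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) AppleWebKit/418.9.1 "
    "(KHTML, like Gecko) Safari/419.3|"
    "NGpmb6dT6JXswQH|__utmc=94565761; ezproxy=NGpmb6dT6JXswQH; hp=/; "
    "proxySessionID=10335514; __utmc=247612227; "
    "__utmz=247612227.1184251774.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none); "
    "UPennLibrary=AAAAAUaWP5oAACa4AwOOAg==; "
    "sfx_session_id=s6A37A3E0-3B8E-11DC-80E9-85076F88F67F"
)

# Assumed field order, inferred from the annotations on the slides.
FIELDS = ["ip", "ident", "user", "timestamp", "request", "status",
          "bytes", "referrer", "user_agent", "session", "cookies"]


def parse_line(line):
    """Split one log entry into named fields and extract key identifiers."""
    # The cookie field itself contains '|', so cap the number of splits.
    parts = [p.strip() for p in line.split("|", len(FIELDS) - 1)]
    rec = dict(zip(FIELDS, parts))
    rec["timestamp"] = datetime.strptime(
        rec["timestamp"].strip("[]"), "%d/%b/%Y:%H:%M:%S %z")
    _method, url, _protocol = rec["request"].split(" ")
    rec["proxy_session_id"] = parse_qs(urlsplit(url).query).get(
        "proxySessionID", [None])[0]
    rec["community"] = parse_qs(urlsplit(rec["referrer"]).query).get(
        "community", [None])[0]
    cookies = {}
    for pair in rec["cookies"].split(";"):
        key, _, value = pair.strip().partition("=")
        cookies[key] = value
    rec["sfx_session_id"] = cookies.get("sfx_session_id")
    return rec


event = parse_line(SAMPLE)
```

Resolving these identifiers (IP to location, user to affiliation, session to resource, SFX session to article) would then be lookups against other tables, which is the step the slides call "resolve".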
    • 17. Events: a Structure for Library Decision Metrics. EVENT facets: Library Program (Org|Program, Staff, Service Genre); Environment (Campus Location, Date|Time, IP-Domain, URL); Budget (Restricted, Unrestricted, Expnd Amt, Fund Code); Client (Dept, School, Rank, Major); Content (Bibliographic); Client Program (School Reqmnt, Dept, Dept Level, Course Descriptor)
    • 18. Data Farm Silos. Service Environment: WEB (Apache, ezproxy); Ref|Instruct; funds; Catalog; Funds; Circulation; ERED; DYNIX; BorrowDirect; SFX | ERM?
    • 19. Data Farm Tiered Framework. Diagram labels: DATA STORE; query (sql); parse, resolve, anonymize, secure!; oracle; XML; Repository processes; Analysis; perl; XL. Service Environment: Catalog; Funds; Circulation; ERED; DYNIX; BorrowDirect; SFX | ERM?; WEB (Apache, ezproxy); Ref|Instruct
    • 20. Event Elements Articulated in an XML Schema
      • <ENVIRONMENTAL>
        • <v.domain>
        • <p.domain>
        • <date>
        • <time>
        • <session>
        • <url>
      • </ENVIRONMENTAL>
      • <CLIENT>
        • <school>
        • <dept>
        • <rank>
      • </CLIENT>
      • <LIBRARY PRGRM>
        • <service genre>
        • <staff_name>
        • <staff_org>
        • <staff_prgm>
      • </LIBRARY PRGRM>
      • <CONTENT>
        • <title>
        • <author>
        • <holdings>
        • <call_no>
        • <isbn>
        • <issn>
        • <url>
        • <res_id>
        • <sfx_id>
      • </CONTENT>
      METRIDOC EVENT diagram: Library Program, Environment, Client, Content, Client Program, Budget
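To make the event schema concrete, here is a small sketch that serializes one event, using values taken from the EZ-Proxy example on the earlier slides, into XML with the section and element names listed above. Several things are assumptions: the `EVENT` root element, the `build_event` helper, and the mapping of the slide's "LIBRARY" and "staff" annotations onto `<dept>` and `<rank>`. The `<LIBRARY PRGRM>` section is left out because its name, as written on the slide, contains a space and is not a valid XML element name.

```python
import xml.etree.ElementTree as ET


def build_event(environmental, client, content):
    """Assemble one event as XML; the 'EVENT' root name is an assumption."""
    root = ET.Element("EVENT")
    sections = (
        ("ENVIRONMENTAL", environmental),
        ("CLIENT", client),
        ("CONTENT", content),
    )
    for section_name, fields in sections:
        section = ET.SubElement(root, section_name)
        for tag, value in fields.items():
            ET.SubElement(section, tag).text = str(value)
    return root


# Values come from the log entry annotated on the EZ-Proxy slides.
event = build_event(
    environmental={"date": "2007-07-26", "time": "15:41:01", "session": "10335905"},
    client={"dept": "LIBRARY", "rank": "staff"},
    content={
        "title": "Journal of Experimental Child Psychology",
        "res_id": 7014,
        "sfx_id": "s6A37A3E0-3B8E-11DC-80E9-85076F88F67F",
    },
)
xml_text = ET.tostring(event, encoding="unicode")
```

A document of this shape is what a later step could map into relational tables, one table per section or per user-defined schema.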
    • 21. Event Elements Articulated in an XML Schema (same element list as slide 20). METRIDOC EVENT, Environment: Campus Location, Date|Time, IP-Domain, URL
    • 22. Event Elements Articulated in an XML Schema (same element list as slide 20). METRIDOC EVENT, Client: Dept, School, Rank, Major
    • 23. Event Elements Articulated in an XML Schema (same element list as slide 20). METRIDOC EVENT, Library Program: Org|Program, Staff, Service Genre
    • 24. Event Elements Articulated in an XML Schema (same element list as slide 20). METRIDOC EVENT, Content: Bibliographic
    • 25. Data Farm Multi-Tiered Architecture (Metridoc Setting). Tier I, Data Ingest: logs & other data sources; raw data ingest and handoff for resolving into Metridoc. Tier II, Repository: Resolver (Voyager | People Data | ERM) issues data as event-level metridoc XML; Schema Repository; Admin Interface, where admin users create metridoc schema, specifying structures for raw data sources; SQL Generator spawns tables following user-defined schema; like a prism, the SQL Generator parses metridoc info into relational structures within the Data Repository (Oracle). Tier III, MIS Tools & Services: Client; RSS; XLS; XML. Other labels: Process Data; Soap Connection.
    • 26.  
    • 27.  
    • 28. Use of Electronic Journals, Subject Correlations (columns follow the same subject order as the rows)
                    AGRI   ARTS    BUS   CHEM  EARTH   ENGR  ENVIR   HLTH  INFOS  INFOT    LAW   LIFE  MATRL   MATH   PHYS    SOC   TELE
      AGRI SCI         1
      ARTS & HUM  -0.115      1
      BUS ECON    -0.142  0.148      1
      CHEM         0.753 -0.044 -0.079      1
      EARTH SCI    0.591 -0.022  0.015  0.637      1
      ENGNRNG      0.478 -0.084  0.001  0.791  0.477      1
      ENVIRO SCI   0.773  0.043 -0.040  0.961  0.766  0.663      1
      HEALTH SCI   0.484  0.041 -0.036  0.760  0.092  0.521  0.681      1
      INFO SCI    -0.164  0.899  0.173 -0.135 -0.002 -0.176 -0.003  0.034      1
      INFO TECH    0.189  0.199  0.388  0.443  0.452  0.810  0.382  0.142  0.156      1
      LAW         -0.090  0.607  0.425 -0.058  0.092 -0.090  0.098  0.042  0.732  0.231      1
      LIFE SCI     0.711  0.001 -0.079  0.952  0.401  0.697  0.885  0.872 -0.105  0.300 -0.020      1
      MATR'L SCI   0.338 -0.200 -0.077  0.521  0.692  0.799  0.474  0.037 -0.213  0.773 -0.175  0.281      1
      MATH SCI     0.717  0.025 -0.015  0.980  0.670  0.858  0.932  0.676 -0.077  0.589 -0.030  0.906  0.617      1
      PHYS         0.353 -0.172 -0.050  0.518  0.710  0.797  0.473  0.000 -0.214  0.790 -0.174  0.272  0.988  0.624      1
      SOC SCI     -0.060  0.720  0.474  0.044  0.038 -0.024  0.170  0.231  0.804  0.265  0.954  0.111 -0.219  0.070 -0.215      1
      TELECOM      0.060 -0.121  0.035  0.278  0.238  0.796  0.129  0.000 -0.182  0.848 -0.148  0.132  0.835  0.408  0.839 -0.159      1
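The slide reports pairwise correlations between subject areas but does not say which coefficient was used or what the unit of observation was. As a rough illustration only, the sketch below assumes Pearson's r computed over per-group usage counts; the `usage` numbers are invented for the example and are not Penn's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical counts of e-journal uses per user group, one list per subject.
usage = {
    "CHEM": [12, 0, 7, 3, 9],
    "LIFE SCI": [10, 1, 8, 2, 9],
    "LAW": [0, 6, 1, 5, 0],
}

labels = list(usage)
matrix = {(a, b): pearson(usage[a], usage[b]) for a in labels for b in labels}
```

With real data, each list would hold one count per observation unit derived from the proxy logs, and the resulting matrix could feed a scaling or clustering step.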
    • 29. Penn’s Use of Electronic Journals by Subject: Multi-Dimensional Scaling Model
      [Figure: Multidimensional Scale of E-Resource Use by Subject, All Users. Axis labels as given: Dimension II, “Humanistic-Science”; Dimension II, “Qualitative-Quantitative”.]
      Source data: Ezproxy logs, June-August 2007 and February-March 2008. 215,000 observations total
    • 30. Penn’s Use of Electronic Journals by Faculty: Multi-Dimensional Scaling Model
      [Figure: Multidimensional Scale of E-Resource Use by Faculty Grouped by School. Axis labels as given: Dimension II, “Humanistic-Science”; Dimension II, “Qualitative-Quantitative”.]
      Source data: Ezproxy logs, June-August 2007 and February-March 2008. 215,000 observations total
    • 31. Penn’s Use of Electronic Journals Example of text mining on article keywords