The document discusses balancing library strategic needs for data-driven insights with patron privacy. It outlines principles of data collection, policies on confidentiality, and methods used by the Seattle Public Library to de-identify patron data, including removing personally identifiable information (PII) and truncating identifiers. The library uses techniques such as aggregating data, replacing timestamps with dates, and truncating call numbers when extracting, transforming, and loading data into a warehouse to enable analysis while protecting patron privacy.
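The de-identification techniques described above (removing PII fields, truncating identifiers, replacing timestamps with dates, truncating call numbers) can be sketched as a single transform step in an ETL pipeline. This is an illustrative sketch only, not the Seattle Public Library's actual code; all field names and truncation lengths here are assumptions.

```python
from datetime import datetime

def deidentify_record(record):
    """Illustrative de-identification transform for one patron activity record.

    Field names and truncation rules are hypothetical stand-ins for the
    techniques described: dropping PII, truncating identifiers, coarsening
    timestamps to dates, and truncating call numbers.
    """
    out = dict(record)
    # Remove direct identifiers entirely.
    for pii_field in ("name", "email", "phone", "address"):
        out.pop(pii_field, None)
    # Truncate the patron ID so it no longer identifies an individual
    # but still supports coarse grouping.
    out["patron_id"] = str(out["patron_id"])[:3]
    # Replace the full timestamp with just the date.
    ts = datetime.fromisoformat(out.pop("checkout_timestamp"))
    out["checkout_date"] = ts.date().isoformat()
    # Truncate the call number to its classification stem.
    out["call_number"] = out["call_number"][:6].rstrip()
    return out

record = {
    "patron_id": 1048576,
    "name": "Jane Doe",
    "email": "jane@example.org",
    "checkout_timestamp": "2024-03-15T14:32:07",
    "call_number": "025.04 YOO 2015",
}
clean = deidentify_record(record)
```

Truncation (unlike reversible pseudonymization) is lossy by construction, which is why it suits identifiers that will be retained in a warehouse for analysis.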
This document discusses the use of virtual worlds in education. It notes that Second Life is currently the most widely used virtual world by UK academics, with activity at nearly all UK universities but more patchily in colleges. The main subject areas that utilize virtual worlds are listed as health and medicine, physics simulations, art and fashion, crime/forensics, midwifery, mechanics/history, and micro-level simulations. However, attitudes towards virtual worlds are mixed, with some seeing benefits but others raising concerns about usability, costs, and appropriateness for educational use.
This document summarizes findings from snapshots tracking the use of virtual worlds like Second Life in UK higher education. It finds that the number of institutions and users has grown substantially since 2007. While funding has increased, academics still desire more time and resources. Teaching practices are increasing but effectiveness is not always measured. Other virtual worlds discussed include OpenSim, Wonderland, and Google Lively. The document advocates continuing to explore virtual worlds while recognizing Second Life may not be optimal long-term. It announces a new virtual world watch project to further track trends.
Virtual World Watch - summary of Second Life Snapshots (Eduserv Foundation)
The document summarizes snapshots of UK university and college activities in virtual worlds like Second Life. It finds that the number of institutions and funded projects using virtual worlds for teaching, research, and marketing has grown significantly since the first snapshot in 2007. While Second Life remains prominent, alternatives like OpenSim and Wonderland are increasingly being explored. Key challenges mentioned include the need for more funding, technical support, and evidence of educational effectiveness. The use of virtual worlds in education is expected to continue growing but may take many years to become mainstream.
1. The document discusses the history and trends of using virtual worlds in education from 2006 to 2014.
2. It notes that while usage has grown steadily among some practitioners and universities, adoption in the UK is slower than in the US.
3. Challenges include the learning curve for students, difficulties with collaboration and communication tools, and testing learning outcomes quantitatively.
4. Short-term predictions include virtual worlds like Second Life remaining viable while niche uses are explored, and rising costs driving more remote events and collaboration; long-term predictions are difficult to make.
Surveying Virtual World use in UK Higher and Further Education (Silversprite)
This document discusses the use of virtual worlds and digital games in European libraries based on surveys conducted in the UK. It finds that nearly all UK universities have explored or implemented the use of Second Life for research, teaching, or learning activities. Examples include simulations of dangerous environments, campus tours, filmmaking projects, and medical equipment simulations. While Second Life remains popular, some institutions are experimenting with alternative virtual worlds like OpenSim, which is seen as more flexible and open-source. Overall virtual worlds are seen as a way to enhance lifelong, distance, and flexible learning.
Choosing the best virtual world for your teaching needs (Silversprite)
This document discusses criteria for choosing the best virtual world for educational needs. It provides perspectives from educators on choosing worlds based on factors like available expertise and support, pedagogical fit, functionality, sustainability, and ability to reuse content. Key considerations include the size of an active educator community, local technical expertise, ability to integrate other services, longevity of the world, and options to transfer content between worlds. Compromise may be necessary as no single world excels in all areas.
A slide deck from 1997, illustrating an academic digital library consortium "project" and all of the things that went wrong with "it". More detail on my work website:
http://www.silversprite.com/?page_id=2242
Learning through falling: Second Life in UK academia (Silversprite)
This document discusses the use of virtual worlds for teaching and learning in UK academia. It addresses common questions and concerns about using virtual worlds, provides examples of how universities are using them, and offers tips for securing funding. Some of the benefits highlighted include increased access to lifelong learning, lower costs compared to physical infrastructure, and the ability to be anywhere and attend anything. The document argues that virtual worlds are a useful educational tool when used for the right purposes.
This document summarizes a presentation about meeting federal data sharing requirements. It discusses the history of these requirements and defines good practices for data sharing and stewardship. It also reviews some public data sharing services and provides tips for evaluating them. Key aspects of good data sharing include maximizing access, protecting privacy, ensuring proper attribution, and having long-term preservation and sustainability plans. The presenter emphasizes that restricted-use or sensitive data can be effectively shared through secure virtual environments.
This document discusses maintaining patron confidentiality in libraries. It begins by distinguishing between privacy, which is about individuals, and confidentiality, which is about data. It notes that while libraries are public spaces, laws require protecting identifiable patron data. It then examines expectations of privacy for physical and virtual library use. The document outlines elements that should be included in a patron disclosure policy, such as what data is collected and retained. It reviews relevant confidentiality statutes and discusses issues that can arise, like handling requests from law enforcement. Finally, it provides examples of how to handle common confidentiality questions and resources for developing strong confidentiality policies and procedures.
This handout accompanies slides -- Developing a data mindset to improve stories every day -- taught by Brant Houston at Illinois NewsTrain on April 1, 2022. Houston is the Knight Chair in Investigative Reporting at the University of Illinois, where he oversees an online newsroom, CU-CitizenAccess.org. For more info on the News Leaders Association's NewsTrain, see https://www.newsleaders.org/newstrain.
Northeastern Ohio nonprofit innovators met for the first annual Big Data for a Better World conference on November 16 at Hyland Software's sprawling Westlake, Ohio campus. Leading Hands Through Technology (LHTT) and Workman’s Circle teamed up to offer the event so local nonprofits could discuss how analytics could be used successfully to keep their organizations financially sustainable and ultimately improve the community.
Slides from Thursday 2nd August 2018 - Data in the Scholarly Communications Life Cycle Course which is part of the FORCE11 Scholarly Communications Institute.
Presenter - Natasha Simons
This document provides an introduction to data management. It discusses why data management is important, covering key aspects like developing data management plans, file organization, documentation and metadata, storage and backup, legal and ethical considerations, sharing and reuse, and preservation. Effective data management is critical for research success as it supports reproducibility, sharing, and preventing data loss. The document outlines best practices and resources like the library that can help with developing strong data management strategies.
Data Sharing with ICPSR: Fueling the Cycle of Science through Discovery, Acce... (ICPSR)
Data Sharing with ICPSR was presented at IASSIST 2015 in Minneapolis, MN.
The learning objectives and content cover:
- Federal data sharing requirements and other good reasons to share data
- Options for sharing data
- Protection of confidentiality when sharing data
- Data discovery tools
- Online data exploration tools from ICPSR
Librarian RDM Training: Ethics and copyright for research data (Robin Rice)
This document provides an overview of ethics and copyright as they relate to research data management. It discusses ethical requirements for collecting human subject data, including obtaining consent and protecting privacy and confidentiality. Certain types of research may be exempt from ethics review. Intellectual property rights can apply to research data depending on the level of creativity in the data's collection and organization. Data licensing is an alternative to asserting copyright that allows explicitly defining how data can be used.
The art of depositing social science data: maximising quality and ensuring go... (Louise Corti)
The document provides guidance for depositing data into a research data repository. It discusses incentivizing researchers to share data, developing data policies, reviewing data for quality and disclosure risks, preparing documentation, assigning licenses, and providing support to depositors. The role of the repository manager is to work with depositors to prepare data according to best practices and the repository's standards to ensure long-term preservation and access.
Data Literacy: Creating and Managing Research Data (cunera)
This document discusses best practices for creating and managing research data. It covers defining data, the importance of data management, developing a data management plan, file naming conventions, metadata, data sharing and preservation. Key points include making a data management plan addressing types of data, standards, access and sharing policies; using descriptive file names with dates; storing multiple versions of data; and including metadata to explain the data. Resources for data management support are provided.
This document provides an overview of data visualization and data journalism. It defines data visualization as using visual formats like charts and interactive applications to tell stories with data. The field of data journalism grew out of computer-assisted reporting and uses data analysis and visualization techniques to enhance stories. The document discusses the types of skills required of data journalists and provides examples of notable data visualization projects.
This document provides an introduction to data management. It discusses the importance of data management and introduces best practices. These include making a data management plan, properly organizing and naming files, adding descriptive metadata, securely storing and backing up data, considering legal and ethical issues, enabling sharing and reuse, and ensuring long-term preservation. Effective data management is important across all disciplines and throughout the entire data lifecycle from creation to archiving.
Meeting Federal Research Requirements for Data Management Plans, Public Acces... (ICPSR)
These slides cover evolving federal research requirements for sharing scientific data. Provided are updates on federal agency responses to the 2013 OSTP memo, guidance on data management plans, resources for data management and curation training for staff/researchers, and tips for evaluating public data-sharing services. ICPSR's public data-sharing service, openICPSR, is also presented. Recording of this presentation is here: https://www.youtube.com/watch?v=2_erMkASSv4&feature=youtu.be
Sharing personal data and the GDPR - how can it be done - Francisco Romero Pa... (SURFevents)
The document discusses how to properly share personal data according to GDPR regulations. It outlines four key steps: 1) define the purpose for sharing, 2) protect the data being shared through measures like anonymization, 3) inform the individuals behind the data about the processing, and 4) document the sharing and processing to demonstrate compliance. It provides guidance on determining lawful bases for processing, ensuring valid consent is obtained, protecting sensitive data, informing data subjects of their rights, and conducting privacy risk assessments to design a safe data processing procedure.
Use of data in safe havens: ethics and reproducibility issues (Louise Corti)
The document discusses ethics and reproducibility issues related to using data in safe havens. It summarizes the UK Data Service, which curates social science data and uses various safeguards to provide access to controlled data through its spectrum of access. It describes legal gateways for data sharing, the Digital Economy Act, and the UK Statistics Authority's accreditation process for researchers and projects. It also discusses the UK Statistics Authority's ethics self-assessment tool and factors that can impact reproducibility when data and code are behind access restrictions in safe havens.
As the number of information requests increases exponentially, organizations worldwide can no longer process all of them in time. Large government organizations have huge departments handling these requests, in some cases up to 10 times larger than their internal legal departments. This underscores the urgency of bringing more automation to the handling of public records requests.
When handling public records requests, many possible levels of automation exist to optimize the process, use resources more effectively, and deal with increasing data volumes.
Workshop session delivered alongside the 'Making your thesis legal' workshop in July and September 2013 to PhD, MPhil, and DrPH students who are completing their theses. Discusses standards for sharing data, issues that need addressing, formats, data protection, usability, and licenses.
This document provides an overview of best practices for managing research data. It discusses why data management is important, how to plan for data management by inventorying data, assessing needs, and planning processes. It also covers topics like file formats, documentation, metadata, methods, standards, and storage considerations for both short and long-term. The document emphasizes documenting all decisions and processes, using open standards when possible, and partnering with libraries or repositories for long-term preservation of shared data.
This document discusses best practices for preparing and sharing research data. It emphasizes obtaining proper consent from participants, performing a risk analysis to avoid re-identification, and considering appropriate sharing methods such as data repositories. Sharing data benefits the research community by encouraging new collaborations and validation of results, but must be balanced with obligations to protect participants and intellectual property. The document provides guidance on topics like data licensing, anonymization, and the policies of research institutions and journals regarding data sharing.
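One simple, concrete form of the re-identification risk analysis mentioned above is a k-anonymity check: count how many records share each combination of quasi-identifiers, and flag any combination held by fewer than k people for suppression or generalisation before sharing. A minimal sketch (the field names are hypothetical):

```python
from collections import Counter

def risky_groups(records, quasi_identifiers, k=5):
    """Return quasi-identifier combinations shared by fewer than k records.

    Records in such small groups are at higher risk of re-identification
    and should be suppressed or generalised before the data are shared.
    """
    counts = Counter(
        tuple(r[field] for field in quasi_identifiers) for r in records
    )
    return {combo: n for combo, n in counts.items() if n < k}

records = [
    {"age_band": "30-39", "postcode": "EH8", "condition": "A"},
    {"age_band": "30-39", "postcode": "EH8", "condition": "B"},
    {"age_band": "70-79", "postcode": "IV2", "condition": "C"},
]
# With k=2, the single 70-79/IV2 record is flagged as risky.
flagged = risky_groups(records, ["age_band", "postcode"], k=2)
```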
Similar to De-identifying Patron Data for Analytics and Intelligence
Bibliographic Data Spring Cleaning with Sierra DNA (Becky Yoose)
This document discusses using Sierra DNA to clean up bibliographic records. It provides examples of SQL queries that can be used to find typos, irregular fields, or other issues. Specific examples shown find the word "fom" surrounded by spaces, which is likely a typo. The document encourages exploring different types of queries that could help clean up authority headings, subfields, series statements and other areas. It concludes by thanking the audience and inviting questions.
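The document's own example, finding the stray word "fom" surrounded by spaces, boils down to a LIKE query over variable-field contents. Here is a self-contained sketch using SQLite as a stand-in for Sierra's SQL interface; the table and column names below are simplified stand-ins, not Sierra's actual schema:

```python
import sqlite3

# A toy stand-in for a bibliographic variable-field table; real Sierra
# queries would run against its PostgreSQL-based SQL views instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE varfield (record_id INTEGER, field_content TEXT)")
conn.executemany(
    "INSERT INTO varfield VALUES (?, ?)",
    [
        (1, "Selections fom the letters of John Adams"),
        (2, "Selections from the diaries of Abigail Adams"),
        (3, "Notes fom the field"),
    ],
)

# Find fields containing ' fom ' (a likely typo for 'from').
rows = conn.execute(
    "SELECT record_id, field_content FROM varfield "
    "WHERE field_content LIKE '% fom %' ORDER BY record_id"
).fetchall()
```

Variants of the same pattern (different LIKE expressions, or regular-expression matching where the database supports it) cover the other cleanup targets mentioned: irregular subfields, series statements, and authority headings.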
Bibliographic Data Spring Cleaning with Sierra DNA - Handout (Becky Yoose)
This document describes several tables in the Sierra bibliographic database that contain bibliographic metadata. It includes tables for bibliographic records, descriptive data, views, variable fields, and subfields. These tables store data like titles, authors, subject headings, and other bibliographic information that can be queried, indexed, and used for reporting. The document also provides references and resources for working with Sierra's SQL database and MARC standards.
Presented at code4lib 2015, February 10th, in Portland, OR.
“If you have something to say, then say it in code…” - Sebastian Hammer,
code4lib 2009
In its 10 year run, code4lib has covered the spectrum of libtech
development, from search to repositories to interfaces. However, during
this time there has been little discussion about this one little fact
about development - code does not exist in a vacuum.
Like the comment above, code has something to say. A person’s or
organization’s culture and beliefs influences code in all steps of the
development cycle. What development method you use, tools, programming
languages, licenses - everything is interconnected with and influenced
by the philosophies, economics, social structures, and cultural beliefs
of the developer and their organization/community.
This talk will discuss these interconnections and influences when one
develops code for libraries, focusing on several development practices
(such as “Fail Fast, Fail Often” and Agile) and licensing choices (such
as open source) that libtech has either tried to model or incorporate
into mainstream libtech practices. It’ll only scratch the surface of the
many influences present in libtech development, but it will give folks a
starting point to further investigate these connections at their own
organizations and as a community as a whole.
tl;dr - this will be a messy theoretical talk about technology and
libraries. No shiny code slides, no live demos. You might come out of
this talk feeling uncomfortable. Your code does not exist in a vacuum.
Then again, you don’t exist in a vacuum either.
A selected bibliography of the talk can be viewed at http://is.gd/c4l15yoose.
Presented at Open Repositories, June 11, 2014
This rant focuses on the issues faced by many GLAM organizations who do not feel represented in the mainstream open source repository community, and calls on the leaders of that community to consider what can be done to make the community more inclusive. There are examples of smaller organizations gathering resources to create a supportive community around specific open repository products, one of which will be highlighted in the talk. In general, however, there is very limited support from the greater open repository community for smaller institutions in terms of participating in the community, and the main goal of this rant is to get those who are in a position in the community to consider what can be done to make the community more inclusive and sustainable to all organizations.
Poster: Using Open Source Tools to Improve Access to Oral History Collections (Becky Yoose)
Presented at the Library Technology Conference 2011 in St. Paul, MN.
Program Description: Oral history collections provide a wealth of information, yet current practices in metadata creation and access limit how much of the information within interview transcripts can be discovered. This poster describes the Miami University Libraries' current project of using open source software to create enhanced access to our Oral History collection. The Oral History Project at Miami University contains over 100 interviews pertaining to experiences at the University, with transcripts for over half of the interviews. The poster will describe the process of batch processing transcripts using OpenCalais, a web service that automates the creation of metadata for content using natural language processing and machine learning, and of displaying both the transcripts and metadata in the content management system Drupal using various modules. We will discuss the results of comparing machine-generated and human-generated metadata in this project and the benefits and concerns surrounding both methods. Future project developments will also be included.
Taming the communication beast: Using LibGuides for intra-library communication - Becky Yoose
Presented at the Library Technology Conference 2011, St. Paul, MN.
Program Description: Black box. The back room. Behind closed doors. Such is the life of a Technical Services department. The literal and figurative disconnect between TS and the rest of the library can cause confusion and misinformation among library staff which, in turn, decreases the quality of service the library provides to its patrons. How can we use technology to overcome this disconnect? Many librarians have been successful using LibGuides as a tool to disseminate information from librarians to students and researchers. At Miami University, the TS department decided to experiment by creating a LibGuide to exchange information between TS and other library staff. By incorporating the department's WordPress blog, Google forms and spreadsheets, and RSS feeds into the LibGuide, we have created a place for dynamic interchange of essential information among library staff. The TS department uses the LibGuide to distribute information about our departmental policies and
procedures, as well as e-resource issues and statistics. All library staff can access this information as well as interact with TS staff by using the LibGuide to submit OPAC error reports, e-resource access problems, and report requests. This presentation will explore the planning, building, and implementation of the LibGuide and its evolution since its launch. There will be discussion on the pros and cons of the approaches tried, and on the considerations a department must take into account before building an online presence for intra-library communications.
Presented at ALAO, October 29, 2010.
Program Description: Being separated, whether by principle or physically, from the public side of the library can prove challenging for Technical Services departments in communicating and interacting with the rest of the library staff and patrons. How can Technical Services departments overcome these disconnects? This presentation will survey the different online efforts by several TS departments to facilitate communication with other library departments and beyond, including Miami University’s Technical Services LibGuides page.
Pack Light: Changes in Technical Services Staffing & Workflow - Becky Yoose
The document discusses changes in technical services staffing and workflows at libraries. It recommends eliminating low-value tasks like manual transcription, item-by-item decisions, redundant systems, and procedures that introduce error. Some specific suggestions are to stop routing print journals if they are available online, automate repetitive tasks using macros and scripts, and review workflows to identify those that are no longer needed. The talk encourages focusing resources on critical tasks and leveraging automation through batch processing and third-party applications.
Using AutoIt for Millennium Task Automation Handout - Becky Yoose
Handout to IUG presentation, 4/20/10.
Description: Side one: List of questions to keep in mind while looking for an automation tool. Also contains several automation products, with cost and web address. Side two: Summary of some of the uses of AutoIt at Miami University Libraries. Includes links to screencasts, scripts, and documentation.
Using AutoIt for Millennium Task Automation - Becky Yoose
Co-presented at the Innovative Users Group Conference, April 20, 2010.
Program Description: Many libraries already use program-specific macros to automate tasks in Millennium, OCLC, and other programs. While these macros help with task automation, they are limited to their specific programs. Several libraries have turned to 3rd party automation tools, including AutoIt, in efforts to automate tasks in Millennium and other programs. AutoIt is a Windows automation scripting language that has seen successful implementations at various libraries. This presentation will explore the various ways AutoIt is utilized in academic and public library settings as well as the benefits and consequences of using AutoIt for Millennium task automation.
Handout for the ALAO 10/30/09 presentation.
Presentation description: Overview: What can technical services departments do to tackle automation, training, and communication issues when they have been doing more with less for years? This presentation will cover several free (or inexpensive) and easy to implement tools that Miami University’s Technical Services department has incorporated into daily operations. Specific software, including wikis, macros, and tutorials will be discussed along with quick ways these tools can help technical services address the above issues.
Presented at ALAO, 10/30/2009.
Presentation Description: Same as the handout description above.
Presented as a HANDS ON! session at Eastern Great Lakes Innovative Users Group 2009 conference. I will add the links to the example scripts once they are loaded to the conference web site.
Program description:
AutoIt is a freeware Windows automation program that has helped many libraries automate Millennium workflows. Its flexibility and power allows users to build scripts that go well beyond automating keystrokes in cataloging, ordering, and in other processes. This hands-on session will introduce the participants to the basics of AutoIt programming (such as variables, conditional statements, and functions) and provide several examples where AutoIt has been utilized in Millennium workflows. The session will be geared towards those who have limited programming or OCLC Connexion macro scripting experience.
This document provides an overview of AutoIt, including installation notes, a comparison of its features to another language, types of functions available, and some commonly used commands. It highlights that the current version is 3.3.0.0 but there is a bug in the help file. It also lists some references for additional AutoIt information and resources.
But I'm Not A Techie! Technical Tools for Technical Services - Becky Yoose
It looks like Slideshare had some issues with some of the presentation formatting. Otherwise, enjoy.
Presented at WAAL 2009.
Program description:
What can technical services departments do to tackle ongoing issues surrounding automation, training, and communication? Many perceive that, because of lack of financial resources and computer skills, technological tools are out of their reach. While true for some, many other technological tools do not require deep pockets or computer expertise to use. This presentation will cover several free (or inexpensive) and easy to implement technological tools that the Technical Services department at Miami University has incorporated into its daily operations. Specific software, including wikis, macros, and tutorial creation will be discussed along with quick ways these tools can help address the above issues.
2. Presentation Outline
• Overview of Data Management Principles, Policies,
and Practices
• Balancing Library Strategic Needs with Patron
Privacy
• SPL Methods of Data De-identification
• Summary
4. ALA Data Management
Guidelines
• Collection of personally identifiable information
• only when necessary to fulfill the mission of the library
• Should not share personally identifiable user
information with third parties, unless
• the library has obtained user permission
• has entered into a legal agreement with the vendor
• Make records available [to law enforcement
agencies and officers] only in response to properly
executed orders.
“An interpretation of the Library Bill of Rights.”
http://www.ala.org/advocacy/intfreedom/librarybill/interpretations/privacy
5. State Law: Revised Code of WA
• RCW 42.56.310: Library Records
• Any library record, the primary purpose of which is to
maintain control of library materials, or to gain access to
information, that discloses or could be used to disclose
the identity of a library user is exempt from disclosure
under this chapter.
• RCW 19.255.010: Disclosure, notice — Definitions
— Rights, remedies.
• First & last name combined with SSN, DL #, credit/debit
card number, authentication credentials, “account
number”
6. SPL Confidentiality of Patron Data
• It is the policy of The Seattle Public Library to
protect the confidentiality of borrower records as
part of its commitment to intellectual freedom.
• The Library will keep patron records confidential
and will not disclose this information except
• as necessary for the proper operation of the Library
• upon consent of the user
• pursuant to subpoena or court order
• as otherwise required by law.
The Seattle Public Library. “Confidentiality of Borrower Records.”
http://www.spl.org/about-the-library/library-use-policies/confidentiality-of-borrower-records
7. SPL Data Management Practices
• All records connecting a patron to an item that has
been held or borrowed, or to an information
resource that has been accessed, are deleted upon
the successful fulfillment of the transaction.
• Circulation records
• Public computer reservations
• Workstation use data (log files, caches, histories)
• Network logs
8. NIST: Two-part Definition of PII
1. Any information that can be used to distinguish
or trace an individual‘s identity, such as name,
social security number, date and place of birth,
mother‘s maiden name, or biometric records
2. Any other information that is linked or linkable to
an individual, such as medical, educational,
financial, and employment information
a) Libraries extend the second point by including
borrowing and information seeking activity
National Institute of Standards and Technology, via the Government Accountability Office's
expression of an amalgam of the definitions of PII from Office of Management and Budget
Memorandums 07-16 and 06-19. May 2008, http://www.gao.gov/new.items/d08536.pdf
12. “Unanswerable” Questions
• Longitudinal questions (e.g. trended analysis)
• Long-term (multi-year) rather than snapshot
• Trends, correlations, changes in behavior – not
necessarily individual activity
• Questions only about the type of behavioral use of
Library programs and services:
•“Do teen patrons remain active in their 20s?”
•“Do people use their neighborhood branch or
use the branch where relevant materials are?”
(e.g. Chinese language collection)
13. Privacy Requirements
• Passionate commitment to intellectual freedom
• Recognition that some patrons have no alternatives
• Intellectual content of transactions should always
be purged
• Avoid keeping records that show person’s
whereabouts
14. Threats to patron privacy
• Law enforcement
• Seeking intellectual pursuit data
• Seeking patron whereabouts
• Hackers
• Library and ILS data
• Data leak
• Reconstruction of identity via data
• Embarrassment/loss of trust
• Notification costs
15. AOL data release (2006)
There was no personally
identifiable data provided by
AOL with those records, but
search queries themselves
can sometimes include such
information.
This was a
screw up
TechCrunch. “AOL: ‘This was a screw up’.” August 2006. http://techcrunch.com/2006/08/07/aol-this-
was-a-screw-up/
16. AOL Example: User 4417749
numb fingers
60 single men
dog that urinates on everything
robert arnold
marion arnold
john arnold georgia
homes sold in shadow lake
subdivision gwinnet county georgia
landscapers in Lilburn, GaThelma Arnold
17. Target(ed) Marketing
The Incredible Story Of How Target Exposed A Teen Girl's Pregnancy
http://www.businessinsider.com/the-incredible-story-of-how-target-exposed-a-teen-girls-pregnancy-2012-2
18. Data for Good
In a 2012 Pew Research survey,
64% of respondents said they would be
interested in personalized online
accounts. 29% would be very likely to
use the services.
Over the last 30 days, 34% of new SPL borrowers
opted in to BiblioCommons borrowing history.
“7 Surprises About Libraries in Our Surveys”, Pew Research, 2012
http://www.pewresearch.org/fact-tank/2014/06/30/7-surprises-about-libraries-in-our-surveys/
20. Data De-identification
• Designed to protect individual privacy while preserving
some of the dataset’s utility for other purposes.
• Make it hard or impossible to learn if an individual’s
data is in a data set, or determine any attributes about
an individual known to be in the data set.
• HIPAA: Data that does not identify an individual and
with respect to which there is no reasonable basis to
believe that the information can be used to identify an
individual
Garfinkel, Simson L. “De-Identification of Personally Identifiable Information.” April 2015.
NIST. http://csrc.nist.gov/publications/drafts/nistir-8053/nistir_8053_draft.pdf
Scott Nicholson and Catherine Arnott Smith, "Using Lessons From Health Care to Protect The
Privacy of Library Users: Guidelines For The De‐identification of Library Data Based on HIPAA,"
Proceedings of the American Society for Information Science and Technology 42, no. 1 (2005):
1198–1206.
21. SPL Methods of Data De-identification
Millennial Project
summary
28. Timestamps vs. dates
• Timestamp
Mon, 11 Apr 2016 11:02:43 +0000
• Date
4/11/2016
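The reduction above, keeping the day but discarding the time of day, can be sketched in a few lines of Python (a minimal illustration, not SPL's actual pipeline code):

```python
from datetime import datetime

def truncate_to_date(ts: str) -> str:
    """Reduce an RFC 2822-style timestamp to a plain date string.
    Dropping the time of day means a record can no longer show
    *when* a patron was at a workstation or branch, only the day."""
    dt = datetime.strptime(ts, "%a, %d %b %Y %H:%M:%S %z")
    return f"{dt.month}/{dt.day}/{dt.year}"

print(truncate_to_date("Mon, 11 Apr 2016 11:02:43 +0000"))  # → 4/11/2016
```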
29. Extract-Transform-Load

Patrons
  Data                PII?
  Barcode             Yes
  Name                Yes
  Address             Yes
  Email Address       Yes
  Phone Number        Yes
  Date of Birth       Yes
  Age                 No
  Gender              No
  Zip Code            No
  Registration Year   No

Circulation
  Data                PII?
  Barcode             Yes
  Title               Yes
  Author              Yes
  Call Number         Yes – truncate it
  Item Type           No
  Branch              No
  Date                No

Sample de-identified output (Patrons joined with Circulation):
  Age  Gender  Zip    Reg Year  Item Type  Dewey 100  Branch  Date
  45   Male    98117  2004      CD         700        CEN     4/1/15
  45   Male    98117  2004      Book       FIC        BAL     4/3/15
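The transform step implied by the tables above, dropping PII-1 fields, truncating the call number to its Dewey hundreds class, and keeping only the non-identifying columns, might look like this sketch (the field names and the pass-through rule for non-numeric classes like FIC are assumptions, not SPL's actual schema):

```python
# Hypothetical ETL transform: strip PII before loading the warehouse.
PATRON_KEEP = ("age", "gender", "zip", "reg_year")  # barcode/name/etc. dropped
ITEM_KEEP = ("item_type", "branch", "date")         # title/author dropped

def truncate_call_number(call_no: str) -> str:
    """Reduce a Dewey call number to its hundreds class (e.g. 782.42 -> 700)
    so the broad subject survives but the specific title does not.
    Non-numeric classes (FIC, CD, etc.) pass through unchanged."""
    head = call_no.split()[0]
    try:
        return str(int(float(head)) // 100 * 100)
    except ValueError:
        return head

def transform(patron: dict, item: dict) -> dict:
    """Join one patron record with one circulation record, keeping
    only non-PII columns plus the truncated call number."""
    row = {k: patron[k] for k in PATRON_KEEP}
    row.update({k: item[k] for k in ITEM_KEEP})
    row["dewey_100"] = truncate_call_number(item["call_number"])
    return row
```

For example, a 45-year-old male patron in ZIP 98117 who checks out a CD classed 782.42 at the Central branch would load as the row (45, Male, 98117, 2004, CD, 700, CEN, 4/1/15).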
31. Belt & suspenders
• To identify a specific patron’s transaction, you’d
need to
• Breach ILS
• Recreate hash algorithm
• Breach data warehouse
• Look up patron
• Even then
• No intellectual content or whereabouts
• Only the fact of types of transactions
• Strict, clear policies for staff
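The "recreate hash algorithm" step above implies the warehouse stores patron barcodes only as keyed hashes. A minimal sketch using HMAC-SHA-256 (SPL's actual algorithm and key management are not disclosed; this is illustrative only):

```python
import hashlib
import hmac

# Hypothetical secret key, held outside the warehouse (e.g. only in the
# ETL environment), so a warehouse breach alone cannot reproduce hashes.
SECRET_KEY = b"rotate-me-and-store-me-elsewhere"

def anonymize_barcode(barcode: str) -> str:
    """Keyed one-way hash of a patron barcode. The same barcode always
    yields the same digest, so behavior can be linked across warehouse
    rows, but reversing the digest requires both the key and the ILS."""
    return hmac.new(SECRET_KEY, barcode.encode(), hashlib.sha256).hexdigest()
```

A keyed hash matters here because barcodes come from a small, structured number space: an unkeyed hash could be defeated by simply hashing every possible barcode and comparing.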
33. Express Computer Policy
Are patrons abusing 15-minute “Express”
workstations?
Old policy
• 90 minutes per day for “Internet” workstation
• 15 minutes per day for “Express” workstation
• Total of 90 minutes per day for any workstation
• Allowed for “serial” use of Express workstations (6x per
day)
• Staff noticed (anecdotally) that “a lot” of patrons were
chaining Express sessions together
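With barcodes reduced to hashes and timestamps to dates, the chaining question above can still be answered without identifying anyone, e.g. by counting Express sessions per hashed patron per day. A sketch (field names are hypothetical):

```python
from collections import Counter

def chained_express_share(sessions, threshold=3):
    """Return the share of (hashed patron, day) pairs with at least
    `threshold` Express sessions - i.e. how often patrons appear to
    chain 15-minute Express sessions together on a single day."""
    per_day = Counter(
        (s["patron_hash"], s["date"])
        for s in sessions
        if s["workstation_type"] == "Express"
    )
    heavy = sum(1 for count in per_day.values() if count >= threshold)
    return heavy / len(per_day) if per_day else 0.0

sessions = [
    {"patron_hash": "a1", "date": "4/1/15", "workstation_type": "Express"},
    {"patron_hash": "a1", "date": "4/1/15", "workstation_type": "Express"},
    {"patron_hash": "a1", "date": "4/1/15", "workstation_type": "Express"},
    {"patron_hash": "b2", "date": "4/1/15", "workstation_type": "Express"},
]
print(chained_express_share(sessions))  # → 0.5
```

This replaces the anecdotal "staff noticed a lot of chaining" with a measurable rate, while the hashed identifiers keep individual patrons unidentifiable.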
36. Data Warehouse Summary
• Store identifiable and non-identifiable information
in separate locations
• Avoid storing any intellectually significant or
identifiable data
• Build internal data stores that help segment
anonymized patrons via behavior and
demographics
• Use this Library-owned and -managed data store as
the source of information for external CRMs
38. Thank you
Becky Yoose
Library Applications and
Systems Manager
@yo_bj
Stephen Halsey
Director of Marketing
and Online Services
@halseyhoff
The Seattle Public Library
[Stephen]
Good afternoon, thanks for joining us, especially with so many good session choices right now AND, of course, this close to happy hour.
First introductions….
Becky Yoose is our Library Applications and Systems Manager and has recently started work at SPL and is now overseeing development of our data warehouse initiative.
I am SPL’s Director of Marketing and Online Services and I’ve been with the system almost 3 years.
Our data warehouse program started a couple of years ago.
A few people have worked on this presentation and the data warehouse who have since left (and come back).
I stepped in to assist Becky with this presentation. My last day is now tomorrow.
Becky and I now affectionately refer to the curse of this presentation, which I believe, as I was briefly stuck in the elevator earlier today!
When I started 3 years ago and I was building the marketing department and program, I discovered that the Library was regularly purging a treasure trove of useful data that could be used for insights into patron behavior.
I chatted with our Director of IT to see what we could do about this and the data warehouse initiative was born.
So, this session is about what SPL has done in the past couple of years to de-identify data to gain better insights into our patrons’ behavior without sacrificing their privacy.
I want to note that the same principles and privacy goals are in place regardless of whether you are using one of the many CRMs now available OR if you are compiling trended data via Excel or scribbling it on a piece of paper and using an abacus.
[Stephen]
[Stephen]
At the national level, libraries are guided by the Library Bill of Rights and the interpretations thereof by ALA.
On the matter of personally identifiable data, ALA advises libraries to collect only what is necessary to fulfill the mission of the library.
We, of course, argue that deriving patrons' needs and wants from the data trails they leave in our systems IS a necessary part of fulfilling our mission.
ALA further advises that we should not share personally identifiable information (or PII) with 3rd parties unless we are authorized to do so and/or we have agreement in place with 3rd parties to protect that data.
Finally, ALA recommends that we comply with law enforcement requests for data, but only if they are valid.
[Stephen] On the state level in Washington, there’s nothing that explicitly protects the privacy of library patrons or mandates the confidentiality of borrower records.
What we do have that’s useful in this context is an exemption from state records retention requirements for library records and a definition of “personally identifiable information” which we will talk more about in a minute.
[Becky] At SPL, our policy is probably similar to yours. We treat borrower records as confidential, meaning we won’t disclose them except under certain circumstances.
[Becky] In practice, we take advantage of our exemption from public records retention and delete records as soon as they cease to be useful to us. To be more specific, we sever the connection between who you are and what you borrowed once you return it to us. We also sever the connection between who you are and what you did on a public computer. We do not capture or store information about what movie you’re streaming right now.
Again, this is similar to how most libraries handle patron data. After these transactions are completed, we still know who you are and where you live – we keep a record of your account, for example. And we still know that “The Painter” by Peter Heller was checked out 127 times last week; nonetheless, we don’t know that you were the one who checked it out.
[Becky] Let’s come back to Personally Identifiable Information. Earlier we saw how Washington state defines PII: name, SSN, DOB, account number, and other details about you that can help someone uniquely identify that you are you.
That approach only looks at one facet of PII, however. The National Institute of Standards and Technology, by way of the Government Accountability Office, defines PII using essentially 2 facets – stuff about you, and stuff that you do that is linked to you. NIST specifically lists medical, educational, financial, and employment information. This kind of data, in sufficient quantities, can be used in certain circumstances to reverse engineer an identity. I might not know your name, but if I know all the classes you’re taking at the university and I get access to course rosters, I can, by process of elimination, narrow down or even determine precisely who you are.
[Becky] So we have two important facets of PII to consider whenever we approach a problem such as the one we’re here to talk about today.
PII-1 can be seen as data referring to a person,
and PII-2 is the data about, in the case of libraries, someone’s intellectual pursuits.
In our context, PII-2 includes the books that patrons read, the search queries they execute, and the web sites they visit. This data is extremely valuable and is collected all the time by companies - it helps determine what Amazon recommends you read or, more importantly, buy, and what Facebook suggests you should “Like” so that they can market more products to you. Libraries should be sanctuaries from this kind of detailed data gathering that has become commonplace in the information age… yet we also need to gain insights and intelligence into the rapidly changing and evolving needs and wants of our patrons so that we can better serve them and continue to be a vital resource to our communities.
[Stephen]
Again, from the perspective of Marketing and Public Services, we want to use this data to learn:
How to effectively allocate resources (both financial and human) in the
Marketing of our programs and services to ALL audiences (including non-users, disengaged and underserved audiences)
Tracking effectiveness of marketing efforts
Development of our programs and services to what our patrons desire and expect
Tracking success and satisfaction of our patron’s engagement with our programs and services
ALL while balancing library strategic needs with patron privacy
[Becky] At the beginning of the project -
We have a lot of data about what’s in our collection and the “behavior” of our collection independent of who is using it. For example, we know that a certain title has circed 231 times, or that items in the 910 call number range are popular.
We have some data about our users (patron account records) but nothing about their completed activities.
We make decisions about collections, services, facilities, etc. based on general demographics of a neighborhood, but not based on actual library use by residents.
We do not know if services are used a lot by a few individuals or used somewhat by a lot of individuals.
In short, we have data, but we could not use it in a way that would respect user privacy while at the same time informing where library resources should go in terms of programs and services.
[Stephen]
This leads into unanswerable questions that are essential for us to answer if we want to allocate resources effectively. This includes identifying and serving underrepresented patrons and our non-users in general.
And it means keeping data for more than a year.
A couple of questions to illustrate the above point:
People in their 20s (aka Millennials) are generally not active users of the public library.
We see a lot of teens and children, and then people come back when they become parents themselves. (I call this the church effect)
For the patrons who *are* active in their 20s, had they been active users in their teens?
We also have a high circ of materials in certain languages in certain branches, for example, Chinese-language materials at our International District branch.
What is the zip code of patrons who check out those materials—is IDC their home branch or are they traveling from elsewhere in the city to get those materials?
In Seattle, we are growing incredibly fast, and there are shifting demographics from neighborhood to neighborhood that make understanding neighborhood information even more important.
[Becky] When thinking of ways to solve this problem, we had to consider the following:
SPL is very committed to free inquiry & privacy.
Many users can’t “opt out” of using our service. With several online services you can say “Well, if you don’t like the terms of use, just don’t use it or use some other service you like better.” We recognize, however, that some of our users don’t have alternatives for services like the internet or the opportunity elsewhere to do their information seeking privately.
As a principle we always want to purge the “intellectual content” of a transaction.
We also want to avoid keeping records that show a person’s whereabouts or specific activities at a certain place or time. This is too much like surveillance.
[Becky] If you’re trying to design a solution, you have to take into account any perceived threats. The one we talk about most is law enforcement or government. Libraries may receive requests or subpoenas for information on specific individuals or locations. In the analysis of risk, we balance the likelihood of something happening versus the harm that would result. These requests from law enforcement may not be very likely, but we perceive that they would inflict considerable harm on privacy and intellectual freedom, and we take them most seriously.
We often hear about organizations who have lost control of their data to hackers; it seems like every week we have a news story about a major data breach. We don’t hold credit card numbers, social security numbers, or other data likely to have substantial monetary value to hackers. A lot of the data we hold, for example, addresses or phone numbers, can be found in a phone book. On the other hand, some patrons may have protection orders, unlisted phone numbers, sensitive jobs, or other reasons that this information must remain confidential. We should always practice good security for our systems, but this is not a special threat for libraries.
Another risky scenario is an accidental leak of data or even an intentional release of data that is *believed* to be scrubbed of personally identifiable information, but which still can be reconstructed.
[Becky] For example - In 2006, AOL released millions of search queries to researchers. AOL believed that it had removed PII from the data it released, but the search logs contained so many specific queries (for addresses, institutions, etc.) that researchers were able to reconstruct specific identities from clusters of search terms identified as belonging to distinct persons.
This data may not have been used for the research AOL originally intended but it proved an important point about privacy.
[Becky] The New York Times profiled one of the cases. Some of the terms that belonged to user 4417749 were fairly generic and may only give you hints about the user’s age, marital status, and that he or she was probably the owner of a dog – an incontinent dog. There was also a flurry of search terms that included the same last name. Researchers knew from previous research that people tend to search a lot for themselves and relatives, and so they inferred that since so many search terms were for the last name “Arnold,” user 4417749 might be an Arnold. Finally, they found search terms that provided more clues about 4417749’s whereabouts, which happened to be a very small town in Georgia. By process of elimination, then, they were able to pinpoint … Thelma Arnold, a 60-something widowed dog-owning Georgian who owned up to all the search terms that AOL had labeled as belonging to 4417749. The article does not explain whether Thelma ever found a treatment for her dog’s frequent urination problem.
[Stephen]
For a marketer, this story of Target is a great one, as it relates to targeted (no pun intended) marketing:
From Business Insider:
Target broke through to a new level of customer tracking with the help of statistical genius Andrew Pole, according to a New York Times Magazine cover story by Charles Duhigg.
Pole identified 25 products that, when purchased together, indicate a woman is likely pregnant. The value of this information was that Target could send coupons to the pregnant woman at an expensive period of her life.
Plugged into Target's customer tracking technology, Pole's formula was a beast. Once it even exposed a teen girl's pregnancy:
[A] man walked into a Target outside Minneapolis and demanded to see the manager. He was clutching coupons that had been sent to his daughter, and he was angry, according to an employee who participated in the conversation.
“My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?”
The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again.
On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”
It doesn’t get much more targeted, and, some will say, intrusive, than this.
But this is a great example of how specific data analysis can get.
[Stephen]
According to PEW:
One of the foundational principles of librarians is supporting the privacy of patrons. Librarians have long resisted keeping or sharing records of the book-borrowing or computer-using activities of their patrons. However, in the age of book-recommendation practices on all kinds of websites, many patrons are comfortable with the idea of getting recommendations from librarians based on their previous book-reading habits.
In a 2012 survey, 64% of respondents said they would be interested in personalized online accounts that provide customized recommendations for books based on their past library activity.
Some 29% said they would be “very likely” to use a service if it were made available by their library.
Indeed, our own statistics show that this service was chosen by 34% of new borrowers over the last 30 days.
[Becky] If AOL had recognized, as we’ve discussed, that PII is actually a two-part concept, they would not have been surprised that even if you strip out PII-1, PII-2 may still be used to reconstruct an identity.
[Becky] How can we get to the sweet spot where we know enough about what patrons are doing with our stuff but not so much that we lose their trust or risk violating the agreements we have with them? We now turn to the concept of data de-identification. This is a very tricky thing to get right. A NIST white paper outlines definitions, approaches, and considerations for a data de-identification approach. The concept of de-identifying data also exists in the privacy juggernaut that is HIPAA – in healthcare, analyzing patient data is critical for developing new treatments and new approaches; nevertheless, personal identity is sacrosanct and must be protected at all times. In the article “Using Lessons From Health Care to Protect the Privacy of Library Users: Guidelines for the De-Identification of Library Data Based on HIPAA,” the authors enumerate the specific data elements that are considered to constitute PII according to HIPAA, which can be used as a general guideline for looking at library borrower data. While the original project leads were not aware of the article at the time, both the project and the paper share similar approaches to PII in libraries.
[Stephen]
Here is an example of one of those paper-and-Excel-and-abacus methods… in this case, one targeting patrons largely in their 20s
This was a grant funded project by the Allen Foundation in 2013 and 2014
Goal was to re-engage the Millennial population
But if we didn’t collect patron data in some way (in this case, identifiable by age and borrowing statistics), we could not have known whether we were meeting the goals of the project and fulfilling the data-reporting requirement of the grant
[Stephen]
In this case the goal was to increase usage by Millennials by 15% over the summer of 2014 through targeted marketing and programs
One way to provide better targeting is through segmentation
We engaged an agency to perform the research
5 subsegments were created from the Millennial demographic
Programs were then created based on the market research
Metrics were tracked on activity for all cardholders that fell into the Millennial segment, defined as those ages 18-30 at the time of the grant
We developed our successful pop-up library program as well as a successful cooking program based on this market research, and we actually met our goals for the grant
[Stephen]
This is an example of one of the personas that was created as a result of this market segmentation
De-identified data was collected and then mapped to the personas that were created during the project
Aruna Kumar is a persona representing this cluster of users, in this case, digital only
[Stephen]
The Horizon database had exports of usage for a few key metrics, with flags based on age:
New Borrowers
Boopsie Logins
Checkouts
Freegal Downloads
Overdrive Downloads
We were able to track usage with this method
We’d get weekly snapshots and then we would trend the data in Excel
This, though, was the lead-up to the better method of de-identification that Becky is about to walk you through
[Becky] The Millennial Project, in hindsight, was the first iteration of the Data Warehouse that we are currently building. We knew that this data was valuable, and we proved that by using this data we could achieve real-world impact. Nonetheless, we needed to evolve the Warehouse to balance patron privacy and data analysis. This next section will discuss how we are dealing with both PII-1 and PII-2 in our data de-identification scheme.
[Becky] Let’s start with PII-1 information. Using age instead of date of birth is a simple example of how you can keep data that is just as useful to us while significantly reducing the risk of identifying a patron.
Say we have the information that a non-fiction DVD was checked out by a patron. We can record that the patron was 41 years old on April 1, 2016, or we can record that their date of birth was March 15, 1975. Which is better? The age is. A person who is 41 years old on 4/1/16 could have been born on any date from 4/2/1974 to 4/1/1975.
There are a lot more people whose birthdays fall in that yearlong span than there are people whose birthdays are specifically on 4/1/75, so using the age is a lot “blurrier” and less identifiable. But if we’re interested in knowing what kinds of materials appeal to different ages, the age is just as good as the DOB.
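The age-over-DOB transformation described above can be sketched in a few lines. This is an illustrative example, not the library's actual code; the function name and fields are invented for this sketch.

```python
from datetime import date

def age_at(transaction_date: date, dob: date) -> int:
    """Whole-year age on the transaction date; only this number
    is stored, never the date of birth itself."""
    had_birthday = (transaction_date.month, transaction_date.day) >= (dob.month, dob.day)
    return transaction_date.year - dob.year - (0 if had_birthday else 1)

# The patron born 3/15/1975 is recorded simply as "41" on 4/1/2016.
print(age_at(date(2016, 4, 1), date(1975, 3, 15)))  # 41
```

Any of the roughly 365 birthdates between 4/2/1974 and 4/1/1975 would produce the same stored value, which is exactly the "blurriness" we want.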
[Becky] We’re also interested in hitting the right granularity of information about our materials. Using the full call number is obviously far too specific. There are many items in our collection that would have a unique call number, which would be just as bad as us saving the title of something you checked out.
The format goes too far in the opposite direction—it’s so vague that it gives us hardly any useful information about what materials are popular. For example, recent popular movies, foreign-language TV shows, documentaries, and instructional DVDs could all appear to have the same format. If all these are lumped together, we’d probably just see that DVDs are popular among all groups, which is not very useful.
The collection code seems promising, but as we have our codes set up, it is also too specific. Some collections, like adult fiction or even adult sci-fi/fantasy, are quite large. But other collections are as specific as “beginning ESL.” This is too much “intellectually identifiable info,” as we think of it, about what the patron is interested in.
Truncating the call numbers appears to be the best match for what we want. This gives us separate buckets for adult fiction or YA fiction, and Dewey decades or centuries, but wouldn’t boil down to individual items.
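A truncation rule along these lines might look like the following sketch. The exact bucketing scheme here (Dewey numbers to the decade, other collections to their leading prefix) is a hypothetical illustration, not the library's actual rule.

```python
import re

def truncate_call_number(call_number: str) -> str:
    """Coarsen a call number into a bucket that can't identify
    an individual item (hypothetical scheme for illustration)."""
    m = re.match(r"(\d{3})", call_number)
    if m:
        # Dewey call numbers collapse to their decade, e.g. 641.5972 -> "640s"
        return f"{int(m.group(1)) // 10 * 10:03d}s"
    # Non-Dewey collections keep only the leading prefix, e.g. "FIC SMITH" -> "FIC"
    return call_number.split()[0]

print(truncate_call_number("641.5972 RODRIGUEZ"))  # 640s
print(truncate_call_number("FIC SMITH"))           # FIC
```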
[Becky] Our everyday systems keep track of the exact time and date of an action. For example, a patron used a computer from 11:02am to 11:46am on April 11. We have no use for this level of detail in our analysis of patron behavior. We’d be interested to know that someone used a computer for 44 minutes on a Monday, though. By keeping the session length and the date, we can achieve that while not maintaining a record of someone’s whereabouts at a specific time and place.
You might think we’d be interested in knowing how many sessions start between 10 and 11am or 11am and 12pm, and so forth. That’s true, but we don’t need to know about the patrons who are doing those sessions at the same time, so that kind of data doesn’t belong in *this* database, our database for longitudinal analysis.
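The timestamp transformation described above is simple to sketch: keep the date and the session length, drop the clock times. Field names here are illustrative assumptions, not the warehouse's actual schema.

```python
from datetime import datetime

def blur_session(start: datetime, end: datetime) -> dict:
    """Record session length and date only; the exact start and
    end times never reach the warehouse."""
    return {
        "session_date": start.date().isoformat(),
        "day_of_week": start.strftime("%A"),
        "length_minutes": round((end - start).total_seconds() / 60),
    }

# The 11:02am-11:46am session becomes "44 minutes on a Monday."
rec = blur_session(datetime(2016, 4, 11, 11, 2), datetime(2016, 4, 11, 11, 46))
print(rec["day_of_week"], rec["length_minutes"])  # Monday 44
```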
[Becky] So, what is the database we are building? In creating a data warehouse, you do something called ETL, which is extract, transform, and load. You take the data from the operational database where it was being collected, you do whatever you need to change it into the kind of data you want to store for analysis long term, and then you load it into where you’re going to store it.
Here you can see a sample of the data that is stored in the ILS, data about a patron and an item that has circulated.
[x] This is stored in separate tables.
[x] Here you can see the data that is not personally or intellectually identifiable that we want to move. In the case of the call number, we have to transform it as we move it.
[x] Here it is, transformed and pulled together into records about each transaction.
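The ETL step just described can be sketched end to end. The table layout, field names, and sample values below are invented for illustration; they are not the actual Horizon/ILS schema.

```python
from datetime import date

# "Extract": toy operational tables standing in for the ILS.
patrons = {101: {"dob": date(1975, 3, 15), "home_branch": "Central"}}
items = {"i42": {"call_number": "641.5972 RODRIGUEZ"}}
checkouts = [{"patron_id": 101, "item_id": "i42", "when": date(2016, 4, 1)}]

def transform(co):
    """Join a checkout with patron and item data, keeping only
    de-identified fields: age (not DOB) and a truncated call number."""
    p = patrons[co["patron_id"]]
    item = items[co["item_id"]]
    age = co["when"].year - p["dob"].year - (
        (co["when"].month, co["when"].day) < (p["dob"].month, p["dob"].day))
    dewey = int(item["call_number"][:3])
    return {
        "age": age,
        "home_branch": p["home_branch"],
        "call_bucket": f"{dewey // 10 * 10:03d}s",
        "checkout_date": co["when"].isoformat(),
    }

# "Load": the warehouse receives only the transformed records.
warehouse = [transform(co) for co in checkouts]
print(warehouse[0])
```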
[Becky] The most secure thing for us to do would be to record all of the transactions with the age and home branch of the patron, and then you would have no connection to individual patrons. The problem with this is that it prevents us from answering some of our most important questions, like “do people who check out ebooks still use print?” We need to be able to match up those transactions—but we just need to know that person A is person A and person B is person B, not that John Smith is person A and Jane Doe is person B. That is essential for longitudinal analysis.
The best way we’ve found to do this is to record the borrower ID of the person who did the transaction, but to obfuscate it using a hashing algorithm, with a pinch of salt… while we are not making corned beef hash, it is a good representation of what we are doing. A hash is a way to ensure that borrower number 12345 is always turned into the same disguised ID, but there is no way to calculate it in reverse. Currently we are working with SHA-1 due to system constraints, but we are evaluating SHA-2’s compatibility within the warehouse.
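A minimal sketch of this salted-hash "DeID" idea is below. SHA-256 (a SHA-2 variant) is shown rather than the SHA-1 mentioned above, and the salt value is a placeholder; a real salt must be generated randomly and kept secret.

```python
import hashlib

SALT = b"library-wide-secret"  # placeholder only; never hard-code a real salt

def de_id(borrower_id: int) -> str:
    """One-way salted hash of a borrower ID. The same ID always
    maps to the same DeID, but the hash cannot be reversed."""
    return hashlib.sha256(SALT + str(borrower_id).encode()).hexdigest()

# Deterministic: borrower 12345 always gets the same disguised ID...
print(de_id(12345) == de_id(12345))  # True
# ...and distinct borrowers get distinct DeIDs.
print(de_id(12345) == de_id(12346))  # False
```

The salt is what blocks precomputed "rainbow table" attacks: without it, an attacker who knows the ID format could hash every possible borrower number and match the results.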
This isn’t bulletproof, because if you had access to our ILS database, you could repeat this procedure, to learn the disguised version of that patron’s ID.
However, this attacker would have to gain access to multiple, completely separate databases to pull that off.
https://www.flickr.com/photos/accidentalhedonist/5707034712/
[Becky] What this adds up to is a “belt and suspenders” approach, or an approach that has redundant failsafes baked into it.
If you wanted to use our system to gain information about a specific patron, you’d first need to breach our ILS to find that patron’s identity and ID. Then you’d need to somehow recreate the hash algorithm we’d used to find out how that patron is represented in our data warehouse, then breach the data warehouse, and then look them up there. Not to get computer science-y on you, but assuming you could even manage the part with the hashing, you’d still need to break into two databases, and this is our lower-risk scenario!
Okay, what about our government opponent who can force us to give them access and identify the patrons to them? Even then, remember that we have no details about titles, specific transactions, websites anyone visited, exact times or places that things happened. We only have a record of the fact that certain kinds of transactions occurred.
And what about leaks? We do not plan to wholesale share this information with anyone without first going through a rigorous vetting process, and staff already have strict, clear policies about how, when, and why they can access patron data—these policies are already in place for our ILS and would apply equally to this data.
[Becky] We are still in the early stages of the Data Warehouse; we have a few datasets in there, including door counts, computer session usage, branch information, and some preliminary ILS circulation and patron demographic data. Nonetheless the data we do have in the Warehouse has already proved valuable.
[Becky] Our libraries have both “Public Internet” workstations, which can be reserved for up to 90 minutes per day, and “Express” workstations, which can be used for 15 minutes without prior reservation. Patrons are permitted up to 90 minutes of computer use per day regardless of the type of workstation they use.
Library staff began to report that patrons were “frequently” logging on to the 15-minute Express workstations multiple times in succession, to the detriment (and annoyance) of patrons who were waiting patiently for a computer to open up. It seemed that some patrons had figured out that they could chain together up to six 15-minute sessions at the Express computers. This was obviously not the intended purpose of the Express stations. However, all we had to rely upon was anecdotal information from a few locations. With our traditional data collection practices, we had no way of telling if any given array of six sessions on an Express workstation were done by the same patron or by six different patrons.
[Becky]
Before de-identification, we only had the total number of sessions and minutes that Express workstations were used per day per branch. The de-identified data, however, gave us our answer. Since we now had public computer session data recorded along with a patron’s “DeID,” we could easily tell that, indeed, many, many patrons at all library locations were logging into Express workstations 6 times per day for 15 minutes per session. We couldn’t tell specifically who the patrons were, but we could tell when it was the same patron logging on 6 times versus when it was 6 distinct patrons each logging on for 15 minutes.
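The analysis described above amounts to counting sessions per (DeID, date) pair. A minimal sketch, with made-up DeIDs and dates:

```python
from collections import Counter

# De-identified Express sessions: one (DeID, date) pair per 15-minute login.
# The DeIDs and dates below are invented for illustration.
sessions = [("a1f3", "2016-05-02")] * 6 + [("9c77", "2016-05-02")]

per_patron_day = Counter(sessions)
# Any count above 1 means the *same* patron chained multiple sessions,
# even though we have no idea who that patron actually is.
chained = {key: n for key, n in per_patron_day.items() if n > 1}
print(chained)  # {('a1f3', '2016-05-02'): 6} -> one patron chained six sessions
```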
[Becky]
The new way of collecting data allowed us to determine that this problem was, indeed, rampant: over 25% of patrons used Express workstations more than 15 minutes per day. We made the decision to change the way our system handled public computer sessions. Now, patrons get 90 minutes per day on reservable public workstations plus only 15 minutes per day on Express workstations. This means that patrons qualify for a total of 105 minutes per day on our computers, but it was decided that this was a fairer and more equitable way to allocate time, given the abuses we were seeing of the Express workstations.
[Becky]
In summary, we are still in the early days of the Data Warehouse; however, we are certain that we are going in the right direction in terms of its architecture and our methodology to de-identify patron information. Storing different types of information in different locations, while avoiding keeping any PII-1 or PII-2 data, helps mitigate the risks of storing such data in house. At the same time, we retain enough detail that the data can be used to impact real-world operations. Finally, we have control over what we provide to third-party customer relationship management systems or other external systems.
[Becky] We hope to have shown here that there is not necessarily a clear-cut conflict between patron privacy and being able to make data-driven decisions. If we are clever about it, we can maintain the promises we’ve made to patrons about their privacy while keeping enough “blurry” data to get some value out of it. It doesn’t need to be a data-driven approach VERSUS patron privacy, it can be data-driven PLUS patron privacy.
[Author Comment - we moved this slide out of the deck in case we will not have room to talk about this historical issue; nonetheless, for those of you playing along at home, this is a reminder that we also have physical security vulnerabilities to address as well.]
In discussing this issue, we’ve also discovered that our historic practices are far from perfect.
In discussions with several security researchers in academic computer science, they quickly spotted serious flaws in our systems. One example is our holds system. A patron places a hold on an item. Usually we have a record of their intellectual pursuit for the three-week period of their checkout, but sometimes patrons wait for something on hold much longer than that. We’re holding a record of their intellectual inquiry that is vulnerable to law enforcement or government demands, sometimes for months at a time. Then we send them a plaintext email about that request and put the item on the shelf with a tag that only shows four letters of their last name and three of their first name. Of course, if you’re my colleague KAY AOKI, that’s your whole name. The item is on an open holds shelf that anyone can wander by.
One security researcher had an off-the-cuff idea to improve this process. He said we should give each request a QR code that tracks it and is required to pick it up. Then the request is associated with that code and NOT with a patron record. This would be very secure! The idea has a few flaws though. First, there would be no way to contact the patron to tell them that their item is ready because we don’t know who they are. They’d just have to check for it periodically. Second, the patron would have to NOT LOSE their QR code in order to claim the item. Obviously, both of these assumptions are unrealistic from a customer service point of view. Third, it’s a QR code.
Maybe this idea could be refined and become more practical, but we liked this story as an example of how real-world customer service and operations are sometimes incompatible with ideal solutions.