Data mining is a way to make sense of large datasets. It borrows its theoretical underpinnings from statistics as well as computer science, allowing us to generate new insights and knowledge. Data mining is useful in many capacities, and it is increasingly easy to build models that predict, classify, and describe new information. The results of data mining analytics can be used in administrative decision making, in understanding user behavior, and in identifying the resources and services that best meet the needs of customers or library patrons. We don’t have time today to get into the nitty-gritty of how all of these algorithms and models are implemented, but I want to show you some of the possibilities that data mining techniques afford libraries and librarians.
Data mining is part of a larger process known as “Knowledge Discovery in Databases,” or KDD. Essentially, we start with a dataset, or input data. Then we preprocess it, which means making it ready for analysis. That could mean converting the data into the proper format for analysis, throwing out incomplete records, and correcting errors. Then comes the data mining itself, which we will focus on in a minute. After the data has been mined and analyzed and our conclusions identified, we create visualizations and present the results in a way that is understandable and makes sense; this is known as postprocessing. Finally, after all of these treatments, we have moved from raw data to usable, actionable information and new insights.
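To make the preprocessing step concrete, here is a minimal sketch in Python. The field names, records, and the cleaning rules are invented for illustration; they are not the actual survey data or any Weka routine.

```python
def preprocess(records):
    """Drop incomplete records and normalize the 'status' field."""
    cleaned = []
    for rec in records:
        # Throw out incomplete data: skip any record with a missing field.
        if any(value is None for value in rec.values()):
            continue
        # Convert to a consistent format for analysis.
        rec = dict(rec)  # copy so we don't mutate the caller's data
        rec["status"] = rec["status"].strip().lower()
        cleaned.append(rec)
    return cleaned

# Invented example records (not the real survey).
raw = [
    {"id": 1, "status": " Full-Time ", "visits": 12},
    {"id": 2, "status": None,          "visits": 3},   # incomplete: dropped
    {"id": 3, "status": "part-time",   "visits": 7},
]
cleaned = preprocess(raw)
```

After this step, the two surviving records share one consistent `status` format, ready for the mining stage.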
Before I get into the meat and potatoes of data mining, I would like to spend a moment on the importance of data quality. It is of paramount importance not only to collect data that is as complete and accurate as possible, but also to scrutinize the dataset for errors and omissions, typographical errors, and formatting problems. Take, for example, the value zero in a dataset. It could mean the mathematical concept of zero (numerical), no value entered (null), or even “No” to a yes/no type of question (non-mathematical, or nominal). We’ve heard for years the acronym GIGO: garbage in, garbage out. It is vital that we take the time to get rid of the garbage in the preprocessing and formatting stages of our knowledge discovery process. The data I am using for the library-related examples in this presentation come from a user survey of graduate students conducted last fall at the University at Albany’s Downtown Campus.
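Those three possible meanings of “zero” can be disambiguated explicitly during preprocessing. Here is a hedged sketch in Python, assuming a simple made-up codebook in which raw cells arrive as strings, an empty string means nothing was entered, and each question is tagged as numeric or yes/no:

```python
def decode(raw, kind):
    """Decode one raw survey cell according to the question's kind.

    kind='numeric': "0" is the mathematical value zero.
    kind='yesno':   "0" is the nominal answer "no", not a number.
    An empty string means no value was entered (null/missing).
    (The coding scheme here is an assumption for illustration.)
    """
    if raw == "":
        return None                          # null: nothing entered
    if kind == "numeric":
        return int(raw)                      # a real count, zero included
    if kind == "yesno":
        return "yes" if raw == "1" else "no" # nominal, not mathematical
    raise ValueError("unknown question kind: " + kind)
```

The same character, “0”, comes out of this step as three different things depending on the question, which is exactly the distinction GIGO demands we make before mining.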
So, moving on to tools for data mining. For my current project with the grad student library user survey, I am using an open-source analytics tool known as Weka, which is free to download at the link above. I will focus on this tool because it is the one with which I am most familiar, but there are other similar tools, including RapidMiner and scikit-learn, and the statistical application R also has many data mining capabilities. A researcher with a reasonably solid background in statistics will find Weka’s basic functionality easy to grasp. I recommend the book Your Statistical Consultant as a reference, as well as Data Mining for the Masses. At the end of this presentation, there will be links to help you locate these books (Data Mining for the Masses is a free PDF; the other costs money). Next we will talk about the major types of data mining tasks and how they can be used. The main types are: prediction, classification, and association. My aim is to show you some possibilities and pique your interest to learn more!
Prediction is exactly what it sounds like: we hope to reliably determine the value or outcome of one variable (known as the target variable) based on the values of other variables in the dataset (known as the explanatory variables). There are several ways to do predictive analysis, but the one I am going to show you today is known as a decision tree algorithm. A decision tree asks a series of questions in a hierarchical format, not unlike a flow chart. Decision trees are easy to interpret, and they are resistant to data “noise,” which is a term for outliers, less relevant variables, and so forth. The tricky parts with decision trees remain the structure and preprocessing of the data, and there is a risk of “overfitting” your model. Overfitting occurs when the decision tree algorithm achieves a high accuracy rate on your training set (or model) but does not work as well on test data or new data. On the next slide, I will show you a decision tree from a training set I built to predict the likelihood that a survey respondent answered that they frequently use library resources, based on answers to certain demographic questions.
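Overfitting itself is easy to demonstrate with a toy. The sketch below (plain Python, with invented rows, not my survey data and not how Weka’s tree algorithm actually works) “trains” by memorizing its training rows verbatim, so it is perfectly accurate on data it has already seen and useless on data it has not, which is the overfitting pattern in its purest form:

```python
def fit_memorizer(rows, labels):
    """'Train' by memorizing every (row -> label) pair verbatim."""
    return dict(zip(map(tuple, rows), labels))

def predict(model, rows, default="no"):
    """Unseen rows fall back to a blind default answer."""
    return [model.get(tuple(r), default) for r in rows]

def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Invented toy rows: (used e-reference, workshops attended).
train_rows = [(1, 0), (0, 1), (1, 1)]
train_labels = ["yes", "no", "yes"]
test_rows = [(0, 0), (1, 2)]          # never seen during training
test_labels = ["yes", "yes"]

model = fit_memorizer(train_rows, train_labels)
train_acc = accuracy(predict(model, train_rows), train_labels)
test_acc = accuracy(predict(model, test_rows), test_labels)
```

Training accuracy comes out perfect while test accuracy collapses; a well-fit model keeps those two numbers close, which is why we evaluate on held-out test data.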
Ok, so this is what my decision tree looks like. I have a simplified version of a few of the branches on the next slide so you can see how this works.
Here is a piece of the decision tree made prettier (sort of) with Microsoft Shapes. Hopefully it will make more sense to you. What we are looking at are the questions the decision tree asks in order to predict how likely a student at my library is to use library resources. The first question it asks is: did the student use email or instant message reference? There is a branch for “YES” and a branch for “NO.” Let’s follow the right side for a minute. If they did NOT use email or IM reference, the next question it asks is about the student’s residency and full-time/part-time status, and there are three options for this variable. Be aware that each of those options has more branches below it in the real tree I showed you on the last slide, so the probabilities are not calculated here. Going back up to the top, let’s follow what happens if they answered “YES” to using electronic reference. The next question the tree asks is: did the student attend any library workshops? The possible values are none, 1–2 sessions, and 3 or more sessions. If the student took one or two sessions, the next criterion that matters is how much time the student took between undergrad and graduate studies. I should add that the other numbers of sessions also have lower branches; we are simplifying by following the shortest trail of branches. The interesting feature of the time between undergrad and grad school is that those who took any sort of break are practically guaranteed to use library resources, provided they received instruction and used electronic reference. Those who did not take a break were less likely to, despite these library interventions. Hmmmm….
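For readers who think in code, the simplified branches narrated above can be written as plain if/else rules, which is essentially all a fitted decision tree is. Only the single path followed in the talk is encoded here; the deeper branches of the real Weka tree are stubbed out, and the function name and value codes are my own shorthand, not Weka output:

```python
def predict_frequent_use(used_e_reference, workshops, gap_before_grad_school):
    """Walk the simplified decision-tree path from the talk.

    used_e_reference: True if the student used email/IM reference.
    workshops: "none", "1-2", or "3+" library instruction sessions.
    gap_before_grad_school: True if the student took a break between
    undergrad and graduate study.
    """
    if not used_e_reference:
        # Real tree branches on residency / full-time vs part-time here.
        return "see full tree"
    if workshops != "1-2":
        # The other session counts also have lower branches.
        return "see full tree"
    # With instruction and e-reference use, any break between degrees
    # practically guaranteed frequent library resource use.
    return "frequent" if gap_before_grad_school else "less likely"
```

Reading the nested `if`s top to bottom reproduces exactly the flow-chart walk we just did on the slide.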
Classification is a way of identifying similarities or patterns in a dataset based on comparable variable attributes in each case. There are a number of ways to do this, but I would like to show you clustering. Clustering is a very visual way of finding patterns in your data: cases with many similar variable values are grouped close together, while those with different values are grouped farther apart. This means you can inspect the clusters and determine which values the cases in each one share. Those shared values give you the pattern of each cluster, which in turn is a way of classifying your data. Unfortunately, my own data did not respond well to clustering, which I will discuss in a minute. For now I will show you a classic clustering example from zoology: grouping animals into types based on the physical characteristics of each creature.
I know this is hard to see, but there is a purple cluster, a blue cluster, a brown cluster, a yellow cluster, and a green cluster. Each of these represents a grouping the clustering algorithm determined based on the animals’ characteristics. For example, worms and snakes have no legs and lay eggs, whereas seals, porpoises, and dolphins are aquatic mammals. The tricky part of cluster analysis is that, unlike the decision tree, it IS very sensitive to “noise” in your data; notice that the platypus, which is a mammal, is clustered with the turtle-like creatures. As mentioned, in my case clustering the graduate student survey data was not particularly successful. That is because, first, my dataset is probably too small, and second, I may have asked the wrong questions, or combinations of questions, to generate clusters. One thing I intend to do is go back to the preprocessing stage and see whether there are ways to group responses that reduce the data “noise” and give us some sort of pattern.
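To show what a clustering algorithm actually computes, here is a minimal k-means sketch in plain Python, run on a handful of invented animal features. This is a simplified stand-in for what Weka does, not the real zoo dataset, and the deterministic start (first k points become the initial centroids) is a choice made so the toy is reproducible:

```python
def kmeans(points, k, iters=10):
    """Minimal k-means: deterministic init, then assign/update rounds."""
    centroids = [list(points[i]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centroids[c])))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels

# Invented features: legs, lays eggs, aquatic, has hair.
animals = {
    "worm":    [0, 1, 0, 0],
    "seal":    [0, 0, 1, 1],
    "snake":   [0, 1, 0, 0],
    "dolphin": [0, 0, 1, 1],
    "chicken": [2, 1, 0, 0],
}
names = list(animals)
cluster_of = dict(zip(names, kmeans([animals[n] for n in names], k=2)))
```

The legless egg-layers end up in one cluster and the aquatic mammals in the other, with no labels ever supplied, which is the whole point of clustering as an unsupervised technique.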
Association can be used for classification purposes as well. Association, however, is based on “rules” rather than “clusters.” Association rules are if…then rules that show patterns of association between variables. This allows for complex comparisons and generates some interesting associations. The famous example of association rules is the urban myth that, thanks to what is known as “market basket analysis,” Walmart (or whatever big box store) puts its beer and diapers in the same aisle. The rule would go: if customers buy diapers, then they are likely to also purchase beer. The myth goes that this is because the young husbands get sent out to buy diapers and pick up some beer for themselves while they are out. Association rules are also how Netflix and Amazon determine what to recommend to you. Association rules are easy to interpret and describe, and they handle skewed data very well (for example, my survey respondents were 70% women and 30% men).
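Under the hood, an if…then rule boils down to two numbers: support (how often the “if” and “then” items appear together in the data) and confidence (how often the “then” holds when the “if” does). A minimal sketch in Python, using invented market baskets rather than any real store’s data:

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support and confidence for the rule: antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions
               if antecedent <= t and consequent <= t)  # <= is subset test
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence

# Invented baskets for the diapers-and-beer story.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]
support, confidence = rule_metrics(baskets, {"diapers"}, {"beer"})
```

Here “diapers and beer together” occurs in 2 of 5 baskets (support 0.4), and 2 of the 3 diaper baskets also contain beer (confidence about 0.67); mining tools like Weka’s Apriori simply search for all rules whose support and confidence clear a threshold.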
Here is what the association rules look like in Weka. I understand that the variable names and values are not particularly descriptive in this screenshot; you need my survey “codebook” to explain what the variables mean and what each value signifies. This run of my association rules algorithm shows that if students are “somewhat” confident in finding the information they need (confidence=4), they are likely off-campus, full-time students (residency=2). This is interesting because the survey gave four options: extremely confident, very confident, somewhat confident, and not confident at all. Zero respondents indicated that they were not confident at all. So our least confident students are the “somewhat confident” ones, and they are most frequently full-time commuters, as opposed to part-timers or full-time on-campus students. Hmmmm…..
There is, actually, a fourth data mining task, known as anomaly detection. This is the opposite of something like classification or prediction: it identifies the outliers which DON’T fit your model. Practical uses for anomaly detection are detecting credit card fraud (your credit card was just used in Bali, and you are in Rochester) and email spam filtering. I don’t have a good example of this one, because the goal of my survey was to look for patterns and trends, and also because anomaly detection works best with so-called “Big Data,” but you can see how this is a useful application in a business context.
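One simple, classical way to flag anomalies is a z-score test: anything more than a chosen number of standard deviations from the mean is suspicious. A minimal sketch in Python with invented credit card charges (real fraud systems are far more sophisticated; note also that with a tiny sample the threshold must be modest, because a single extreme point drags the mean and standard deviation toward itself):

```python
def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` std devs from the mean."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    if std == 0:
        return []  # all values identical: nothing can be anomalous
    return [v for v in values if abs(v - mean) / std > threshold]

# Invented daily card charges: five ordinary ones and one suspicious spike.
charges = [20, 25, 22, 30, 18, 2400]
flagged = zscore_outliers(charges)
```

The $2,400 charge is the only value flagged; everything else sits well within two standard deviations of the mean.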
So what are some things we can do with data mining techniques to provide better user services, work processes, and administrative decisions?
Like me, you could take a user survey and use the data to predict and associate certain characteristics with library resource use (and try to classify!)
You could try to determine the likelihood that a book will go missing by considering various circ stats as the explanatory variables (times circulated, publication year, call number range, etc.)
Cluster the patterns of library use across subject groupings (call number ranges) by explanatory variables such as counts of interlibrary loans, purchases on demand, book circulations, and journal article downloads
Determine which academic majors or faculty departments are most associated with the use of various services (reference, borrowing a Kindle, checking out more than 5 books a semester)
I hope this talk has given you some ideas about the possibilities of what we can learn from mining library data for interesting insights, patterns, and information. Some things to consider: all of this interesting analysis is predicated on GOOD DATA, and getting “good” data may be more challenging than the analysis itself. Privacy concerns may keep us from mining data about our library users; for example, we don’t make circulation data available. But such data in the aggregate can be used to great effect, provided extreme care is taken to protect our users’ identities. Second, we may not currently be collecting the data we need to appropriately tell our stories; we may have to change what information we collect, and how we collect it, to get the “good stuff.” Do a data inventory of your library! What is missing to help you achieve your strategic goals? And even if you have data that you’re ready to mine, you may not be ready to do the mining yourself. Now that you know what possibilities exist, why not ask around on campus for help? Computer science and statistics students may welcome the opportunity.
Data Mining for Libraries:
What are the Possibilities?
Elaine M. Lasda Bergman, MLS
Subject Librarian for Social Welfare
University at Albany, SUNY
SUNYLA Midwinter Conference
January 30, 2015
What is Data Mining?
Knowledge Discovery In Databases
Input Data → Preprocessing → Data Mining → Postprocessing → Information
Adapted from Tan, et al. (2006), p.3
A note about data collection
• It’s the kicker: GIGO
What is Weka?
Weka for Prediction
Mackenzie, Ian: https://www.flickr.com/photos/madmack/165933656/
On campus full time / Off campus full time / Part time
Likelihood of a graduate student using library resources, based on survey questions
Weka for Classification
How Can Libraries Use Data Mining?
It All Starts With Data Collection
Tan, P., et al. (2006). Introduction to Data Mining. Boston: Pearson Education.
Newton, et al. (2012). Your Statistical Consultant: Answers to Your Data Analysis Questions. Thousand Oaks: SAGE Publications.
Two good Weka Tutorials:
Data Mining for the Masses: