Data Mining for Libraries:
What are the Possibilities?
Elaine M. Lasda Bergman, MLS
Twitter: @ElaineLibrarian
elasdabergman@albany.edu
Subject Librarian for Social Welfare
University at Albany, SUNY
SUNYLA Midwinter Conference
January 30, 2015
What is Data Mining?
http://pixabay.com/en/helmet-mine-mining-headgear-155632/
Knowledge Discovery In Databases
(KDD)
Input data -> Data Preprocessing -> Data Mining -> Postprocessing -> Information
Adapted from Tan, et al. (2006), p.3
A note about data collection
• It’s the kicker: GIGO
• Cleaning
• Preprocessing
What is Weka?
http://www.cs.waikato.ac.nz/ml/weka/
Weka for Prediction
Mackenzie, Ian: https://www.flickr.com/photos/madmack/165933656/
Decision Tree From Weka
Likelihood of graduate students using library resources, based on survey questions

Did student use Email/IM reference?
  Yes: Did student receive instruction?
    0 sessions
    1-2 sessions: Time between grad/undergrad
      None: 45% yes
      1-5 years: 100% yes
      5+ years: 100% yes
    3+ sessions
  No: Student’s residency status
    On campus full time
    Off campus full time
    Part time
Weka for Classification
http://www.geograph.org.uk/photo/971476
Animal Clusters
Weka for Association Analysis
http://analytics-arena.blogspot.com/2012/12/the-famous-beer-diaper-planogram.html
Association Rules
(Anomaly Detection)
https://www.flickr.com/photos/fonalite/2780198933/
How Can Libraries Use Data Mining?
http://dlg.galileo.usg.edu/dahlonega/dahlonega_logo.jpg
Circling Back:
It All Starts With Data Collection
http://www.navigatingthetension.com/2012/02/circle-wagons.html
Questions?
Me:
Elaine Lasda Bergman, Subject Librarian for Social Welfare, University at Albany
email: elasdabergman@albany.edu
Twitter: @ElaineLibrarian
Resources used:
Tan, P. et al. (2006). Introduction to Data Mining. Boston: Pearson Education, Inc.
Newton, et al. (2012). Your Statistical Consultant: Answers to Your Data Analysis Questions. Thousand
Oaks: SAGE Publications.
Two good Weka Tutorials:
http://www.cs.ccsu.edu/~markov/weka-tutorial.pdf
http://www.uh.edu/~smiertsc/4397cis/WEKA_Data_Mining_Tool.pdf
Data Mining for the Masses:
https://rapidminer.com/wp-content/uploads/2013/10/DataMiningForTheMasses.pdf

Editor's Notes

  1. Data mining is a way to make sense of large datasets. It borrows theoretical underpinnings from statistics and computer science, allowing us to generate new insights and knowledge. It is useful in many capacities, and it is increasingly easy to generate models that predict, classify, and describe new information. The results can inform administrative decision making, help us understand user behavior, and identify appropriate resources and services to meet the needs of customers or library patrons. We don’t have time today to get into the nitty-gritty of how all of these algorithms and models are implemented, but I want to show you some of the possibilities that data mining techniques afford libraries and librarians.
  2. Data mining is part of a larger process known as “Knowledge Discovery in Databases,” or KDD. First, we have a dataset, or input data. Then we preprocess it, which means making it ready for analysis: converting the data into the proper format, throwing out incomplete records, and so on. Then comes the data mining itself, which we will focus on in a minute. After the data has been mined and analyzed and our conclusions identified, we create visualizations and present the results in a way that is understandable and makes sense. This is known as postprocessing. Finally, after all of these treatments, we have moved from raw data to usable, actionable information and new insights.
  3. Before I get into the meat and potatoes of data mining, I would like to spend a moment on the importance of data quality. It is of paramount importance to collect data that is as complete and accurate as possible, and then to scrutinize the dataset for errors and omissions, typographical errors, and formatting problems. An example is the value zero in a dataset. It could mean the mathematical concept of zero (numerical), no value entered (null), or even “No” to a yes/no question (non-mathematical, or nominal). We’ve heard the acronym GIGO for years: garbage in, garbage out. It is vital that we take the time to get rid of the garbage in the preprocessing and formatting stages of our knowledge discovery process. The data I am using for the library-related examples in this presentation was collected last fall through a user survey of graduate students at the University at Albany’s Downtown Campus.
  4. So, moving on to tools for data mining. For my current project with the grad student library user survey, I am using an open-source analytics tool known as Weka. It is free to download at the link on this slide. I will focus on this tool because it is the one with which I am most familiar, but there are similar tools, including RapidMiner and scikit-learn, and the statistical application R also has many data mining capabilities. A researcher with a reasonably solid background in statistics will find the basic functionality in Weka easy to grasp. I recommend the book Your Statistical Consultant as a reference, as well as Data Mining for the Masses. At the end of this presentation there are links to help you locate these books (Data Mining for the Masses is a free PDF; the other costs money). Next we will talk about the major types of data mining models and how they can be used. The main types of data mining tasks are prediction, classification, and association. My aim is to show you some possibilities and pique your interest to learn more!
  5. Prediction is exactly what it sounds like: we hope to reliably determine the value or outcome of one variable (known as the target variable) based on the values of other variables in the dataset (known as the explanatory variables). There are several ways to do predictive analysis, but the one I am going to show you today is a decision tree algorithm. A decision tree asks a series of questions in a hierarchical format, not unlike a flow chart. Decision trees are easy to interpret. They are resistant to data “noise,” which is a term for outliers, less relevant variables, and so forth. The tricky part with decision trees remains the structure and preprocessing of the data, and there is a risk of “overfitting” your model. Overfitting occurs when the decision tree achieves a high accuracy rate on your training set (the data used to build the model) but does not work as well on test data or new data. On the next slide, I will show you a decision tree I built from a training set to predict how likely a survey respondent was to answer that they frequently use library resources, based on their answers to certain demographic questions.
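To make the zero example concrete, here is a minimal Python sketch of that preprocessing step. The column names and the 0/1 coding are hypothetical, not taken from the actual survey codebook; the point is only that the same raw "0" is read differently depending on the kind of variable it belongs to.

```python
# Hypothetical codebook: which columns hold counts and which hold yes/no answers.
NUMERIC_COLUMNS = {"workshops_attended"}
YES_NO_COLUMNS = {"used_email_reference"}


def clean_value(column, raw):
    """Interpret one raw survey cell according to the type of its column."""
    text = raw.strip()
    if text == "":
        return None  # nothing was entered: a null, not a zero
    if column in NUMERIC_COLUMNS:
        return int(text)  # here "0" really is the number zero
    if column in YES_NO_COLUMNS:
        # here "0" is a nominal "no"; any other code raises KeyError on purpose
        return {"0": "no", "1": "yes"}[text]
    return text


def clean_row(row):
    """Clean every cell of one survey response."""
    return {column: clean_value(column, raw) for column, raw in row.items()}


raw_row = {"workshops_attended": "0", "used_email_reference": "0", "program": ""}
cleaned = clean_row(raw_row)
print(cleaned)
```

Failing loudly on an unexpected yes/no code is a deliberate choice in this sketch: a silent default would let garbage through to the mining stage.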
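One common way for a decision tree to choose which question to ask first is information gain: pick the variable whose answers leave the target least mixed. The sketch below illustrates that general idea in plain Python. It is not a description of the specific algorithm or settings used in Weka for this survey, and the rows and variable names are invented for illustration.

```python
from collections import Counter
from math import log2

TARGET = "uses_library"

# Invented toy rows, not the real survey data.
ROWS = [
    {"used_email_reference": "yes", "residency": "on campus", "uses_library": "yes"},
    {"used_email_reference": "yes", "residency": "off campus", "uses_library": "yes"},
    {"used_email_reference": "yes", "residency": "part time", "uses_library": "yes"},
    {"used_email_reference": "yes", "residency": "on campus", "uses_library": "yes"},
    {"used_email_reference": "no", "residency": "off campus", "uses_library": "no"},
    {"used_email_reference": "no", "residency": "part time", "uses_library": "no"},
    {"used_email_reference": "no", "residency": "on campus", "uses_library": "yes"},
    {"used_email_reference": "no", "residency": "off campus", "uses_library": "no"},
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target=TARGET):
    """Drop in target entropy after splitting the rows on one attribute."""
    before = entropy([row[target] for row in rows])
    after = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after


gains = {a: information_gain(ROWS, a) for a in ("used_email_reference", "residency")}
best_question = max(gains, key=gains.get)
print(best_question, gains)
```

A tree grown this way keeps splitting until the branches are pure, which is exactly how it can end up memorizing the training set; holding back some rows as test data is the usual check for that.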
  6. Ok, so this is what my decision tree looks like. I have a simplified version of a few of the branches on the next slide so you can see how this works.
  7. Here is a piece of the decision tree made prettier (sort of) with Microsoft Shapes. Hopefully it will make more sense to you. What we are looking at are the questions the decision tree asks in order to predict the likelihood that a student at my library uses library resources. The first question it asks is: did the student use email or instant message reference? There is a branch for “YES” and a branch for “NO.” Let’s follow the right side for a minute. If they did NOT use email or IM reference, the next question is about the student’s residency and full time/part time status, and there are 3 options for this variable. Be aware that each of those options has more branches below it in the real tree I showed you on the last slide, so the probabilities are not shown here. Going back up to the top, let’s follow what happens if they answered “YES” to using electronic reference. The next question the tree asks is: did the student attend any library workshops? We have the values none, 1-2 sessions, and 3 or more sessions. If the student took one or two sessions, the next criterion that matters is how much time the student took between undergrad and graduate studies. I should add that the other numbers of sessions also have lower branches; we are simplifying by following the shortest trail of branches. The interesting feature of the time between grad and undergrad is that those who took any sort of break are practically guaranteed to use library resources, provided they received instruction and used electronic reference. Those who did not take a break were less likely to, despite these library interventions. Hmmmm….
  8. Classification is a way of identifying similarities or patterns in a dataset based on comparable variable attributes in each case. There are a number of ways to do this, but I would like to show you clustering. Clustering is a very visual way of determining patterns in your data. Cases with lots of similar values in their variables are grouped closer together, and those with different values are grouped farther apart. This means you can inspect the clusters and determine which values the cases in each one share. The shared values give you the pattern of each cluster, which in turn is a way of classifying your data. Unfortunately, my own data did not respond well to clustering, which I will discuss in a minute. For now I will show you a classic clustering example from zoology: grouping animals by type based on their physical characteristics.
  9. I know this is hard to see, but there is a purple cluster, a blue cluster, a brown cluster, a yellow cluster, and a green cluster. Each of these represents a grouping the clustering algorithm determined based on characteristics of the animals. For example, worms and snakes have no legs and lay eggs, whereas seals, porpoises, and dolphins are aquatic mammals. The tricky part of cluster analysis is that, unlike the decision tree, it IS very sensitive to “noise” in your data. Notice that the platypus, which is a mammal, is grouped with the turtle type of creatures. As mentioned, clustering the graduate student survey data was not particularly successful in my case. This is because, first, my dataset is probably too small and, second, I may have asked the wrong questions or combinations of questions to generate clusters. One thing I intend to do is go back to the preprocessing stage and see if there are ways to group responses to variables that reduce data “noise” and give us some sort of pattern.
  10. Association can be used for classification purposes as well. Association, however, is based on “rules” rather than “clusters.” Association rules are if…then rules that show patterns of association between variables. This allows for complex comparisons and generates some interesting associations. The famous example of association rules is the urban myth that, thanks to what is known as “market basket analysis,” Walmart (or whatever big box store) puts its beer and diapers in the same aisle. The rule would go: if customers buy diapers, then they are likely to also purchase beer. The myth goes that this is because young husbands get sent out to buy diapers and pick up some beer for themselves while they are out. Similar co-occurrence ideas underlie the recommendations you see from Netflix and Amazon. Association rules are easy to interpret and describe, and they handle skewed data very well (for example, my survey respondents were 70% women and 30% men).
  11. Here is what the association rules look like in Weka. I understand that the variable names and values are not particularly descriptive on this screenshot; you need my survey “codebook” to explain what the variables mean and what each value signifies. This run of my association rules algorithm shows that if students are “somewhat” confident in finding the information they need (confidence=4), they are likely off campus, full time students (residency=2). This is interesting because the survey gave 4 options: extremely confident, very confident, somewhat confident, and not confident at all. Zero respondents indicated that they were not confident at all. So our least confident students are “somewhat confident,” and those least confident students are most frequently full time commuters, as opposed to part timers or full time on campus students. Hmmmm…..
  12. There is actually a fourth data mining task, known as anomaly detection. It is the opposite of something like classification or prediction: it identifies the outliers which DON’T fit your model. Practical uses for anomaly detection include detecting credit card fraud (your credit card was just used in Bali, and you are in Rochester) and email spam filtering. I don’t have a good example of this one, because the goal of my survey was to look for patterns and trends, and also because it works best with so-called “Big Data,” but you can see where this is a useful application in a business context.
  13. So what are some things we can do with data mining techniques to improve user services, work processes, and administrative decisions? Like me, you could take a user survey and use the data to predict and associate certain characteristics with library resource use (and try to classify!). You could try to determine the likelihood that a book will go missing, using various circulation statistics as the explanatory variables (times circulated, publication year, call number range, etc.). You could cluster the patterns of library use for subject groupings (call number ranges) using explanatory variables such as counts of interlibrary loans, purchases on demand, book circulation, and journal article downloads. Or you could determine which academic majors or faculty departments are most associated with the use of various services (reference, borrowing a Kindle, checking out more than 5 books a semester).
  14. I hope this talk has given you some ideas about what we can learn from mining library data for interesting insights, patterns, and information. Some things to consider: all of this interesting analysis is predicated on GOOD DATA, and getting “good” data may be more challenging than the analysis itself. First, privacy concerns may keep us from mining data about our library users; for example, we don’t make circulation data available. But such data in the aggregate can be used to great effect, provided extreme care is taken to protect our users’ identities. Second, we may not currently be collecting the data we need to appropriately tell our stories; we may have to change what information we collect and how we collect it to get the “good stuff.” Do a data inventory of your library! What is missing to help you achieve your strategic goals? And even if you have data that you’re ready to mine, you may not be ready to do the mining yourself. But now that you know what possibilities exist, why not ask around on campus for help? Computer science and statistics students may welcome the opportunity.
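The walk through the simplified tree can be sketched as a small data structure plus a loop. The percentages are the ones shown on the slide; the variable names are hypothetical stand-ins for the survey codebook, and branches whose lower levels were left off the slide are marked None because no probability is given for them.

```python
# Simplified tree from the slide. Leaves are the share of respondents who
# said "yes" to using library resources; None means "not shown on the slide".
TREE = {
    "question": "used_email_or_im_reference",
    "branches": {
        "yes": {
            "question": "instruction_sessions",
            "branches": {
                "none": None,
                "1-2": {
                    "question": "gap_between_undergrad_and_grad",
                    "branches": {"none": 0.45, "1-5 years": 1.00, "5+ years": 1.00},
                },
                "3+": None,
            },
        },
        "no": {
            "question": "residency_status",
            "branches": {
                "on campus full time": None,
                "off campus full time": None,
                "part time": None,
            },
        },
    },
}


def predict(node, answers):
    """Follow one respondent's answers down the tree until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][answers[node["question"]]]
    return node


student = {
    "used_email_or_im_reference": "yes",
    "instruction_sessions": "1-2",
    "gap_between_undergrad_and_grad": "none",
}
print(predict(TREE, student))
```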
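To show what “grouped closer together” means, here is a minimal sketch that measures how many characteristics two animals differ on and assigns each animal to the nearest of three hand-picked seed animals. This is only the assignment step that many clustering algorithms repeat, not a full clustering algorithm and not what Weka ran for the slide; the four features and their values are simplified assumptions.

```python
# Simplified, hand-coded features: (has_legs, lays_eggs, gives_milk, has_fins)
ANIMALS = {
    "worm": (0, 1, 0, 0),
    "snake": (0, 1, 0, 0),
    "seal": (0, 0, 1, 1),
    "dolphin": (0, 0, 1, 1),
    "tortoise": (1, 1, 0, 0),
    "platypus": (1, 1, 1, 0),
}

# Hand-picked starting points, one per expected group (an assumption).
SEEDS = ("snake", "dolphin", "tortoise")


def hamming(a, b):
    """Number of features on which two animals differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_to_seeds(animals, seeds):
    """Put each animal in the group of the seed it differs from least."""
    clusters = {seed: [] for seed in seeds}
    for name, features in animals.items():
        nearest = min(seeds, key=lambda seed: hamming(features, animals[seed]))
        clusters[nearest].append(name)
    return clusters


clusters = assign_to_seeds(ANIMALS, SEEDS)
print(clusters)
```

With these features the platypus differs from the tortoise on one characteristic and from the dolphin on three, so it lands with the egg-layers even though it is a mammal, which mirrors the behavior described in the note.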
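An if…then rule like the diapers-and-beer one is usually judged by two numbers: support (how often the items appear together) and confidence (how often the “then” part holds when the “if” part does). Here is a minimal sketch with invented shopping baskets; the data is made up for illustration.

```python
# Invented shopping baskets, one set of items per customer.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(baskets, items):
    """Share of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


rule_support = support(BASKETS, {"diapers", "beer"})
rule_confidence = confidence(BASKETS, {"diapers"}, {"beer"})
print(f"if diapers then beer: support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Rule confidence in this sense is a statistic of the rule and should not be confused with a survey variable that happens to be named “confidence.”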
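A very simple form of anomaly detection flags values that sit far from the typical value. The sketch below uses the median and the median absolute deviation, one robust option among many; the daily checkout counts are invented, and the 3.5 cutoff is a conventional default that is assumed here rather than tuned.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return the values whose robust z-score exceeds the threshold."""
    center = median(values)
    mad = median([abs(v - center) for v in values])  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against; this sketch simply gives up
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


# Invented daily checkout counts with one suspicious day.
daily_checkouts = [12, 15, 11, 14, 13, 90, 12]
print(find_outliers(daily_checkouts))
```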