Book Recommendation System using Data Mining for the University of Hong Kong Libraries

RAJAGOPAL, Sandhya (Faculty of Education, University of Hong Kong)
http://citers2012.cite.hku.hk/en/paper_528.htm

  • Ladies & Gentlemen, good afternoon and welcome. My name is Sandhya Rajagopal, and it gives me great pleasure to present the design of a book recommendation system based on data from the HKU Libraries.
  • In the next 15 minutes I will describe the system, starting with the background and motivation for the study and the research questions that were probed. A description of the topics relevant to the research follows in the Literature Review section. Next I will provide an overview of the logical 4-step process that was used as the Research Methodology. Lastly, the strengths of the system and the areas that require further research will be presented.
  • Academic research & information search: In the current competitive academic landscape, access to information is critical for conducting high-quality research. Internet search engines increasingly influence the way learners seek and obtain information. Research shows that students have a distinct preference for seeking information on the Internet, so much so that the word “Googling” is used synonymously with “searching”, despite the fact that the World Wide Web is often an unreliable source of information. Ease of use is the most common reason the Internet is preferred to library OPACs. Resources are better organized in library OPACs, and subject headings provide a proven means of efficient information retrieval. Yet subject searches remain largely underutilized, because patrons lack knowledge of their proper use and commonly feel that special skills training is required. Few studies capitalize on systematic subject heading organization, even though it has high potential to improve access to selective, high-quality library resources; this gap forms the fundamental basis for this project. One method of providing better access is to personalize retrievals and make them pertinent to the learner. Automating such personalization removes the need for information seekers to be skilful at constructing search strings, and it also allows better utilization of library resources by presenting results which might otherwise be left out. Recommender systems, which serve as personal advisors and generate suggestions based on the profile of a user, are a viable solution that offers the dual benefits of personalization and capitalizing on the subject heading organization of OPACs.
  • As a viable means to this end, the study asked two specific questions. The first concerned formulating the best method of extracting meaningful user profiles from existing library data. Second, since Data Mining is one of the most suitable methods for pattern recognition, the feasibility of applying these techniques to the design of a system that also capitalizes on subject searching formed the critical basis of the research.
  • A four-step process was used as the Research Methodology. Each of the 4 steps was explored in detail, and these will be presented a little later. The study had two outcomes: (a) meaningful user profiles can be generated from the existing library catalog and circulation data, and Data Mining can be used to design a Recommendation System that makes personalized suggestions based on the subject heading organization of OPACs; (b) construction of the design is a feasible option for resolving issues with information searches.
  • The primary areas of academic research relevant to this study were: -> Recommendation System algorithms: their definition, their functions and the two primary types that are currently most prevalent -> what Data Warehousing entails and what a DWH architecture is -> Data Mining techniques, mechanisms and the functionality associated with each -> what the knowledge discovery process is and how Data Mining can be incorporated into it -> how Data Mining can be incorporated into Recommendation System designs
  • -> A Recommendation System makes personalized suggestions to users of a variety of computer applications, based on their past preferences and/or on the characteristics of the items of interest to them. -> Its primary function is to serve as the user’s exclusive advisor, capable of providing pertinent recommendations within a wide range of parameters that can be specified by the user. This in turn improves the quality of decisions made on the basis of the recommendations. Because the system focuses on relevance to a narrow field of interest, there is a high likelihood of unanticipated and new knowledge discoveries.
  • The 2 most common types of Recommendation System algorithms are Collaborative Filtering (CF) and content-based algorithms. In CF, the Recommendation System matches the preferences of each patron to those of others who have similar preferences in activities such as reading, buying, dating, etc. Amazon.com uses a customized version of this method. The key inputs to this type of system are community data, such as ratings and views, and a user profile created from the usage history of a patron. The key question here is “tell me what’s popular among my peers”. In the content-based method, suggestions are made based on the characteristics of the items of interest, such as the author, genre and title of a book, and on user profiles which hold each user’s interests, preferences, usage history, etc. The key task in this case is to identify those items that best match a user’s preferences. Whatever the algorithm, the critical element in a Recommendation System is the recommendation component, which is instrumental in generating the final output to the user. (Why a content-based algorithm is most suitable: page 17)
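To make the content-based idea concrete, here is a minimal sketch, not the method used in the study. It assumes a user profile is simply a set of subject headings and scores each catalog item by the Jaccard overlap between its headings and the profile. The function names, item IDs and headings are all hypothetical.

```python
def jaccard(a, b):
    """Overlap between two sets of subject headings, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def content_based_recommend(profile, catalog, borrowed, top_n=3):
    """Rank items the patron has not borrowed by similarity to the profile."""
    scored = [
        (jaccard(profile, subjects), item_id)
        for item_id, subjects in catalog.items()
        if item_id not in borrowed
    ]
    # Drop items with no overlap; sort by score (descending), then item ID.
    scored = [(score, item_id) for score, item_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:top_n]]


# Hypothetical catalog: item ID -> subject headings.
catalog = {
    "B1": {"Data mining", "Machine learning"},
    "B2": {"Data mining", "Library science"},
    "B3": {"Cooking"},
    "B4": {"Library science", "Information retrieval"},
}
profile = {"Data mining", "Library science"}
recommendations = content_based_recommend(profile, catalog, borrowed={"B2"})
print(recommendations)
```

A collaborative-filtering variant would instead compare this patron's borrowing history with that of other patrons, which needs community data rather than item characteristics.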
  • Data warehousing can be thought of as a specific manner of storing data, and the repository which holds this data is called a DWH. Unlike operational databases, it is also associated with particular tools and techniques for data analyses that support strategic decision-making. (Appendix (ii), page 62) Starting with a skeletal structure consisting of the Source layer, the DWH layer and the Analysis component, the architecture of a DWH can be a 2-layered design, by introducing the Data staging layer, or a 3-layered one, by inserting 2 more layers: the Reconciled layer and the Loading stage. The Source layer is the combined input from operational and external databases, which forms the primary input to the DWH. In the Data staging stage, raw data is pre-processed using Extraction-Transformation-Loading (ETL) tools to cull out data essential for storage in the DWH. Data reconciliation may be required in some systems; this involves performing data integrity and consistency checks, error correction, establishing currency, etc. Such reconciled data is then loaded to establish a primary DWH. The Analysis component may consist of one or more analysis tools for reporting, Structured Query Language (SQL) querying, Online Analytical Processing (OLAP), Data Mining, what-if analysis, etc.
  • One way to define Data Mining is as “an automated data exploration & analysis process that uncovers meaningful patterns & rules”, the operative words being exploration, automated and meaningful patterns. To employ Data Mining techniques effectively, it helps to understand that it is an exploratory process seeking answers to often ill-defined questions; that it is necessarily an automated process, relying on machine learning principles to analyze large amounts of data; and that its goal is to uncover hidden knowledge with potential for significant influence on problem-solving and decision-making, by projecting meaningful data patterns. Data Mining serves many functional purposes. When used as a descriptive tool, it helps researchers understand underlying patterns, trends and behaviors. The function of classification is to categorize items into pre-defined classes that have been constructed from an analysis of existing data items, called a training set. The Data Mining task is then to build a model to classify new, previously unclassified data. Estimation is similar to classification, except that the target variables are numeric rather than categorical. Models are built to represent items in the training set, which provide both the target variable and the predictors. Estimates of the target variable for new items are then made from the values of their predictors. Prediction is similar to classification and estimation and differs only in that its results lie in the future and are not immediately verifiable. In clustering, items with similar characteristics are grouped together; these algorithms aim to partition the entire set into homogeneous subgroups, or clusters, ensuring that the similarity of records within a cluster is maximized and the similarity to records outside the cluster is minimized. Also known as ‘affinity analysis’ or ‘market-basket analysis’, the association function aims to determine which characteristics go together and to define numerical rules relating 2 or more attributes. Such rules are often expressed along with measures of confidence and support, as an estimate of the credibility that can be attached to applying a rule.
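As a small worked example of the support and confidence measures mentioned for association rules, the sketch below uses the standard definitions: the support of an itemset is the fraction of transactions containing it, and the confidence of a rule X -> Y is support(X and Y together) divided by support(X). The loan records are hypothetical and are not HKUL data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every element of itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also
    contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base else 0.0


# Hypothetical loans: each set holds the subject headings borrowed together.
loans = [
    {"Statistics", "Data mining"},
    {"Statistics", "Data mining", "Databases"},
    {"Statistics"},
    {"Databases"},
]

# Rule: patrons who borrow "Statistics" also borrow "Data mining".
rule_support = support(loans, {"Statistics", "Data mining"})
rule_confidence = confidence(loans, {"Statistics"}, {"Data mining"})
print(rule_support, round(rule_confidence, 3))
```

Here the rule holds in 2 of the 4 loans (support 0.5), and in 2 of the 3 loans that contain "Statistics" (confidence about 0.667).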
  • The diagram in this slide shows the Data Mining algorithms that are commonly used for each of the functions they serve. One or more of these methods can be applied, depending on the expected output of the system.
  • According to one interpretation of the process of Knowledge Discovery, the stages between the assimilation of raw data and the final discovery of knowledge can be thought of as Selection, Preprocessing, Transformation, Data Mining and Interpretation or Explanation. The recurrent feedback loop from each stage builds in the quality of the data processed. As is apparent, Data Mining forms a critical part of this process. … And Knowledge Discovery forms the basis of the design of the Recommendation System in this study.
  • The study involved a 4-step process, namely: Systems Analysis of HKUL’s Innopac, Data Warehouse Design, Application of Data Mining, and the Recommendation System Model. The steps were conducted sequentially, moving from one stage to the next after thorough analysis.
  • For its information services, HKUL uses several modules of Innopac, the commercial information system from Innovative Interfaces. The Innopac products that HKUL utilizes can be clubbed into three groups. (i) Service support to patrons, provided by the Millennium Integrated Library System, falls under three categories: first, Staff Functions, covering processes such as Acquisitions, Serials, Cataloging, Circulation, Management Reports, etc.; next, Patron Services, which includes products such as WebPac Pro (spell checks, RSS feeds, technologies supporting web computing and presentation of information using CSS sheets), AirPac (access to the library catalog using smartphones), My Millennium (a range of user services such as ‘My ResearchPro’, ‘My Library’, personalized messages about library use, etc.), Express Lane (self-checkouts at kiosks), eCommerce (online fine collection) and Program Registration (access to library programs); and lastly, Campus Computing. (ii) Discovery tools that support a variety of search functions include Encore, or Dragon 2.0, an extension of the library OPAC; Research Pro, a federated search engine that performs searches on multiple information resources including the OPAC, electronic databases and Google Scholar; and Pathfinder Pro, which links search results to websites, electronic databases and other library databases. (iii) Resource Sharing utilizes the INN-Reach and Article-Reach modules of Innopac. INN-Reach, called ‘HKALL’ at HKUL, allows sharing of library resources across all libraries in Hong Kong that have entered a consortium arrangement. Automatic fulfillment of requests enhances loan processing significantly.
Information resources at HKUL identified as important to the current study are: Dragon OPAC; the Patron File (part of HKUL’s ILS); Circulation Information (part of the ILS and displayed as the patron circulation record); HKALL (the INN-Reach module); Dragon 2.0 (an extension of Dragon with collaborative features); the Book Recommendation List (under the eForms option, for patrons to recommend books for purchase by the library); ILLIAD, the inter-library loan system (under the eForms option, for patrons to borrow books from libraries outside Hong Kong); and the HKU Scholars Hub (a resource detailing HKU faculty research). After an analysis of the data sources and the processes they are associated with, the three sources established as relevant for the current study are Dragon OPAC, Patron Information and Circulation Information. Data fields from these sources were used for the design of the DWH.
  • Central to the warehouse is the patron. Items borrowed by the patron can be accessed from the Circulation file based on a unique patron identifier. Also contained in the same file, accessible in Innopac, are the item’s bibliographic information, the shelf location provided by the call number, availability status, the date on which the item was borrowed, saved search strings, etc. All such inter-related information will form a part of the DWH. Accordingly, the three critical data sources are Patron Information, Circulation Information and Dragon OPAC. These sources relate through common fields, namely Patron_ID, which uniquely identifies a patron, and Item_ID, which uniquely identifies an item. Each item can fall under multiple subject headings, and this is represented by the dashed lines in the diagram.
  • With the data sources for the DWH design identified, the steps in constructing the warehouse are as follows. First, generate patron information and populate the Patron_file with data: Patron ID, Patron Name, University Number and contact information. For example, if there are n patrons P1 to Pn, each will hold a record with information about them. Second, populate the Circ_Info file by culling out data associated with the usage history of the patron. For example, if patron P1 has borrowed x items in the past (I11, I12, …, I1x), each of these items will be identified by its Item ID, checked-out date, author, title and call number. Third, subject headings for each of the items I11 through I1x borrowed by patron P1 need to be generated. Once such essential data has been culled out from the operational databases, the steps in data warehousing should be performed to construct the DWH.
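The three population steps can be sketched with Python's built-in sqlite3 module. This is an illustrative sketch under stated assumptions, not the HKUL implementation: the table names follow the slides (PATRON_FILE, CIRC_INFO, SUBJ_HDGS), while the column types, the in-memory database and every sample row are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory warehouse
conn.executescript("""
CREATE TABLE PATRON_FILE (Patron_ID TEXT PRIMARY KEY, Patron_Name TEXT,
                          Univ_Num TEXT, Email_ID TEXT);
CREATE TABLE CIRC_INFO   (Patron_ID TEXT, Item_ID TEXT, Date_Checked_Out TEXT,
                          Author TEXT, Title TEXT, Call_Num TEXT);
CREATE TABLE SUBJ_HDGS   (Item_ID TEXT, Subj_Heading TEXT);
""")

# Step 1: populate PATRON_FILE (hypothetical patrons).
conn.executemany("INSERT INTO PATRON_FILE VALUES (?, ?, ?, ?)", [
    ("P001", "Patron One", "U0001", "p1@example.org"),
    ("P002", "Patron Two", "U0002", "p2@example.org"),
])

# Step 2: populate CIRC_INFO with each patron's borrowing history.
conn.executemany("INSERT INTO CIRC_INFO VALUES (?, ?, ?, ?, ?, ?)", [
    ("P001", "I11", "2012-01-10", "Author A", "Title A", "QA76.9 A1"),
    ("P001", "I12", "2012-02-03", "Author B", "Title B", "Z699 B2"),
    ("P002", "I21", "2012-02-15", "Author C", "Title C", "TX714 C3"),
])

# Step 3: populate SUBJ_HDGS; one item may carry several subject headings.
conn.executemany("INSERT INTO SUBJ_HDGS VALUES (?, ?)", [
    ("I11", "Data mining"), ("I11", "Recommender systems"),
    ("I12", "Data mining"), ("I12", "Library catalogs"),
    ("I21", "Cooking"),
])

# A patron's profile: the distinct headings of everything they borrowed,
# reached through the shared Patron_ID and Item_ID fields.
rows = conn.execute("""
    SELECT DISTINCT s.Subj_Heading
    FROM CIRC_INFO c JOIN SUBJ_HDGS s ON s.Item_ID = c.Item_ID
    WHERE c.Patron_ID = ?
    ORDER BY s.Subj_Heading
""", ("P001",)).fetchall()
profile = [row[0] for row in rows]
print(profile)
```

Keeping subject headings in their own table reflects the point that one item can fall under several headings.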
  • For the purpose of this preliminary study, k-means clustering was identified as a suitable Data Mining tool after a comprehensive literature review and thorough analysis of existing Recommendation Systems. (Choice of a Data Mining algorithm: page 33) The steps used in applying this technique to the prepared DWH are as follows. (HKUL example: page 50) First, designate the number ‘k’ of clusters; k points will serve as the initial cluster centers. If, for instance, patron P1 has borrowed item I11 and there are 2 subject headings for this item, k will be designated as 2. The entire record set, which includes all nearby subject headings along with the k primary cluster centers, then needs to be vectorized in multi-dimensional space. There will be as many dimensions as the number k. Hence, for example, if k=2, all subject headings for item I11 will be represented as points in a 2-dimensional space. Next, the distance between each of the points in space and each of the k cluster centers needs to be calculated. The recommended methods for distance calculation are the Minkowski method, Euclidean distance and City Block distance. Each point is then clustered around the nearest center, so that the distance between points around a center (i.e. within-cluster variation, or WCV) is low while the distance between clusters (i.e. between-cluster variation, or BCV) is high. In other words, the ratio of BCV to WCV is maximized. To calculate the positions of the centroids, or new cluster centers, the mean of the co-ordinates of the points in each cluster is determined. Subsequently all the points in the data set are re-clustered around these new centroids. In re-clustering, points shift to the centers to which they lie closest, creating coherent groups that are homogeneous within each group but distinctly separate across groups. The application is complete when recalculating the centroid positions and re-clustering have been repeated until no more items change cluster.
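The clustering steps above can be sketched in plain Python. This is a generic k-means loop, offered as an illustration rather than the study's implementation. How subject headings are turned into co-ordinates is not specified here, so the 2-dimensional points and the initial centers are hypothetical. The Minkowski distance with exponent r covers both City Block (r=1) and Euclidean (r=2) distance.

```python
def minkowski(p, q, r=2):
    """Minkowski distance; r=1 is City Block, r=2 is Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)


def assign(points, centers, r=2):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda j: minkowski(p, centers[j], r))
        for p in points
    ]


def recompute(points, labels, centers):
    """New centroid = mean of the co-ordinates of the cluster's members."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:  # keep an empty cluster's center where it was
            new_centers.append(old)
            continue
        new_centers.append(tuple(
            sum(m[d] for m in members) / len(members) for d in range(len(old))
        ))
    return new_centers


def k_means(points, centers, r=2, max_iter=100):
    """Alternate re-centering and re-clustering until no point moves."""
    labels = assign(points, centers, r)
    for _ in range(max_iter):
        centers = recompute(points, labels, centers)
        new_labels = assign(points, centers, r)
        if new_labels == labels:  # no item was re-clustered: converged
            break
        labels = new_labels
    return labels, centers


# Hypothetical vectorized records and k=2 initial centers.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = k_means(points, centers=[(1.0, 1.0), (8.0, 8.0)])
print(labels, centers)
```

On this toy data the first three points settle in one cluster and the last three in the other after a single re-centering pass.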
  • The final step in the methodology involved compiling the various parts of the process to evolve the Recommendation System design. In short, information from the Innopac databases, namely circulation statistics, patron details and subject headings, is extracted, transformed and loaded into a DWH. When k-means clustering is applied to this DWH, a personalized list of recommendations should be output from the system. This establishes the design of the Recommendation System.
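One possible way to turn subject heading clusters into a recommendation list is sketched below. The talk does not spell out how the recommendation component works, so this is an assumption: an item is suggested when any of its headings falls in a cluster that also contains one of the patron's headings. The cluster labels, item IDs and the function name are hypothetical.

```python
def recommend(patron_headings, heading_cluster, item_headings, borrowed):
    """Suggest unborrowed items that share a cluster with the patron's headings."""
    target_clusters = {
        heading_cluster[h] for h in patron_headings if h in heading_cluster
    }
    picks = []
    for item_id, headings in sorted(item_headings.items()):
        if item_id in borrowed:
            continue  # never recommend what the patron already borrowed
        clusters = {heading_cluster[h] for h in headings if h in heading_cluster}
        if clusters & target_clusters:
            picks.append(item_id)
    return picks


# Hypothetical clustering output: subject heading -> cluster label.
heading_cluster = {
    "Data mining": 0, "Machine learning": 0,
    "Library catalogs": 1, "Cooking": 2,
}
# Hypothetical catalog: item ID -> subject headings.
item_headings = {
    "I11": ["Data mining"],
    "I30": ["Machine learning"],
    "I31": ["Cooking"],
    "I32": ["Library catalogs"],
}
picks = recommend(["Data mining"], heading_cluster, item_headings, borrowed={"I11"})
print(picks)
```

Because "Machine learning" shares a cluster with the patron's "Data mining" heading, item I30 is suggested even though the patron never searched for that heading directly.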
  • As a result of this study, the fundamental feasibility of constructing this system is established. The two research questions stand resolved. Hence, it can be concluded that HKUL data can be used to generate user profiles automatically, based on subject headings, and that Data Mining can be applied to generate pertinent item recommendations for researchers. Since Innopac is a popular integrated library system, the generalized applicability of this design to other libraries is deemed logically feasible. One of the obstacles in applying k-means clustering is designating the number k and hence the cluster centers. Since k is generated automatically in this design, this issue can be overcome. The design is logical in layout, flexible in adapting to different platforms and scalable in construction, which makes it a viable solution for libraries. Improved search effectiveness and better resource utilization are direct, immediate benefits for patrons and the library.
  • The study is based on evidence from the literature that subject searches are an efficient means of information retrieval for libraries but are often underutilized by patrons. This basis needs to be established through qualitative research methods, along with a user needs analysis that clearly reiterates the requirement for such a system. Further research is also required to develop this design into a fully functional system, taking into account all the constraints that might be encountered in its construction. This will help in working out the practical considerations of the design. Although the design can be separated at the conceptual level into logically distinct units, within each of which process flows can be customized to individual organizations, such attributes of generalizability need to be studied in detail and documented clearly.
  • Finally, I would like to express my sincere gratitude to Dr Alvin Kwan who, as the supervisor for the Independent Project in the MLIM course, has been a considerate supporter, guide and advisor in every aspect of project execution. Ms Ruth Wong, the Access Services Librarian at HKUL, has been a valuable source of critical information for the development of the project, and I am extremely thankful for her prompt and enthusiastic support. I also sincerely appreciate Dr Sam Chu for actively encouraging me and offering every kind of advice and support in extending this project into a full-fledged research study for the PhD program, under his primary supervision.
  • And thank you all for your attendance. I’ll be happy to receive any feedback you may be able to offer. Thanks again!

Transcript

  • 1. Book Recommendation System using Data Mining for the University of Hong Kong Libraries. By Sandhya Rajagopal. CITERS Conference, HKU, June 15th, 2012
  • 2. AGENDA: Introduction; Literature Review; Methodology (4-step process); Merits; Further Research
  • 3. Introduction: Background. Academic Research & Information Search; Information Search: Internet vs. OPACs; Subject Heading Organization & Search Efficiency; Resource Utilization & Personalization; Recommender Systems: a Viable Solution
  • 4. Introduction: Research Questions. How can meaningful profiles of user preferences be extracted from library usage data? How can Data Mining techniques be applied to recommend personalized, pertinent items and simultaneously capitalize on subject searches to improve the overall effectiveness of OPACs?
  • 5. Introduction: Research Method. Systems Analysis of HKUL’s Innopac; Data Warehouse Design; Application of Data Mining; Recommendation System Model. Research Outcome: Resolution of Research Questions; Feasibility of Recommendation System Design
  • 6. Literature Review: Recommendation System Algorithms (Definition, Functions, Types); Data Warehousing (Definition, Architecture); Data Mining (Definition, Functionality); Data Mining & Knowledge Discovery; Data Mining & Recommendation Systems
  • 7. Literature Review: Recommendation System Algorithms. Definition: a computer system which computes & presents pertinent choices. Functions: serves as a personal advisor; improves quality & effectiveness in decision-making; increases potential of serendipitous discoveries
  • 8. Literature Review: Recommendation System Algorithms. Types: Collaborative Filtering Algorithm (Zanker & Jannach, 2010); Content-Based Algorithm (Zanker & Jannach, 2010)
  • 9. Literature Review: Data Warehousing. Definition: a specific manner of storing data; a set of tools & techniques for data analyses to support decision-making. Architecture
  • 10. Literature Review: Data Mining. Definition: an automated data exploration & analysis process that uncovers meaningful patterns & rules. Functionality: Description (explain underlying patterns); Classification (categorize items into ‘Training Sets’); Estimation (categorize numerically & estimate value of new items); Prediction (categorize & forecast future results); Clustering (group similar items & maximize intra-group similarities); Association (identify similar items & uncover linkage rules)
  • 11. Literature Review: Data Mining & Recommendation Systems. Diagram labels: K-nearest Neighbor, Decision Trees, Prediction, Classification, Decision Tree Rules, Bayesian Networks, Space Vector Model, Artificial Neural Networks, Analysis, Association Rule Mining, Description, K-means clustering, Clustering, Density based clustering, Message-passing clustering, Hierarchical Clustering [Extracted from: (Fayyad, Piatetsky-Shapiro, & Smyth, 1996)]
  • 12. Literature Review: Data Mining & Knowledge Discovery. Data Mining: a critical component in Knowledge Discovery. Knowledge Discovery: the basis for design of the Recommendation System (Fayyad, Piatetsky-Shapiro, & Smyth, 1996)
  • 13. Methodology: 4-Step Process. Systems Analysis of HKUL’s Innopac; Data Warehouse Design; Application of Data Mining; Recommendation System Model
  • 14. Methodology Step 1: Systems Analysis of Innopac. Service Entities: Staff Functions, Patron Services, Campus Computing. Discovery Tools: Encore, Research Pro, Pathfinder Pro. Resource Sharing: INN-Reach, Article-Reach. Relevant HKUL Resources: Dragon OPAC (Author, Title, Call #, Location, LCSH); Patron Information (Patron ID, Name, HKU ID, e-mail); Circulation Information (Author, Title, Call #, Check-Out dt, search)
  • 15. Methodology Step 2: Data Warehouse Design. Sources: Dragon OPAC, Circulation Information, Patron Information. Tables: SEARCH_HIST (Patron_ID, Search_String); CIRC_INFO (Patron_ID, Item ID, Date_Checked_Out, Author, Title, Call_Num, Location, Status, Num_of_Items); PATRON_FILE (Patron_ID, Patron_Name, Univ_Num, E-mail_ID); SUBJ_HDGS (Item_ID, Subj_Headings)
  • 16. Methodology Step 2: Data Warehouse Design. Process Flow and Example. Generate Patron Information (or) Populate PATRON_FILE: P1, P2, P3, …, Pn; P1 > P001 > P1_Name > P1_Unum > P1_email. Generate Circulation Information (or) Populate CIRC_INFO: P1 > P001; I11, I12, I13, …, I1x; Item_ID, Date, Author, Title, Call #. Generate Subject Headings (or) Populate SUBJ_HDGS: P1 > I11, I12, I13, …, I1x; P001, Item_ID, Subject Headings
  • 17. Methodology Step 3: Application of Data Mining. k-means Clustering Steps: Designate the number ‘k’ as number of clusters; Vectorize record set along with centers; Calculate distance of each vectorized record from centers; Cluster records around the centers minimizing distance; Calculate new centroids (mean of center co-ordinates) & re-cluster; Repeat steps until no items are re-clustered
  • 18. Methodology Step 4: Recommendation System Modeling. Diagram labels: Innopac (Circulation Statistics, Patron Details, Subject Headings); Extract/Transform/Load; Reconcile Data; HKUL Data Warehouse; Subject Heading Clusters; User Profiles; Recommendation Component; Recommendation List
  • 19. Merits: Feasibility of Design; Generalized Applicability; Automated generation of k; Logical, flexible & scalable; Increased search effectiveness; Better utilization of Library Resources
  • 20. Further Research. Qualitative Research: establish efficacy of subject searches; establish need among patrons. Systems Development Research. Evaluate generalizability
  • 21. Acknowledgment: Dr. Alvin Kwan, Teaching Consultant, Faculty of Education; Ms. Ruth Wong, Access Services Librarian, HKUL; Dr. Sam Chu, Associate Professor, Faculty of Education. References: Zanker, M., & Jannach, D. (2010). Introduction to Recommender Systems: Tutorial at ACM Symposium on Applied Computing 2010 [Tutorial presentation]. Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From Data Mining to Knowledge Discovery in Databases. AI Magazine, 17(3).
  • 22. Thank You