  • We use two kinds of charts for the visualization:
    1. Pie chart: the advantage of a pie is that the circle's radius can encode the number of articles in each cluster.
    2. Radar chart: a radar chart can show the strength of the relationships between clusters, but it is only applicable when a cluster is related to more than three other clusters.
    Next we introduce how our system presents these visualizations.
  • *Our system currently contains 541 articles, mainly papers related to information retrieval from the CiteSeer database.
    This is the system's home page. On the left are our topic categories: the main topic of each cluster produced by document clustering.
    *On the right are the clustering results. The size of a circle represents the number of articles it contains, and the shade of its color represents the similarity within the cluster:
    the higher the similarity, the darker the color.
  • Next we explain how to use the system to find the articles we need.
    Suppose we want to find articles related to SOM.
    *We can look for SOM-related topics among the categories.
    *The system then lists the SOM-related articles.
    *We can also view a related article's abstract directly, which helps the user judge whether the document is what they need.
    *We can also see which topic concepts are related to SOM; besides a bulleted list,
    *these are shown as a radar chart as well.
  • A radar chart can be clicked to enlarge it, so the user can see it more clearly.
    1. Bibliomining: An Introduction
    2. Outline • Introduction • Bibliomining Process • Example Applications • Placing Bibliomining in Context • A Research Agenda to Advance Bibliomining
    3. Origins and Definition of Bibliomining • "bibliometrics" + "data mining" – Bibliometrics focuses on the creation of works – Data mining (Web usage mining) focuses on the access of works • The application of data mining and bibliometric tools to data produced from library services • Gain a better understanding of library user communities – Frequencies and aggregate measures hide underlying patterns • The combination of data mining, bibliometrics, statistics, and reporting tools used to extract patterns of behavior-based artifacts from library systems for aiding decision-making or justifying services
    4. Bibliometrics • Traditional bibliometrics is based on the quantitative exploration of document-based scholarly communication • Data for bibliometrics – Works: authors, collections – Connections: citations, authorship, common terms, other aspects of the creation and publication process • Allows researchers to understand the context in which a work was created, the long-term citation impact of the work, and the differences between fields in their scholarly output patterns
    5. Data for Bibliometrics
    6. Bibliometrics (Cont.) • Frequency-based analysis, visualization, data mining – Frequency of authorship in a subject, commonality of words used, and discovery of a core set of frequently cited works – Integrating the citations between works allows for very rich exploration of relations between scholars and topics – Linkages between works are used to aid automated information retrieval and visualization of scholarship and of the social networks among those involved in the creation process – Many newer bibliometric applications involve Web-based resources and hyperlinks that enhance or replace traditional citation information
    7. Social Network
    8. User-based Data Mining • One popular area: the examination of how users explore Web spaces (Web usage mining) – Focus on accesses of different Web pages by a particular user (or IP address) – Patterns of use are discovered through data mining and used to personalize the information presented to the user or to improve the information service • In user-based data mining, the links between works come from a commonality of use – If one user accesses two works during the same session, for example, and another user views one of those works, then the other might also be of interest
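The co-access idea on slide 8 can be sketched in a few lines: works linked because users viewed them in the same session. The session data and work identifiers below are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical session logs: each session lists the works one user accessed.
sessions = [
    ["workA", "workB", "workC"],
    ["workA", "workB"],
    ["workB", "workD"],
]

# Count how often each pair of works appears in the same session.
co_access = defaultdict(int)
for works in sessions:
    for a, b in combinations(sorted(set(works)), 2):
        co_access[(a, b)] += 1

def related(work):
    """Works most often co-accessed with `work`, strongest link first."""
    scores = defaultdict(int)
    for (a, b), n in co_access.items():
        if a == work:
            scores[b] += n
        elif b == work:
            scores[a] += n
    return sorted(scores, key=scores.get, reverse=True)

print(related("workA"))  # workB co-accessed twice, workC once
```

Given "workA", the other work each of its viewers also saw ("workB") ranks first, which is exactly the recommendation pattern the slide describes.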
    9. Data for User-Based Data Mining • Links between works that result from the users
    10. Data for Anonymized Community-Based Web Usage Mining • Demographic Surrogate
    11. Bibliomining Process
    12. Overview • Determining areas of focus • Identifying internal and external data sources • Collecting, cleaning, and anonymizing the data into a data warehouse • Selecting appropriate analysis tools • Discovering patterns through data mining and creating reports with traditional analytical tools • Analyzing and implementing the results
    13. Determining Areas of Focus • Might come from a specific problem in the library or may be a general area requiring exploration and decision-making • Directed data mining: problem-focused – Ex. Budget cuts have reduced the staff time for contacting patrons about delinquent materials. Is there a way to predict the chance patrons will return material once it is one week late, in order to prioritize our calling lists? • Undirected data mining: consider a general topical area – Ex. How are different departments and types of patrons using the electronic journals? – May produce an overwhelming number of patterns to explore for validation – should be considered only when a strong data warehouse is in place
    14. Identifying Data Sources • The bibliomining process requires transactional, non-aggregated, low-level data • Privacy issue? • Internal data sources are those already within the library system – Patron database, transactional data, Web server logs • External data sources – Demographic information related to a specific ID number that is located in the computer center or personnel management system – Demographic information for zip codes from census data
    15. Data for Bibliomining
    16. Conceptual Framework for Data Types in the Bibliomining Data Mining
    17. A Framework for the Data • Data about a work – Three kinds of fields • Fields extracted from the work (like title or author) • Fields created about the work (like subject heading) • Fields that indicate the format and location of the work (like URL or collection) – Come from a MARC record, Dublin Core information, or a CMS – Can also connect into bibliometric information, such as citations or links to other works • May require extraction from the original source (in the case of digital reference) or linking into a citation database – Challenge: no article-level usage reports
    18. A Framework for the Data (Cont.) • Data about the user – Demographic surrogate – Other fields that come from inferences about the user: zip code, location/department/lab (inferred from the IP address)
    19. A Framework for the Data (Cont.) • Data about the service – Searching, circulation, reference, interlibrary loan, and other library services – Fields common to most services include time and date, library personnel involved, location, method, and whether the service was used in conjunction with other services – Each library service also has a set of appropriate fields • Searching: the content of the search and the next steps taken • Interlibrary loan: cost, vendor, and time of fulfillment • Circulation: acquisition process of the work and circulation length
    20. Creating the Data Warehouse • A data warehouse is a database separate from the operational systems that contains a cleaned and anonymized version of the operational data, reformatted for analysis • Queries extract the data from the identified sources, combine those data using common fields, clean the data, and write the resulting records into either a flat file or a relational database designed specifically for analysis • Can be automated to pull data from the operational systems into the data warehouse on a regular basis
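The extract-clean-load step slide 20 describes might look like the sketch below, using SQLite for both sides. The table names, columns, sample rows, and the demographic-surrogate mapping are all assumptions for illustration, not from the chapter.

```python
import sqlite3

op = sqlite3.connect(":memory:")   # stands in for the operational system
op.execute("CREATE TABLE loans (patron_id TEXT, item_id TEXT, loan_date TEXT)")
op.executemany("INSERT INTO loans VALUES (?, ?, ?)", [
    ("p1", "i100", "2006-01-03"),
    ("p1", "i101", " 2006-01-04"),     # stray whitespace to clean
    ("p2", "i100", "2006-01-05"),
])

dw = sqlite3.connect(":memory:")   # the separate analysis database
dw.execute("CREATE TABLE fact_loan (demographic_group TEXT, item_id TEXT, loan_date TEXT)")

# Extract, clean, and anonymize: replace patron_id with a demographic
# surrogate (looked up elsewhere in a real system) before loading.
surrogate = {"p1": "undergrad", "p2": "faculty"}   # assumed mapping
for patron_id, item_id, loan_date in op.execute("SELECT * FROM loans"):
    dw.execute("INSERT INTO fact_loan VALUES (?, ?, ?)",
               (surrogate[patron_id], item_id, loan_date.strip()))
dw.commit()

print(dw.execute("SELECT COUNT(*) FROM fact_loan").fetchone()[0])  # 3
```

The warehouse keeps the behavior (which items circulated, when, to which community) while the individual identifiers never leave the operational side, matching the privacy point the next slides make.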
    21. Creating the Data Warehouse – Protecting Patron Privacy • Going through the data warehousing process requires the library to examine its data sources • By explicitly determining what to keep and what to destroy, libraries can save the demographic information needed to evaluate communities of users without keeping records of the individuals in those communities • Two examples
    22. Cleaning Transactional Records
    23. Cleaning Web Server Transactional Records
    24. Creating the Data Warehouse – Building the Data Warehouse • Building the data warehouse takes much more time than mining the data • Start with a narrowly defined bibliomining topic and work through the entire process • This iterative process also has the advantage of allowing those developing the data warehouse to improve their collection and cleaning algorithms early in the life of the bibliomining project
    25. Selecting Appropriate Analysis Tools • Traditional Reporting • Management Information System (MIS) • Online Analytical Processing (OLAP) • Visualization • Data Mining
    26. Analysis Tools – Traditional Reporting • Library decision-makers examine aggregates and averages to understand their service use • The advantage of the data warehouse is that new questions can be asked not only of the present situation but also of the past – This allows those doing evaluation or measurement to ask new questions and then create a historical view of those reports in order to understand trends • Libraries can more easily understand behavior across different demographic groups in the library
    27. Analysis Tools – Management Information System (MIS) • Provides a manager with the ability to ask basic questions of the data • ILS packages have some type of basic MIS built in • An MIS built on top of a data warehouse made for the library will be more powerful and provide the information the library needs to see • Another addition to an MIS is a critical-factor alert system – Example: if hourly circulation (the factor) is below or above a certain level, a manager could be immediately notified so staffing changes could be made
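The critical-factor alert on slide 27 can be as simple as a threshold check; the band limits and message wording here are invented.

```python
# A minimal threshold alert, assuming hourly circulation counts are available.
LOW, HIGH = 20, 120   # assumed acceptable range of loans per hour

def check_circulation(hourly_count):
    """Return an alert message if the factor leaves its normal band, else None."""
    if hourly_count < LOW:
        return f"ALERT: circulation low ({hourly_count}/h) -- consider reducing desk staff"
    if hourly_count > HIGH:
        return f"ALERT: circulation high ({hourly_count}/h) -- consider adding desk staff"
    return None

print(check_circulation(150))
print(check_circulation(50))   # within the normal band: no alert
```

In a real MIS this check would run against the data warehouse on a schedule and notify the manager; the interesting design work is choosing the factors and the bands, not the check itself.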
    28. Analysis Tools – Online Analytical Processing (OLAP) • An interactive view of the data • Under the surface, the OLAP tool has run thousands of database queries to combine all of the selected variables along with all of the selected measures (aggregation types, timeframes, …) • All of the fields are defined ahead of time, and the system runs many queries before anyone uses it – Response to the manager using the OLAP front-end for reports is instant, which encourages exploration • Penn Library Data Farm (http://metrics.library.upenn.edu/prototype/datafarm/)
    29. Analysis Tools – Online Analytical Processing (OLAP) (Cont.) • The user will pick one of many variables from a list to examine • Example: use of e-journals along dimensions such as time and subject – A high-level view of this data in a tabular report (year and general classification) – Expand the report: click on a year to expand it into quarters, leaving the subject headings the same and recalculating the data – The user can then click on another field to drill down into the data • During exploration, the manager can capture any view of the data and turn it into a regular report
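The year-to-quarter drill-down on slide 29 can be illustrated with a small roll-up function; the e-journal usage records and the "year"/"quarter" grains are hypothetical.

```python
from collections import Counter

# Hypothetical e-journal usage records: (date, subject, uses).
records = [
    ("2005-02-14", "Biology", 40),
    ("2005-08-02", "Biology", 25),
    ("2005-03-10", "History", 10),
    ("2006-01-20", "Biology", 30),
]

def rollup(records, time_grain):
    """Aggregate uses by (time bucket, subject); grain is 'year' or 'quarter'."""
    table = Counter()
    for date, subject, uses in records:
        year, month = date[:4], int(date[5:7])
        if time_grain == "year":
            bucket = year
        else:                                 # drill down: year -> quarters
            bucket = f"{year}-Q{(month - 1) // 3 + 1}"
        table[(bucket, subject)] += uses
    return dict(table)

print(rollup(records, "year"))
print(rollup(records, "quarter"))   # same subjects, finer time buckets
```

A real OLAP tool precomputes these aggregates for every combination of dimensions, which is why the interactive response feels instant.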
    30. Analysis Tools – Visualization • Present the characteristics of data in a visual form
    31. Analysis Tools – Data Mining • Discovery of valid, novel, and actionable patterns in large amounts of data using statistical and artificial intelligence tools • Two main categories of data mining tasks – Description: understand the data from the past and the present • Discover patterns for affinity groups of variables common to different patrons, or clusters of demographic groups that exhibit certain characteristics (association rule mining, clustering) – Prediction: make a statement about the unknown based upon what is known • Classification (place an item into a category) • Estimation (produce a numeric value for an unknown variable)
    32. Analysis Tools – Data Mining (Cont.) • Techniques: neural networks, regression, clustering, rule generation, and classification • Process: – Take a cleaned data set – Generate new variables from existing ones – Split the data into model-building sets and test sets – Apply techniques to the model-building sets to discover patterns – Use the test sets to ensure the patterns are generalizable – Confirm these patterns with someone who knows the domain • Web usage mining, text mining (+ bibliometrics)
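The process on slide 32 (split into model-building and test sets, discover a pattern, check that it generalizes) can be sketched on toy patron records. The variables and the one-threshold "model" are invented, standing in for a real classifier.

```python
import random

# Hypothetical cleaned records: (loans_per_month, renewals_per_month, returned_late).
data = [(2, 0, True), (15, 4, False), (3, 1, True), (20, 6, False),
        (1, 0, True), (18, 5, False), (4, 0, True), (22, 7, False)]

random.seed(0)
random.shuffle(data)
split = int(len(data) * 0.75)
build, test = data[:split], data[split:]   # model-building set and test set

# Discover a pattern on the build set only: here, predict a late return
# when monthly loans fall below the build-set mean (a stand-in for a
# classifier learned by a real technique).
mean_loans = sum(r[0] for r in build) / len(build)
predict = lambda loans: loans < mean_loans

# Validate on records held out from model building.
accuracy = sum(predict(r[0]) == r[2] for r in test) / len(test)
print(mean_loans, accuracy)
```

The point of the held-out test set is exactly the slide's point: a pattern that only fits the build set would score poorly here and should not be trusted.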
    33. Analysis Tools – Category & Cluster Results
    34. Analysis Tools – Cluster Detail Information (figure labels: cluster label, related topics, citation relations, related articles, abstract)
    35. Analysis Tools – Citation Relation
    36. DREW Open Effort Project • Digital Reference Electronic Warehouse (DREW) – Develop an XML schema to: • Allow digital reference transactions from different services and in different communication forms to live together in one space • Allow researchers to access these archives and explore them using a variety of methods – Capture the results of this research into a management information system, and then allow the reference services to view their own archives through the tools created by the researchers • Knowledge base, citations, and links to other works
    37. Analysis and Implementation • Once the results have been developed, they must be validated – Test and tweak the model with data that were not used during the development process (training and test) – The most important validation is to have a librarian who is familiar with that particular library context examine the models • Implement the report/model – It is essential to monitor the variables that power the models over time; if the mean of a variable strays too far because of changes in the library, the model may have to be reevaluated
    38. Example Applications – See Another PPT
    39. Placing Bibliomining in Context
    40. Conceptual Framework for Decision-Makers
    41. Conceptual Framework for Library and Information Scientists • Hypothetico-Deductive-Inductive Method (induction; inference)
    42. Understanding both Frameworks • In both frameworks, bibliomining is not the end of the exploration process • It is one tool to be used in combination with other methods of measurement and evaluation, such as LIBQUAL, E-metrics, cost-benefit analyses, surveys, focus groups, or other qualitative explorations • Using only bibliomining to understand a digital library can result in biased or incomplete results • While the information provided by bibliomining is useful, it needs to be supplemented by more user-based approaches to provide a more complete picture of the library system
    43. A Research Agenda to Advance Bibliomining
    44. Data Collection • Various data sources – Integrated library system – Web-based front-end to digital libraries (federated search) – A system to support interlibrary loan – A system to support digital reference services – External systems: citation databases, census data • How to collect data and match it between systems – Standards for data – Project COUNTER, NISO Z39.7-200x (library metrics and statistics) → aggregate-level data – Cooperation between system creators: easily exportable data warehouses matched between systems through common fields
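Matching records between systems through common fields, as slide 44 suggests, might look like the sketch below. The ISBN key and both record sets are hypothetical; real matching would also have to reconcile key formats across systems.

```python
# Hypothetical records from two systems sharing a common field (ISBN).
ils_holdings = [
    {"isbn": "0-123", "title": "Data Mining", "loans": 42},
    {"isbn": "0-456", "title": "Bibliometrics", "loans": 7},
]
ill_requests = [
    {"isbn": "0-123", "requests": 3},
    {"isbn": "0-789", "requests": 5},   # not held locally
]

# Match the two sources on the shared key, keeping unmatched ILL
# requests visible so gaps in the local collection stand out.
by_isbn = {r["isbn"]: r for r in ils_holdings}
merged = []
for req in ill_requests:
    held = by_isbn.get(req["isbn"])
    merged.append({
        "isbn": req["isbn"],
        "title": held["title"] if held else None,
        "loans": held["loans"] if held else 0,
        "ill_requests": req["requests"],
    })

print(merged)
```

A heavily requested item with no local holding (the second merged row) is exactly the kind of cross-system pattern neither source reveals on its own.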
    45. User Privacy • The bibliomining data warehouse can provide a method for keeping information about the materials used in the library without maintaining specific information about the users of the library • What effect does this anonymization have on the power of the data mining tools to discover patterns? • Privacy-preserving data mining • Privacy issues arising from digital reference services (DRS): personal information in the questions – Text mining and NLP
    46. Variable, Metric, and Model Generation • While researchers have developed metrics for library statistics, they have primarily focused on fields from one data source • Once the warehouse has been constructed, the possibilities grow for discovering interesting variables for mining and metrics for evaluation • Start in the data mining process, looking for relationships between individual variables that allow for deeper understanding – Through the patterns discovered with data mining, new metrics and measures can be proposed • Example: one-time high-demand needs vs. needs that represent the general user base
    47. Integration of Management Information System and Data Mining Tools • Integrate the discovered algorithms into the systems that drive digital libraries • This combination of a built-in data warehouse, an interactive reporting module, standards for report description, and modular design will make it much easier for library decision-makers to get involved with bibliomining • Work toward developing these integrated modules for other systems that support digital libraries
    48. Multi-system Data Warehouses and Knowledge Bases • The creation of services that span many digital libraries – Library consortia – Joining together digital library sources and services while still maintaining identity for those participating (like the National Science Digital Library) • Join data warehouses with libraries that have similar user groups and similar collections – Agree on demographic surrogates or develop a cross-walk algorithm to map demographics – Need to ensure that these patterns apply to their own library before making decisions based upon them • Methods for combining utilization and collection metadata between different systems – Standardize a series of metrics (what do "hit" and "visit" mean?) – Create a standard for record-level data (MARC, COUNTER, …)
    49. Conclusion: Moving beyond Evaluation to Understanding • The final and most long-lasting area of bibliomining research is improving understanding of digital libraries at a generalized, and perhaps even conceptual, level • These data warehouses will combine resources traditionally unavailable in this combined form to researchers – What connections can be made between patron demographics and bibliometric-based social networks of authors? – How much influence do the works written and cited by faculty at an institution have on the patterns of student use of library services? – How do usage patterns differ between departments or demographic groups, and what can the library do to better personalize and enhance existing services? • Qualitative + quantitative
