Meow Hagedorn

meow ::06
Clustering, Classification, and Metadata Enhancement Techniques
David Newman; Bill Landis, ex officio; Kat Hagedorn
July 24, 2006

Clustering, Classification, and Metadata Enhancement Techniques on OAI Records

Goals

Preprocessing & Topic Modeling >
What We Did
Cluster: OAI records -> preprocess (vocabulary) -> topic model (cluster/learn) -> topics

What We Did
Cluster: OAI records -> preprocess (vocabulary) -> topic model (cluster/learn) -> topics
Classify: OAI records -> preprocess (vocabulary) -> topic model (classify) -> 1. topics in records 2. records in topics

What We Did
Clustering is learning the topics; classification is using the learned topics.
Cluster: OAI records -> preprocess (vocabulary) -> topic model (cluster/learn) -> topics
Classify: OAI records -> preprocess (vocabulary) -> topic model (classify) -> 1. topics in records 2. records in topics
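The two classification outputs named on this slide, topics in records and records in topics, can be illustrated with a minimal sketch. It is not the project's topic model: the topic names, word weights, threshold, and the simple weight-sum scoring are all assumptions chosen to show the shape of the two outputs.

```python
from collections import defaultdict

# Hypothetical "learned" topics: word weights per topic (toy numbers).
TOPICS = {
    "machine_learning": {"learning": 0.4, "reinforcement": 0.3, "agent": 0.3},
    "particle_physics": {"quark": 0.5, "collider": 0.3, "energy": 0.2},
}


def topics_in_record(tokens, topics=TOPICS):
    """Output 1: the share of each topic in one preprocessed record."""
    scores = {
        name: sum(weights.get(token, 0.0) for token in tokens)
        for name, weights in topics.items()
    }
    total = sum(scores.values())
    if total == 0:
        return {}
    return {name: score / total for name, score in scores.items() if score > 0}


def records_in_topics(records, threshold=0.5):
    """Output 2: invert output 1 into a topic -> record IDs index."""
    index = defaultdict(list)
    for record_id, tokens in records.items():
        for name, share in topics_in_record(tokens).items():
            if share >= threshold:
                index[name].append(record_id)
    return dict(index)


records = {
    "r1": ["reinforcement", "learning", "agent"],
    "r2": ["quark", "collider"],
}
print(topics_in_record(records["r1"]))
print(records_in_topics(records))
```

The second function is just the first one inverted, which is why a single classification pass can serve both a per-record view and a per-topic browse view.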

Repository Selection

Selected Repositories*
Short Name | Description | Records | Records used for clustering (learning)
arxiv | arXiv.org Eprint Archive | 368,000 | 1 in 3
caltech | Caltech Electronic Theses and Dissertations | 3,000 | -
cern | CERN Document Server | 45,000 | 1 in 2
citeseer | CiteSeer Scientific Literature Digital Library | 717,000 | 1 in 3
doaj | Directory of Open Access Journals Articles | 29,000 | 1 in 2
iop | Institute of Physics | 212,000 | 1 in 3
loc | Library of Congress Digitized Historical Collections | 239,000 | -
nsdl | The National Science Digital Library | 33,000 | 1 in 2
osti | Office of Science and Technology Information | 131,000 | 1 in 3
pangaea | Publishing Network for Geoscientific and Environmental Data | 370,000 | -
pubmed | PubMed Central | 625,000 | 1 in 3
repec | Research Papers in Economics | 141,000 | 1 in 3
*Repositories harvested by UMich/OAIster, June 7, 2006.
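A rough sketch of what the "1 in 2" and "1 in 3" rates imply, assuming they mean systematic sampling (keep every k-th harvested record). The slides do not say how records were actually selected, so the function names and the sampling scheme are illustrative; the counts are the rounded figures from the table.

```python
# Approximate record counts and sampling rates from the Selected Repositories
# slide. None means the repository was not sampled for clustering ("-").
REPOSITORIES = {
    "arxiv": (368_000, 3),
    "caltech": (3_000, None),
    "cern": (45_000, 2),
    "citeseer": (717_000, 3),
    "doaj": (29_000, 2),
    "iop": (212_000, 3),
    "loc": (239_000, None),
    "nsdl": (33_000, 2),
    "osti": (131_000, 3),
    "pangaea": (370_000, None),
    "pubmed": (625_000, 3),
    "repec": (141_000, 3),
}


def sample_every_kth(records, k):
    """Systematic sample (an assumption): keep every k-th record, starting with the first."""
    return records[::k]


def expected_sample_sizes(repositories=REPOSITORIES):
    """Approximate number of records each sampled repository contributes."""
    return {
        name: count // k
        for name, (count, k) in repositories.items()
        if k is not None
    }


sizes = expected_sample_sizes()
print(len(sizes), "repositories sampled")
print(sum(sizes.values()), "records in the clustering sample (approximate)")
```

Under these assumptions nine repositories contribute to the clustering sample, for a total of roughly 785,000 records computed from the rounded counts.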

Usage of Dublin Core Fields

Preprocessing Example
Input record:
<ID=oai:CiteSeerPSU:44072>
<title> Reinforcement Learning: A Survey
<description> This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." …
<subject> Leslie Pack Kaelbling, Michael Littman, Andrew Moore. Reinforcement Learning: A Survey
After preprocessing (with the vocabulary):
<ID=oai:CiteSeerPSU:44072> reinforcement learning survey survey field reinforcement learning computer science perspective written accessible researcher familiar machine learning historical basis field broad selection current summarized reinforcement learning faced agent learn behavior trial error interaction dynamic environment resemblance psychology differ considerably detail word reinforcement … leslie pack kaelbling littman andrew moore reinforcement learning survey
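The transformation in this example (lowercasing, splitting on punctuation, dropping stopwords, trimming plural endings) can be approximated with a short sketch. The stopword list and the plural-stripping rule below are assumptions reverse-engineered from this one record, not the project's actual lists or stemmer.

```python
import re

# Hypothetical stopword list: common function words plus a few
# corpus-specific terms ("paper", "problem", "work", ...) that the slide's
# output appears to drop.
STOPWORDS = {
    "a", "an", "and", "are", "be", "both", "but", "by", "from", "has",
    "here", "in", "is", "it", "of", "that", "the", "this", "through",
    "to", "with", "paper", "problem", "work", "use", "described",
}


def stem(token):
    """Very light stemming (an assumption): strip a plural 's' only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith(("ss", "us", "is")):
        return token[:-1]
    return token


def preprocess(text):
    """Lowercase, split on non-letters, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOPWORDS]


sentence = (
    "Reinforcement learning is the problem faced by an agent that learns "
    "behavior through trial-and-error interactions with a dynamic environment."
)
print(" ".join(preprocess(sentence)))
```

For this sentence the sketch reproduces the corresponding fragment of the slide's output: "reinforcement learning faced agent learn behavior trial error interaction dynamic environment".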

Stopwords and Stemming

Building Vocabulary

Computation
Decision point: How many topics?
Decision point: How many iterations?
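The two decision points map onto the two main parameters of a typical topic-model run. The sketch below assumes an LDA-style model fitted by collapsed Gibbs sampling, which the surviving slide text does not confirm; `num_topics` and `iterations` stand in for the two decisions, and `alpha`, `beta`, and `seed` are illustrative defaults.

```python
import random
from collections import defaultdict


def gibbs_topic_model(docs, num_topics, iterations, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for an LDA-style topic model (assumed algorithm).

    docs is a list of token lists. Returns (doc_topic, topic_word) count tables.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                # Resample its topic in proportion to the conditional weights.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


docs = [["learning", "agent", "learning"], ["quark", "collider"], ["agent", "quark"]]
doc_topic, topic_word = gibbs_topic_model(docs, num_topics=2, iterations=20)
print(doc_topic)
```

In a sampler of this kind, each sweep costs time proportional to the number of tokens times the number of topics, so both decisions directly scale the run time.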

Broad Topical Categories

Broad Topical Categories
Cluster: OAI records -> preprocess (vocabulary) -> topic model (cluster/learn) -> topics
Cluster the clusters: topics -> topic model (cluster/learn) -> broad topical categories

Broad Topical Categories
Cluster: OAI records -> preprocess (vocabulary) -> topic model (cluster/learn) -> topics
Cluster the clusters: topics -> topic model (cluster/learn) -> broad topical categories
Classify group of keywords: group of keywords -> preprocess (vocabulary) -> topic model (classify) -> topics organized under broad topical categories

The Browser >
The “Browser”
*Based on 750,000 sampled records from 9 repositories, 500 topics

The “Browser”: http://yarra.calit2.uci.edu/meow/

Selected Topics: Useful

Selected Topics: Less Useful
and very useful to filter out French records

Broad Topical Categories (BTCs)

BTCs: Clustering the clusters

BTCs: Classifying group of keywords
domain expert specifies list of relevant keywords and (importance)

BTCs: Classifying group of keywords
in review, would delete this topic from this BTC
just found 1 topic relevant to transportation
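One way to picture the keyword-based route to a BTC: score each learned topic against the expert's weighted keyword list and keep the topics that clear a cutoff. This is a hedged stand-in for the topic-model classification step, not the project's method; the topic IDs, word weights, keyword importances, and threshold are all hypothetical.

```python
def score_topics(keywords, topics):
    """Score each topic by the importance-weighted mass of the expert keywords."""
    return {
        name: sum(weight * words.get(keyword, 0.0) for keyword, weight in keywords.items())
        for name, words in topics.items()
    }


def topics_for_category(keywords, topics, threshold):
    """Topics to file under a broad topical category, best match first."""
    scores = score_topics(keywords, topics)
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]


# Hypothetical expert keyword list with importance weights.
transportation_keywords = {"vehicle": 2.0, "traffic": 1.0, "highway": 1.0}

# Hypothetical learned topics (word -> weight within the topic).
topics = {
    "t17": {"vehicle": 0.2, "traffic": 0.1, "road": 0.1},
    "t42": {"protein": 0.3, "cell": 0.2},
    "t88": {"highway": 0.05, "bridge": 0.2},
}

print(topics_for_category(transportation_keywords, topics, threshold=0.1))
```

With these toy numbers only one topic clears the cutoff. A borderline match such as `t88` is the kind of assignment a human reviewer would accept or delete by hand.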

Browse Records in a Topic
nice mix of repositories
can navigate back to multiple BTCs

Browse Records in a Topic: From one repository
display records just from Library of Congress

Sample Record
link to actual OAI record
topics for this record

Repository-specific Browsers

Lessons Learned & Next Steps >
Evaluation

Further Evaluation

Discussion Point: When to Re-cluster?
[Diagram: cluster and classify steps]

Products and Services

Archive of Topics

Subject Search/Browse for OAIster

How To Reach Us
