Meow Hagedorn


Published in: Health & Medicine

  1. 1. meow ::06: Clustering, Classification, and Metadata Enhancement Techniques. David Newman; Bill Landis, ex officio; Kat Hagedorn. July 24, 2006
  2. 2. Clustering, Classification, and Metadata Enhancement Techniques on OAI Records <ul><li>Preprocessing and Topic Modeling </li></ul><ul><li>The “Browser” </li></ul><ul><li>Lessons Learned and Next Steps </li></ul>
  3. 3. Goals <ul><li>Evaluate topical/subject-based metadata enhancement </li></ul><ul><li>Experiment on testbed of multiple OAI repositories </li></ul><ul><li>Discuss lessons learned and refine testing </li></ul><ul><li>Propose products and services </li></ul>
  4. 4. What We Did: Cluster [Diagram: OAI records → preprocess (with vocabulary) → topic model (cluster/learn) → topics]
  5. 5. What We Did: Cluster and Classify [Diagram. Cluster: OAI records → preprocess (with vocabulary) → topic model (cluster/learn) → topics. Classify: OAI records → preprocess (with vocabulary) → topic model (classify) → 1. topics in records, 2. records in topics]
  6. 6. What We Did: Cluster and Classify. Clustering is learning the topics; classification is using the learned topics. [Diagram. Cluster: OAI records → preprocess (with vocabulary) → topic model (cluster/learn) → topics. Classify: OAI records → preprocess (with vocabulary) → topic model (classify) → 1. topics in records, 2. records in topics]
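The distinction between clustering (learning topics) and classification (using learned topics) can be sketched in code. The slides do not name the topic model, so this is a minimal sketch under the assumption that it is an LDA-style model fitted by collapsed Gibbs sampling; the iteration counts and memory formula on the Computation slide are at least consistent with that. The function `gibbs_lda`, the hyperparameters `alpha` and `beta`, and the toy word ids are all illustrative, not the authors' code.

```python
import random

def gibbs_lda(docs, n_topics, n_words, n_iter, topic_word=None,
              alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for an LDA-style topic model (assumed model).

    docs: list of records, each a list of integer word ids.
    topic_word: if None, topic-word counts are learned (cluster/learn);
                if given, they stay fixed and only each record's topic
                mixture is sampled (classify).
    Returns (topic_word_counts, per-record topic mixtures).
    """
    rng = random.Random(seed)
    learn = topic_word is None
    nwt = [[0] * n_words for _ in range(n_topics)] if learn else topic_word
    nt = [sum(row) for row in nwt]            # total words per topic
    ndt = [[0] * n_topics for _ in docs]      # topic counts per record
    z = []                                    # topic assignment per token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = int(rng.random() * n_topics)  # random initial topic
            zd.append(t)
            ndt[d][t] += 1
            if learn:
                nwt[t][w] += 1
                nt[t] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndt[d][t] -= 1                # remove the token's current assignment
                if learn:
                    nwt[t][w] -= 1
                    nt[t] -= 1
                weights = [(ndt[d][k] + alpha) * (nwt[k][w] + beta)
                           / (nt[k] + n_words * beta) for k in range(n_topics)]
                r = rng.random() * sum(weights)
                t = 0
                while t < n_topics - 1 and r >= weights[t]:
                    r -= weights[t]
                    t += 1
                z[d][i] = t                   # record the newly sampled topic
                ndt[d][t] += 1
                if learn:
                    nwt[t][w] += 1
                    nt[t] += 1
    mix = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
           for row, doc in zip(ndt, docs)]
    return nwt, mix

# Cluster: learn 2 topics from four toy records over a 4-word vocabulary.
train = [[0, 1, 0, 1, 0], [1, 0, 1, 1], [2, 3, 2, 3, 3], [3, 2, 2, 3]]
topic_word, train_mix = gibbs_lda(train, n_topics=2, n_words=4, n_iter=200)

# Classify: reuse the learned topics for a new record, with fewer iterations.
_, new_mix = gibbs_lda([[2, 3, 3, 2]], n_topics=2, n_words=4, n_iter=40,
                       topic_word=topic_word)
```

Because classification leaves the topic-word counts untouched, it only needs enough iterations for each record's own topic mixture to settle, which is one plausible reason the deck can use far fewer iterations for classification than for clustering.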
  7. 7. Repository Selection <ul><li>Mix of cultural heritage repositories? </li></ul><ul><ul><li>UMich, Library of Congress, CDL, State Library of Victoria (Australia), … </li></ul></ul><ul><ul><li>Average of 15 words per record (excluding stopwords) </li></ul></ul><ul><ul><li>Topics often specific to a collection (e.g., State Library of Victoria) </li></ul></ul><ul><ul><li>Experience with CDL’s American West project </li></ul></ul><ul><li>Mix of scientific/research repositories? </li></ul><ul><ul><li>CiteSeer, arXiv, PubMed, … </li></ul></ul><ul><ul><li><description> is a reasonably reliable 200-word abstract </li></ul></ul><ul><ul><li>Average of 75 words per record </li></ul></ul><ul><ul><li>Topics more likely to span repositories </li></ul></ul><ul><li>For purposes of evaluation, we used (mostly) English-language repositories </li></ul>
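  8. 8. Selected Repositories*
     <table>
     <tr><th>Short Name</th><th>Description</th><th>Records</th><th>Records used for clustering (learning)</th></tr>
     <tr><td>arxiv</td><td>Eprint Archive</td><td>368,000</td><td>1 in 3</td></tr>
     <tr><td>caltech</td><td>Caltech Electronic Theses and Dissertations</td><td>3,000</td><td>-</td></tr>
     <tr><td>cern</td><td>CERN Document Server</td><td>45,000</td><td>1 in 2</td></tr>
     <tr><td>citeseer</td><td>CiteSeer Scientific Literature Digital Library</td><td>717,000</td><td>1 in 3</td></tr>
     <tr><td>doaj</td><td>Directory of Open Access Journals Articles</td><td>29,000</td><td>1 in 2</td></tr>
     <tr><td>iop</td><td>Institute of Physics</td><td>212,000</td><td>1 in 3</td></tr>
     <tr><td>loc</td><td>Library of Congress Digitized Historical Collections</td><td>239,000</td><td>-</td></tr>
     <tr><td>nsdl</td><td>The National Science Digital Library</td><td>33,000</td><td>1 in 2</td></tr>
     <tr><td>osti</td><td>Office of Scientific and Technical Information</td><td>131,000</td><td>1 in 3</td></tr>
     <tr><td>pangaea</td><td>Publishing Network for Geoscientific and Environmental Data</td><td>370,000</td><td>-</td></tr>
     <tr><td>pubmed</td><td>PubMed Central</td><td>625,000</td><td>1 in 3</td></tr>
     <tr><td>repec</td><td>Research Papers in Economics</td><td>141,000</td><td>1 in 3</td></tr>
     </table>
     *Repositories harvested by UMich/OAIster, June 7, 2006.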
  9. 9. Usage of Dublin Core Fields <ul><li>Decided to use words from <title>, <description>, and <subject> for clustering </li></ul><ul><li>Idiosyncrasies </li></ul><ul><ul><li>CiteSeer: repeats <author> and <title> in <subject> </li></ul></ul><ul><ul><li>CiteSeer: puts citations to other IDs in <description> </li></ul></ul><ul><ul><li>arXiv: puts comments such as “Comment: 12 pages PostScript” in <description> </li></ul></ul><ul><ul><li>RePEc: no <subject>, repeats ID in <description> </li></ul></ul><ul><ul><li>etc. </li></ul></ul><ul><li>Approach: Process all repositories identically, no special treatment </li></ul>
  10. 10. Preprocessing Example
      Original record:
      <ID=oai:CiteSeerPSU:44072>
      <title> Reinforcement Learning: A Survey
      <description> This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." …
      <subject> Leslie Pack Kaelbling, Michael Littman, Andrew Moore. Reinforcement Learning: A Survey
      After preprocessing (with vocabulary):
      <ID=oai:CiteSeerPSU:44072> reinforcement learning survey survey field reinforcement learning computer science perspective written accessible researcher familiar machine learning historical basis field broad selection current summarized reinforcement learning faced agent learn behavior trial error interaction dynamic environment resemblance psychology differ considerably detail word reinforcement … leslie pack kaelbling littman andrew moore reinforcement learning survey
  11. 11. Stopwords and Stemming <ul><li>Standard: and, the, … </li></ul><ul><li>Research related: research, paper, data, system, method, result, … </li></ul><ul><li>Repository specific: cern, citeseer, repec, Smith, … </li></ul><ul><li>All tokens starting with a digit: 1996, 401k, … </li></ul><ul><li>Produced a stopword list of 500 words </li></ul><ul><li>Applied very simple stemming (cars → car) </li></ul><ul><li>Note: replacing collocations improves interpretability of topics, but not quality (los angeles → los_angeles) </li></ul><ul><li>No need to find and exclude every stopword: the topic model helps find the rest (e.g., des, les, une, …), which can be suppressed after the fact </li></ul>
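The stopword and stemming rules on this slide can be sketched as a small preprocessing function. This is a minimal sketch, not the authors' preprocessor: the stopword set is a tiny illustrative sample of the roughly 500 words mentioned, and `simple_stem` is an assumed plural-stripping rule chosen because it reproduces forms such as `car` for `cars`.

```python
import re

# Tiny illustrative sample; the slide describes a list of about 500 stopwords.
STOPWORDS = {
    "and", "the", "a", "of", "to", "is", "it", "this", "from",   # standard
    "research", "paper", "data", "system", "method", "result",   # research related
    "cern", "citeseer", "repec",                                 # repository specific
}

def simple_stem(token):
    """Strip a trailing plural 's' (cars -> car). Assumed rule, not the authors'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(text):
    """Lowercase, tokenize, drop digit-initial tokens and stopwords, then stem."""
    kept = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token[0].isdigit():          # e.g., 1996, 401k
            continue
        if token in STOPWORDS:
            continue
        token = simple_stem(token)
        if token in STOPWORDS:          # e.g., results -> result
            continue
        kept.append(token)
    return kept

title_tokens = preprocess("Reinforcement Learning: A Survey")
mixed_tokens = preprocess("The cars of 1996 and 401k results")
```

Here `title_tokens` is `["reinforcement", "learning", "survey"]` and `mixed_tokens` is `["car"]`. Collocation replacement (los angeles → los_angeles) is left out of the sketch.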
  12. 12. Building Vocabulary <ul><li>Preprocessed the (sampled) repositories, excluding stopwords </li></ul><ul><li>Kept only words that occurred in more than 10 records </li></ul><ul><li>Result: a final vocabulary of ~90,000 words </li></ul><ul><li>Most frequent words: cell, high, energy, protein, function, algorithm, field, theory, physics, … </li></ul><ul><li>Resulting discussion point: When do we need to re-create the vocabulary? (When classifying, new documents will be filtered through the existing vocabulary) </li></ul>
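The vocabulary rule (keep words that occur in more than 10 records) and the filtering of new documents through an existing vocabulary can be sketched as follows. The function names `build_vocabulary` and `filter_through_vocabulary` are hypothetical, and the toy records stand in for preprocessed OAI records.

```python
from collections import Counter

def build_vocabulary(records, min_records=10):
    """Return the sorted words that occur in more than `min_records` records."""
    doc_freq = Counter()
    for tokens in records:
        doc_freq.update(set(tokens))    # count each word once per record
    return sorted(w for w, n in doc_freq.items() if n > min_records)

def filter_through_vocabulary(tokens, vocabulary):
    """Drop tokens that are not in the existing vocabulary (used when classifying)."""
    vocab = set(vocabulary)
    return [t for t in tokens if t in vocab]

# Toy corpus: "cell" is in 21 records, "protein" in 11, "rare" in only 10.
records = [["cell", "protein"]] * 11 + [["rare", "cell"]] * 10
vocabulary = build_vocabulary(records)
new_record = filter_through_vocabulary(["rare", "cell", "quark"], vocabulary)
```

Words that are new to the collection, such as `quark` here, are silently dropped at classification time, which is one way to read the discussion point about when the vocabulary needs to be re-created.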
  13. 13. <ul><li>Average of 75 words per record </li></ul><ul><li>Record lengths are bimodal because we used records both with and without abstracts </li></ul><ul><li>Topic model isn’t adversely affected by very short records </li></ul>
  14. 14. Computation <ul><li>Clustering (Learning) </li></ul><ul><ul><ul><li>D = 750,000 records </li></ul></ul></ul><ul><ul><ul><li>W = 90,000 word vocabulary </li></ul></ul></ul><ul><ul><ul><li>L = 75 words per record </li></ul></ul></ul><ul><ul><ul><li>T = 500 topics </li></ul></ul></ul><ul><ul><ul><li>iter = 500 iterations </li></ul></ul></ul><ul><ul><ul><li>memory ∝ 3DL + T(D+W) ≈ 3 GByte </li></ul></ul></ul><ul><ul><ul><li>time ∝ D × L × T × iter ≈ 3 days (3 GHz Xeon) </li></ul></ul></ul><ul><li>Classification </li></ul><ul><ul><ul><li>D = 3,000,000 records total </li></ul></ul></ul><ul><ul><ul><li>iter = 40 iterations </li></ul></ul></ul><ul><ul><ul><li>max memory = 2 GByte </li></ul></ul></ul><ul><ul><ul><li>max time = 5 hours (but repositories can run in parallel) </li></ul></ul></ul>Decision point: How many topics? Decision point: How many iterations?
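The memory and time expressions on this slide can be checked with a few lines of arithmetic. This is a rough worked example, assuming that the expression 3DL + T(D+W) counts stored integers and that each takes 4 bytes; the byte size is an assumption, not something the slide states.

```python
D = 750_000      # records used for clustering
W = 90_000       # vocabulary size
L = 75           # average words per record
T = 500          # topics
ITER = 500       # iterations
BYTES_PER_COUNT = 4   # assumption: 32-bit integers

# Memory: 3*D*L per-token values plus T*(D+W) topic counts.
stored_counts = 3 * D * L + T * (D + W)
memory_gb = stored_counts * BYTES_PER_COUNT / 1e9

# Time: one topic-weight evaluation per token, per topic, per iteration.
evaluations = D * L * T * ITER
seconds_in_three_days = 3 * 24 * 3600
evaluations_per_second = evaluations / seconds_in_three_days

print(f"{stored_counts:,} stored values, about {memory_gb:.2f} GB")
print(f"{evaluations:.3e} evaluations, about {evaluations_per_second:.2e} per second")
```

Under these assumptions the stored values come to roughly 2.4 GB, the same order as the 3 GByte quoted, and three days of running corresponds to a few tens of millions of topic-weight evaluations per second. Both the memory term T × D and the running time grow linearly with the number of topics, which is why the number of topics is flagged as a decision point.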
  15. 15. Broad Topical Categories <ul><li>500 topics are too many to look at </li></ul><ul><li>Need to organize topics under broad topical categories </li></ul><ul><ul><li>Cluster the clusters (automatic) </li></ul></ul><ul><ul><li>Use pre-defined categories </li></ul></ul><ul><ul><ul><li>Classify group of keywords (manual + automatic) </li></ul></ul></ul><ul><ul><ul><li>Create hierarchy by hand (manual) </li></ul></ul></ul>
  16. 16. Broad Topical Categories: Cluster the clusters [Diagram: OAI records → preprocess (with vocabulary) → topic model (cluster/learn) → topics → topic model (cluster/learn) → broad topical categories]
  17. 17. Broad Topical Categories: Cluster the clusters; Classify group of keywords [Diagram. Cluster the clusters: OAI records → preprocess (with vocabulary) → topic model (cluster/learn) → topics → topic model (cluster/learn) → broad topical categories. Classify group of keywords: group of keywords → preprocess (with vocabulary) → topic model (classify) → topics organized under broad topical categories]
  18. 18. Clustering, Classification, and Metadata Enhancement Techniques on OAI Records <ul><li>Preprocessing and Topic Modeling </li></ul><ul><li>The “Browser” </li></ul><ul><li>Lessons Learned and Next Steps </li></ul>
  19. 19. The “Browser” <ul><li>PHP/MySQL browser of 3 million OAI records* </li></ul><ul><li>Preserving transparency for this audience </li></ul><ul><li>Browser not meant for end users </li></ul><ul><li>No search, no information architecture, etc. </li></ul>*Based on 750,000 sampled records from 9 repositories, 500 topics
  20. 20. The “Browser”
  21. 21. Selected Topics: Useful <ul><li>[t201] learning machine training learn algorithm task examples reinforcement inductive learned learner supervised unsupervised </li></ul><ul><li>[t482] labor worker employment wage market labour job unemployment wages earning panel find evidence individual participation </li></ul><ul><li>[t381] algebraic geometry mathematic conjecture varieties projective variety theory cohomology moduli curves prove genus rational give math </li></ul><ul><li>[t097] dark matter universe astrophysic cosmological cosmic background density inflation spectrum power scale cmb halo cosmology gravitational </li></ul><ul><li>[t027] hiv virus human immunodeficiency type envelope infection viral cd4 infected gag replication reverse aid tat gp120 </li></ul><ul><li>[t365] waste radioactive wastes tank nuclear facilities management hanford disposal fuel storage material processing facility site level </li></ul><ul><li>> show all 500 sub-topics (to see all 500 topics) </li></ul>
  22. 22. Selected Topics: Less Useful <ul><li>[t255] journal author chapter vol notes editor publication issue special bibliography reader references appendix literature submitted topic </li></ul><ul><li>[t328] paul mark thank andrew scott stephen alan steven miller george martin obituaries thesis daniel prof ian </li></ul><ul><li>[t384] supported part grant author foundation partially contract science national nsf support advanced ccr provided center agency </li></ul><ul><li>[t112] look people difficult thing need want fact reason help understand think say alway try easy bad </li></ul><ul><li>[t496] increase increased increases decrease increasing decreased decreases observed change decreasing significant caused decline </li></ul><ul><li>[t012] des les dan une est par sur pour qui nous sont aux ces analyse pay cette (also very useful for filtering out French records) </li></ul><ul><li>But junk topics remove the need to find stopwords exhaustively: many useless words cluster into topics that can be suppressed </li></ul>
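Suppressing junk topics after the fact, and using a junk topic such as t012 to filter out French records, can be sketched as below. This assumes each record's topics are available as a mapping from topic id to proportion; the function names, the 0.5 threshold, and the toy proportions are all hypothetical.

```python
def suppress_topics(topic_mix, junk_topics):
    """Drop junk topics from a record's topic mix and renormalize the rest."""
    kept = {t: p for t, p in topic_mix.items() if t not in junk_topics}
    total = sum(kept.values())
    if total == 0:
        return {}                      # nothing but junk topics
    return {t: p / total for t, p in kept.items()}

def is_mostly_junk(topic_mix, junk_topics, threshold=0.5):
    """True when junk topics account for at least `threshold` of the record."""
    junk_share = sum(p for t, p in topic_mix.items() if t in junk_topics)
    return junk_share >= threshold

JUNK = {"t255", "t328", "t384", "t112", "t496", "t012"}

record = {"t201": 0.5, "t255": 0.25, "t012": 0.25}     # made-up proportions
cleaned = suppress_topics(record, JUNK)

french_record = {"t012": 0.75, "t097": 0.25}           # made-up proportions
looks_french = is_mostly_junk(french_record, {"t012"})
```

With these toy values `cleaned` is `{"t201": 1.0}` and `looks_french` is `True`. The threshold would need tuning on real records.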
  23. 23. Broad Topical Categories (BTCs) <ul><li>By clustering the clusters </li></ul><ul><ul><li>worked well </li></ul></ul><ul><ul><li>e.g., mathematics, global energy resources, … </li></ul></ul><ul><ul><li>can choose the desired number of broad topical categories (e.g., 25) and a threshold </li></ul></ul><ul><li>By classifying groups of keywords </li></ul><ul><ul><li>worked well too </li></ul></ul><ul><li>Then review and manually edit </li></ul><ul><ul><li>include or exclude any subtopic </li></ul></ul>
  24. 24. BTCs: Clustering the clusters
  25. 25. BTCs: Classifying group of keywords <ul><li>>>> Aerospace Engineering </li></ul><ul><li>stars (15) </li></ul><ul><li>space (18) </li></ul><ul><li>aeronautics (20) </li></ul><ul><li>astronautics (20) </li></ul><ul><li>rocket (12) </li></ul><ul><li>shuttle (12) </li></ul><ul><li>exploration (15) </li></ul><ul><li>lander (3) </li></ul><ul><li>planets (7) </li></ul><ul><li>black holes (7) </li></ul><ul><li>quasars (7) </li></ul><ul><li>pulsars (7) </li></ul><ul><li>observatories (10) </li></ul><ul><li>air traffic (10) </li></ul><ul><li>aircraft (15) </li></ul><ul><li>aerospace (20) </li></ul><ul><li>airplanes (10) </li></ul><ul><li>airports (10) </li></ul><ul><li>heliports (10) </li></ul><ul><li>helicopters (10) </li></ul><ul><li>aviation (18) </li></ul><ul><li>FAA (7) </li></ul><ul><li>airlines (12) </li></ul><ul><li>flight (18) </li></ul><ul><li>comets (10) </li></ul><ul><li>meteorites (12) </li></ul><ul><li>spacecraft (15) </li></ul><ul><li>air force (7) </li></ul><ul><li>pilots (7) </li></ul><ul><li>jets (7) </li></ul><ul><li>air travel (15) </li></ul><ul><li>flying (18) </li></ul>A domain expert specifies a list of relevant keywords, each with an importance weight (in parentheses)
  26. 26. BTCs: Classifying group of keywords <ul><li>>>> Aerospace Engineering </li></ul><ul><li>[t192] (69%) vehicle flight vehicles engine car road speed nasa aircraft air </li></ul><ul><li>[t352] (13%) star solar planet mass astrophysic binary dwarf orbital sun companion </li></ul><ul><li>[t191] (8%) space spaces hilbert subspace dimensional subspaces defined exploration linear point </li></ul><ul><li>>>> Dermatology </li></ul><ul><li>[t388] (83%) infection skin disease tract respiratory fever burgdorferi caused wound arthritis </li></ul><ul><li>[t157] (8%) cancer tumor p53 breast carcinoma survival human tumour malignant prostate </li></ul><ul><li>[t071] (7%) growth tuberculosis mycobacterium growing grow igf factor bcg avium </li></ul><ul><li>>>> Geology and Earth Sciences </li></ul><ul><li>[t121] (73%) geothermal rock seismic energy mountain drilling fluid survey spring yucca </li></ul><ul><li>[t268] (12%) sea atmospheric climate ice ocean atmosphere cloud global wind aerosol </li></ul><ul><li>>>> Molecular, Cellular and Developmental Biology </li></ul><ul><li>[t276] (31%) molecular biological sciences molecules biology molecule quantitative biochemistry basic </li></ul><ul><li>[t417] (15%) cell apoptosis cellular death cultured bcl lines hela transfected mediated </li></ul><ul><li>[t355] (12%) brain neuron neuronal cortex synaptic cortical rat nervous cerebral dopamine </li></ul><ul><li>[t418] (9%) genes genome gene repeat chromosome sequences dna genomic sequence region </li></ul><ul><li>[t319] (7%) mice development mouse drosophila expression transgenic cell embryonic embryos gene </li></ul><ul><li>>>> Transportation </li></ul><ul><li>[t192] (85%) vehicle flight vehicles engine car road speed nasa aircraft air </li></ul>Callouts: in review, would delete this topic from this BTC; just found 1 topic relevant to transportation
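Mapping a weighted keyword list to topics can be approximated with a simple scoring scheme. The deck runs the keyword group through the topic model as a classification step; the sketch below is a simpler stand-in, not that method: it scores each topic by the importance-weighted probability of the keywords under that topic and reports each topic's share. The function name, the `min_share` threshold, and every probability value are made up for illustration; the keywords are written in stemmed form.

```python
def rank_topics_for_keywords(keyword_weights, topic_word_probs, min_share=0.05):
    """Rank topics by importance-weighted keyword probability (a simple proxy)."""
    scores = {
        topic: sum(weight * probs.get(word, 0.0)
                   for word, weight in keyword_weights.items())
        for topic, probs in topic_word_probs.items()
    }
    total = sum(scores.values())
    if total == 0:
        return []
    shares = [(t, s / total) for t, s in scores.items() if s / total >= min_share]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)

# Importance weights taken from the Aerospace Engineering keyword list.
aerospace = {"space": 18, "flight": 18, "aircraft": 15, "star": 15}

# Made-up word probabilities for four topics (illustration only).
topics = {
    "t192": {"vehicle": 0.06, "flight": 0.05, "aircraft": 0.04},
    "t352": {"star": 0.08, "solar": 0.05},
    "t191": {"space": 0.09, "hilbert": 0.04},
    "t027": {"hiv": 0.10, "virus": 0.08},
}

ranked = rank_topics_for_keywords(aerospace, topics)
```

With these invented numbers the Hilbert-space topic t191 ranks first simply because it contains the word "space", which illustrates why a manual review pass over the resulting categories is useful.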
  27. 27. Browse Records in a Topic [Screen callouts: nice mix of repositories; can navigate back to multiple BTCs]
  28. 28. Browse Records in a Topic: From one repository [Screen callout: display records just from Library of Congress]
  29. 29. Sample Record <ul><li>Murphy's Law in algebraic geometry: Badly-behaved deformation spaces </li></ul><ul><ul><li>> preprocessed text </li></ul></ul><ul><ul><li>murphy law algebraic geometry badly behaved deformation spaces </li></ul></ul><ul><ul><li>consider question bad deformation space object answer priori reason deformation space bad moduli spaces </li></ul></ul><ul><ul><li>precisely singularity finite type smooth parameter hilbert scheme curves projective space moduli spaces </li></ul></ul><ul><ul><li>smooth projective type surfaces higher dimensional varieties plane curves nodes cusp stable sheaves </li></ul></ul><ul><ul><li>isolated threefold singularities object pathological fact nice curves smooth surfaces ample canonical bundle </li></ul></ul><ul><ul><li>stable sheaves torsion free rank singularities normal cohen macaulay justifies mumford philosophy moduli </li></ul></ul><ul><ul><li>spaces behaved object arbitrarily bad priori reason construct smooth curve projective space deformation </li></ul></ul><ul><ul><li>space component singularity type reduced behavior subschemes similarly give surface f_p lift course hold </li></ul></ul><ul><ul><li>holomorphic category difficult compute deformation spaces directly obstruction theories circumvent relating </li></ul></ul><ul><ul><li>tractable deformation spaces smooth morphism essential starting point mnev universality theorem </li></ul></ul><ul><ul><li>mathematic algebraic geometry mathematic complex variables </li></ul></ul><ul><li>> top topics </li></ul><ul><li>[t381] algebraic geometry mathematic conjecture varieties projective variety theory cohomology moduli curves prove genus rational give math [t191] space spaces </li></ul>Callouts: link to actual OAI record; topics for this record
  30. 30. Repository-specific Browsers <ul><li>Library of Congress </li></ul><ul><li>University of Michigan </li></ul><ul><li>University of Washington </li></ul><ul><li>African Journals Online </li></ul><ul><li>and many more… </li></ul>
  31. 31. Clustering, Classification, and Metadata Enhancement Techniques on OAI Records <ul><li>Preprocessing and Topic Modeling </li></ul><ul><li>The “Browser” </li></ul><ul><li>Lessons Learned and Next Steps </li></ul>
  32. 32. Evaluation <ul><li>Topic modeling worked well </li></ul><ul><ul><li>Most topics were useful </li></ul></ul><ul><ul><li>Demand on computing resources was reasonable </li></ul></ul><ul><ul><li>Human effort was relatively small </li></ul></ul><ul><ul><li>All repositories processed identically, no special treatment </li></ul></ul><ul><li>Strategy worked well </li></ul><ul><ul><li>Clustering, then </li></ul></ul><ul><ul><li>Classification, and </li></ul></ul><ul><ul><li>Broad Topical Categories creation </li></ul></ul>
  33. 33. Further Evaluation <ul><li>Current processing only for </li></ul><ul><ul><li>English-language repositories </li></ul></ul><ul><ul><li>Science/research based repositories </li></ul></ul><ul><li>Need to test cultural heritage repositories and foreign-language records </li></ul><ul><ul><li>Less consistent descriptive language and length </li></ul></ul><ul><ul><li>“On-the-horse” problem more prevalent </li></ul></ul><ul><ul><li>Greater need to individually process repositories </li></ul></ul><ul><li>Also need usability testing to evaluate further </li></ul><ul><ul><li>Depends on criteria -- who are users? </li></ul></ul><ul><ul><ul><li>Librarians? </li></ul></ul></ul><ul><ul><ul><li>End-users? </li></ul></ul></ul><ul><ul><li>Depends on products and services desired by users </li></ul></ul>
  34. 34. Discussion Point: When to Re-cluster? <ul><li>Need to re-cluster </li></ul><ul><ul><li>when collection changes significantly </li></ul></ul><ul><ul><li>if there is a “hole” in topics </li></ul></ul><ul><ul><li>but NOT just because you have another repository </li></ul></ul><ul><li>If you re-cluster </li></ul><ul><ul><li>all topics will be different </li></ul></ul><ul><ul><li>have to discard hand-labeling </li></ul></ul><ul><ul><li>Broad Topical Categories might be different </li></ul></ul><ul><li>However, classification is </li></ul><ul><ul><li>“cheap” and easy </li></ul></ul><ul><ul><li>e.g., for OAIster, could re-classify every harvest…until spring clean </li></ul></ul>[Diagram: timeline of cluster and classify steps, with classify steps outnumbering cluster steps]
  35. 35. Products and Services <ul><li>Depending on users… </li></ul><ul><li>What kind of service is useful? </li></ul><ul><li>What should the interface to topics look and act like? </li></ul><ul><li>What kind of use should we envision? </li></ul><ul><li>We have some ideas… </li></ul>
  36. 36. Archive of Topics <ul><li>Are the topics we created useful to anyone else? </li></ul><ul><li>Scenario: librarian uses topics/classifier for local resources </li></ul><ul><li>To use locally you need: </li></ul><ul><ul><li>the preprocessor (i.e., the preprocessing rules) </li></ul></ul><ul><ul><li>the vocabulary (file of 90,000 words) </li></ul></ul><ul><ul><li>the topic model classifier </li></ul></ul>
  37. 37. Subject Search/Browse for OAIster <ul><li>Integrate topics into OAIster </li></ul><ul><ul><li>add to records so users can perform canned topic searches </li></ul></ul><ul><ul><li>add to interface so users can browse from BTCs to records </li></ul></ul><ul><li>Additionally, can allow users to find records similar to those retrieved </li></ul><ul><ul><li>e.g., having retrieved records on cosmology, a user can find similar records on astrophysics, relativity, … </li></ul></ul><ul><li>How to do this? </li></ul>
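One possible answer to the "find similar records" question is to compare records by their topic mixtures. The slide leaves the method open, so this is only a sketch of one option: cosine similarity between per-record topic proportions. The record ids, proportions, and function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse topic mixtures (dicts of proportions)."""
    dot = sum(p * b.get(t, 0.0) for t, p in a.items())
    norm_a = math.sqrt(sum(p * p for p in a.values()))
    norm_b = math.sqrt(sum(p * p for p in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(query_id, record_topics, k=3):
    """Return up to k (record id, similarity) pairs, most similar first."""
    query = record_topics[query_id]
    scored = [(rid, cosine(query, mix))
              for rid, mix in record_topics.items() if rid != query_id]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up topic mixtures for three records.
record_topics = {
    "cosmology-1": {"t097": 0.9, "t381": 0.1},
    "astro-2": {"t097": 0.7, "t352": 0.3},
    "labor-3": {"t482": 1.0},
}

similar = most_similar("cosmology-1", record_topics, k=2)
```

Here `astro-2` ranks first because it shares topic t097 with the query, while `labor-3` scores zero because the two records share no topics. For millions of records an index over topic ids would be needed rather than this all-pairs scan.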
  38. 38. How To Reach Us <ul><li>David Newman: University of California, Irvine </li></ul><ul><li><newman@uci.edu> </li></ul><ul><li>Kat Hagedorn: University of Michigan </li></ul><ul><li>< [email_address] > </li></ul><ul><li>Bill Landis: California Digital Library </li></ul><ul><li>< [email_address] > </li></ul>