Text Analytics on 2 Million Documents: A Case Study

Speaker notes:
  • KEA performs better than 8 of the best taggers.
  • Dev machine: 4-core CPU, 8 GB RAM (Rackspace).
  • Among the top 10 keywords from the full document, 91% also appear in the keywords from the cropped document (so, basically, 9 out of 10 are the same).
  • Costs ~250 USD.

    1. Text Analytics World, Boston, October 3-4, 2012. Text Analytics on 2 Million Documents: A Case Study. Plus: an introduction to keyword extraction. Alyona Medelyan.
    2. What are these books about? "Because He Could" by D. Morris & E. McGann; "Still Stripping After 25 Years" by E. Burns; "Glut" by A. Wright. Only metadata will tell…
    3. What this talk will cover:
       • Who am I & my relation to the topic
       • What types of keyword extraction are out there
       • How keyword extraction works
       • How accurate keywords can be
       • How to analyze 2 million documents efficiently
    4. My background: @zelandiya, medelyan.com
       • 2005-2009: PhD thesis on keyword extraction, "Human-competitive automatic topic indexing" (nzdl.org/kea/)
       • Maui: multi-purpose automatic topic indexing (maui-indexer.googlecode.com)
       • 2010: co-organized the SemEval-2 keyword extraction competition, Track 5, "Automatic keyphrase extraction from scientific articles"
       • 2010-2012: leading the R&D of Pingar's text analytics API (keyword & named entity extraction, summarization, etc.)
    5. Findability is ensured with the help of metadata.
       • Easy to extract: title, file type & location, creation & modification date, authors, publisher
       • Difficult to extract: keywords & keyphrases, people & companies mentioned, suppliers & addresses mentioned
    6. What can text analytics determine from text? Keywords, tags, taxonomy terms and entities (the focus of this presentation), as well as sentiment, genre, categories, names, biochemical patterns, and more. [Slide shows these outputs annotated on sample documents.]
    7. Types of keyword extraction (or topic indexing):
       • Subject headings in libraries: general, with Library of Congress Subject Headings; domain-specific, in PubMed with MeSH (controlled indexing: taxonomy terms, categories)
       • Keyphrases in academic publications (keywords)
       • Tags in folksonomies: by authors on Technorati, by users on Del.icio.us (free indexing: tags)
    8. Free indexing vs. controlled indexing:
       • Free indexing (e.g. keywords, tags): inconsistent, no control, no semantics, ad hoc
       • Controlled indexing (e.g. LCSH, ACM, MeSH): restricted, centrally controlled, inflexible, not always available
    9. How keyword extraction works: Document → Candidates → Keywords.
       Step 1: Extract phrases using the sliding window approach, ignoring stopwords (see the sketch below).
       Example: "NEJM usually has the highest impact factor of the journals of clinical medicine."
       Candidates: NEJM; highest, highest impact, highest impact factor; impact, impact factor; …
       Alternative approach: (a) assign part-of-speech tags, (b) extract valid noun phrases (NPs).
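A minimal sketch of the sliding-window step, assuming a toy stopword list (real systems use a few hundred stopwords) and a maximum phrase length of three words:

```python
import re

STOPWORDS = {"usually", "has", "the", "of"}  # toy list for this example
MAX_LEN = 3  # longest candidate phrase, in words

def candidates(text, max_len=MAX_LEN):
    """Split the token stream at stopwords and punctuation, then emit
    every n-gram (n <= max_len) inside each run of content words."""
    tokens = re.findall(r"[A-Za-z]+", text)
    runs, run = [], []
    for tok in tokens:
        if tok.lower() in STOPWORDS:
            if run:
                runs.append(run)
            run = []
        else:
            run.append(tok)
    if run:
        runs.append(run)
    return [" ".join(r[i:i + n])
            for r in runs
            for i in range(len(r))
            for n in range(1, max_len + 1)
            if i + n <= len(r)]

print(candidates("NEJM usually has the highest impact factor "
                 "of the journals of clinical medicine."))
# ['NEJM', 'highest', 'highest impact', 'highest impact factor',
#  'impact', 'impact factor', 'factor', 'journals', 'clinical',
#  'clinical medicine', 'medicine']
```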
    10. How keyword extraction works: Document → Candidates → Keywords.
        Step 2: Normalize phrases (case folding, stemming, etc.) and match them against a vocabulary:

        Candidate               Normalized            Matched term
        NEJM                    nejm                  New England J of Med
        highest                 high                  -
        highest impact factor   high impact factor    -
        impact                  impact                -
        impact factor           impact factor         Impact Factor
        journals                journal               Journal
        journals of clinical    journal of clinic     -
        clinical                clinic                Clinic
        clinical medicine       clinic medic          Medicine
        medicine                medic                 Medicine
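A sketch of this normalization step using NLTK's Porter stemmer. Note the assumptions: the slide's forms like "high" and "medic" suggest a more aggressive stemmer than Porter, so exact outputs differ, and the vocabulary mapping here is a hypothetical stand-in:

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()

# Hypothetical mapping from normalized forms to controlled vocabulary terms.
VOCABULARY = {"nejm": "New England J of Med", "impact factor": "Impact Factor"}

def normalize(phrase):
    """Case-fold and stem each word so that surface variants such as
    'Journals' and 'journal' collapse into one candidate form."""
    return " ".join(stemmer.stem(w) for w in phrase.lower().split())

for p in ["NEJM", "impact factor", "journals", "clinical medicine"]:
    norm = normalize(p)
    print(p, "->", norm, "->", VOCABULARY.get(norm, "-"))
```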
    11. How keyword extraction works: Document → Candidates → Properties → Keywords.
        1. Frequency: number of occurrences (incl. synonyms)
        2. Position: beginning/end of a document, title, headers
        3. Phrase length: longer means more specific
        4. Similarity: semantic relatedness to other candidates
        5. Corpus statistics: how prominent the phrase is in this particular text
        6. Popularity: how often people select this candidate
        7. Part-of-speech pattern: some patterns are more common
        …
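A toy feature extractor covering the first three properties; the remaining ones need corpus-level statistics or external resources, so they are omitted here:

```python
def features(candidate, positions, doc_length):
    """Feature vector for one candidate phrase.
    `positions`: token offsets of every occurrence (incl. synonyms);
    `doc_length`: document length in tokens."""
    return {
        "frequency": len(positions),
        "first_position": positions[0] / doc_length,  # 0.0 ≈ title, 1.0 ≈ end
        "phrase_length": len(candidate.split()),      # longer = more specific
    }

print(features("impact factor", positions=[5, 120, 487], doc_length=500))
# {'frequency': 3, 'first_position': 0.01, 'phrase_length': 2}
```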
    12. How keyword extraction works: Document → Candidates → Properties → Scoring → Keywords.
        • Heuristics: a formula that combines the most powerful features. Requires accurate crafting; performs equally well (or equally less well) across various domains.
        • Supervised machine learning: train a model from manually indexed documents. Requires training data; performs really well on documents that are similar to the training data, but poorly on dissimilar ones.
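A minimal sketch of the supervised route, assuming hypothetical feature vectors (TF×IDF, first-occurrence position, phrase length) and scikit-learn's Naive Bayes, the classifier family KEA itself uses:

```python
from sklearn.naive_bayes import GaussianNB  # pip install scikit-learn

# Each row: [tf_idf, first_position, phrase_length] for one candidate;
# label 1 means a human indexer chose it as a keyword.
X_train = [
    [0.021, 0.05, 2],
    [0.002, 0.90, 1],
    [0.015, 0.10, 3],
    [0.001, 0.75, 1],
]
y_train = [1, 0, 1, 0]

model = GaussianNB().fit(X_train, y_train)

# Rank unseen candidates by their predicted keyword probability.
X_new = [[0.018, 0.08, 2], [0.003, 0.60, 1]]
print(model.predict_proba(X_new)[:, 1])  # higher = better keyword
```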
    13. How accurate is keyword extraction?
        • It's subjective…
        • But: the higher the indexing consistency, the better the search effectiveness (findability).
        A = set of keyphrases 1, B = set of keyphrases 2, C = set of keyphrases in common:
        Consistency(Rolling) = 2C / (A + B)
        Consistency(Hopper) = C / (A + B - C)
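The two measures in code; Rolling's measure is the Dice coefficient and Hopper's is the Jaccard index:

```python
def consistency(a, b):
    """Inter-indexer consistency between two sets of keyphrases."""
    c = len(a & b)                        # keyphrases in common
    rolling = 2 * c / (len(a) + len(b))   # Dice coefficient
    hopper = c / (len(a) + len(b) - c)    # Jaccard index
    return rolling, hopper

a = {"nutrition policies", "overweight", "diet", "taxes"}
b = {"nutrition policies", "overweight", "food consumption"}
print(consistency(a, b))  # (0.571..., 0.4)
```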
    14. Professional indexers' keywords.* [Slide shows a cloud of Agrovoc terms: human nutrition, nutrition policies, food consumption, diet, overweight, feeding habits, nutrition surveillance, fiscal policies, taxes, urbanization, globalization, developing countries, …]
        * Six professional FAO indexers assigned terms from the Agrovoc thesaurus to the same document, entitled "The global obesity problem".
    15. Comparison of 2 indexers. [Same Agrovoc term cloud, with the terms chosen by Indexer 1 and Indexer 2 highlighted and Agrovoc relations linking related terms.]
    16. Comparison of 6 indexers & Kea. [Same term cloud, now also highlighting the Kea algorithm's picks alongside indexers 1-6, e.g. body weight, overweight, price fixing, saturated fat, controlled prices, price policies.]
    17. Comparison of CS students* & Maui.
        * 15 teams of 2 students each assigned keywords to the same document, entitled "A safe, efficient regression test selection technique".
    18. Human vs. algorithm consistency.
        Six professional indexers vs. Kea, on 30 agricultural documents & the Agrovoc thesaurus:

        Method          Min   Avg   Max
        Professionals   26    39    47
        KEA             24    32    38

        15 teams of 2 CS students vs. Maui, on 20 CS documents & the Wikipedia vocabulary:

        Method          Min   Avg   Max
        Students        21    31    37
        Maui            24    32    36

        CiteULike taggers vs. Maui (each tagger had ≥ 2 co-taggers), free indexing:

                                  With other taggers   With Maui
        330 taggers & 180 docs    19                   24
        35 taggers & 140 docs     38                   35
    19. Text Analytics on 2 Million Documents: A Case Study. In collaboration with Gene Golovchinsky (fxpal.com/?p=gene).
    20. The dataset, in context:
        • Twitter: 490 million tweets per week, 84 GB
        • CiteSeer: 1.7 million scientific publications, 110 GB
        • Wikipedia: 3.6 million articles, 13 GB
        • Britannica: 0.65 million articles, 0.3 GB
        • ICWSM 2011 (news, blogs, forums, etc.): 2.1 TB (compressed!)
        Sources: slideshare.net/raffikrikorian/twitter-by-the-numbers, en.wikipedia.org/wiki/Wikipedia:Size_comparisons
    21. The task (goal):
        1. Extract all phrases that appear in search results.
        2. Weigh and suggest the best phrases for query refinement.
        For Gene's collaborative search system, Querium (a toy sketch of the idea follows below).
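This is not Querium's actual weighting scheme, but a toy sketch of the aggregation idea: count how many search results each extracted phrase covers, and suggest the most frequent ones as refinements:

```python
from collections import Counter

def refinement_suggestions(phrases_per_result, top_k=5):
    """`phrases_per_result`: one list of extracted phrases per search
    result. Suggest the phrases that cover the most results."""
    counts = Counter(p for doc in phrases_per_result for p in set(doc))
    return [phrase for phrase, _ in counts.most_common(top_k)]

results = [["ontology", "knowledge base"],
           ["ontology", "first-order logic"],
           ["knowledge base", "ontology"]]
print(refinement_suggestions(results, top_k=2))  # ['ontology', 'knowledge base']
```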
    22. Step 1: Get time estimates.
        A. Take a subset, e.g. 100 documents.
        B. Run it on various machines / settings.
        C. Extrapolate to the entire dataset, e.g. 1.7M docs.
        Our example:
        • Standard laptop (4 cores, 8 GB RAM): 30 days
        • Similar Rackspace VM: 46 days
        • Threading reduces time: 24 days
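The extrapolation is simple proportionality. A sketch, with a hypothetical sample timing reverse-engineered from the slide's 30-day laptop figure:

```python
def estimated_days(sample_docs, sample_seconds, total_docs):
    """Scale a timing run on a small sample up to the whole corpus."""
    total_seconds = sample_seconds / sample_docs * total_docs
    return total_seconds / (24 * 3600)

# If 100 sample docs take ~152 s on the laptop, 1.7M docs take ~30 days.
print(estimated_days(100, 152, 1_700_000))  # ≈ 29.9
```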
    23. Step 2: Look into your data. Understand the nature of your data: look at samples, compute statistics. Speed up by removing anomalies & targeting the text analytics.
        Our example: 30% of docs exceed 50 KB (some ≈600 KB), but the most important phrases appear in the title, abstract, introduction and conclusions. → Only process the first 30% and the last 20% of each document. This reduces the time by 57%!
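A sketch of that cropping rule, assuming a simple character-based cut (cutting on page or section boundaries would be more precise):

```python
def crop(text, head=0.30, tail=0.20):
    """Keep only the first 30% and last 20% of a document, where the
    title, abstract, introduction and conclusions live."""
    n = len(text)
    return text[:int(n * head)] + text[n - int(n * tail):]

doc = "x" * 1000
print(len(crop(doc)))  # 500: half the characters, most of the keywords
```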
    24. Validate: can we crop our documents? The top 20 keywords were extracted from the original and the cropped version of the same paper*; the two lists share most terms (ontology, knowledge base, knowledge representation, knowledge engineering, Semantic Web, WordNet, predicate logic, artificial intelligence, semantic networks, first-order logic, ontology engineering, higher-order logic, conceptual graphs, natural language, lexicon, block diagram, …).

        Top N keywords      How many were found
        in original doc     in the cropped doc
        10                  91%
        50                  80%
        100                 75%
        All                 64%

        * "Toward principles for the design of ontologies used for knowledge sharing", T. R. Gruber (1993).
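The validation itself is an overlap-at-N computation, as sketched here on a few of the slide's terms:

```python
def overlap_at_n(original_keywords, cropped_keywords, n):
    """Fraction of the top-n keywords of the full document that also
    appear among the cropped document's keywords."""
    top = original_keywords[:n]
    found = sum(1 for kw in top if kw in set(cropped_keywords))
    return found / len(top)

original = ["ontology", "knowledge base", "knowledge representation",
            "Semantic Web", "WordNet"]
cropped = ["ontology", "knowledge base", "knowledge engineering",
           "WordNet", "predicate logic"]
print(overlap_at_n(original, cropped, 5))  # 0.6 in this toy example
# Per the slide, averaged over the sample, overlap at N=10 is 91%,
# i.e. roughly 9 of the top 10 keywords survive the cropping.
```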
    25. Step 3: Go cloud. Don't be afraid to bring out the big guns:
        • Large Elastic Compute instance: 1000 docs × 4 threads = 30 min
        • High-CPU Extra Large (8 virtual cores): 1000 docs × 24 threads = 6 min
        Also: increase the number of machines. 4 machines = 4 times faster, i.e. 50 instead of 200 hours (or one weekend!).
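Parallelism arithmetic, assuming perfect scaling across threads and machines (real jobs lose some time to I/O and coordination, which is likely why the slide's figures are a bit higher):

```python
def wall_clock_hours(total_docs, docs_per_batch, minutes_per_batch, machines=1):
    """Wall-clock time when identical machines split the corpus evenly."""
    batches = total_docs / docs_per_batch
    return batches * minutes_per_batch / 60 / machines

# 1.7M docs at 1000 docs per 6-minute batch:
print(wall_clock_hours(1_700_000, 1000, 6))     # 170 h on one machine
print(wall_clock_hours(1_700_000, 1000, 6, 4))  # 42.5 h on four machines
# The slide quotes ~200 and ~50 hours for the same setup.
```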
    26. How long would a human need to extract keywords from 1.7M docs?

        Min per doc   Minutes      Hours     Days*    Years**
        1             1,700,000    28,333    3,542    14
        2             3,400,000    56,666    7,083    28
        3             5,100,000    85,000    10,625   42

        * Assuming 8 hours per working day.
        ** Assuming 250 working days per year (no holidays, no sick days).
        Image: flickr.com/photos/mararie/2663711551/
    27. Summary: Document → Candidates → Properties → Scoring → Keywords.
        To estimate quality, take a sample and compute inter-indexer consistency between several people.
        CiteSeer: 1.7 million scientific publications, 110 GB.
        1. Get time estimates.
        2. Look into your data.
        (Both can be done in a weekend.)
        3. Go cloud.
        Don't do it manually!
        Keyword extraction: medelyan.com/files/phd2009.pdf
        CiteSeer study: pingar.com/technical-blog/
        Pingar API: apidemo.pingar.com
