Text Analytics World, Boston, October 3-4, 2012




Text Analytics on 2 Million
Documents: A Case Study
Plus, An Introduction to Keyword Extraction

Alyona Medelyan
What are these books about?
“Because he could” by D. Morris, E. McGann

“Still stripping after 25 years” by E. Burns

“Glut” by A. Wright




                   Only metadata will tell…
What this talk will cover:
•   Who am I & my relation to the topic
•   What types of keyword extraction are out there
•   How does keyword extraction work
•   How accurate can keywords be
•   How to analyze 2 million documents efficiently
My Background

            @zelandiya
            medelyan.com

            2005-2009 PhD Thesis on keyword extraction
            “Human-competitive automatic topic indexing”
            Kea: nzdl.org/kea/
            Maui (Multi-purpose automatic topic indexing): maui-indexer.googlecode.com

            2010 co-organized keyword extraction competition
            SemEval-2, Track 5 “Automatic keyphrase extraction from scientific articles”

            2010-2012 leading the R&D of Pingar’s text analytics API
            Pingar API features: keyword & named entity extraction, summarization etc.
Findability is ensured with the help of metadata



 Document -> Metadata

 Easy to extract:
 Title, file type & location,
 creation & modification date,
 authors, publisher

 Difficult to extract:
 Keywords & keyphrases,
 people & companies mentioned,
 suppliers & addresses mentioned
What can text analytics determine from text?
[Diagram: what can be determined from a text]
• keywords, tags (focus of this presentation)
• categories, taxonomy terms
• entities: names, patterns, biochemical entities, …
• sentiment
• genre
Types of keyword extraction (or topic indexing)


Controlled indexing (categories, taxonomy terms)
• Subject headings in libraries
   • general with Library of Congress Subject Headings
   • domain-specific in PubMed with MeSH

Free indexing (keywords, tags)
• Keyphrases in academic publications
• Tags in folksonomies
   • by authors on Technorati
   • by users on Del.icio.us
Free indexing         Controlled indexing
E.g. keywords, tags   E.g. LCSH, ACM, MeSH
Inconsistent          Restricted
No control            Centrally controlled
No semantics          Inflexible
Ad hoc                Not always available
How keyword extraction works
 Document        Candidates                                                 Keywords




1. Extract phrases using the sliding window approach (ignore stopwords)

NEJM usually has the highest impact factor of the journals of clinical medicine.

Candidates:
NEJM
highest, highest impact, highest impact factor
impact, impact factor…

Alternative approach:
a) Assign part-of-speech tags
b) Extract valid noun phrases (NPs)
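
The sliding-window step can be sketched roughly as follows. This is only an illustration, not the actual Kea or Maui code: the stopword list, the 3-word window limit (`max_len`) and the rule that a candidate may not start or end with a stopword are all assumptions made for this example.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"usually", "has", "the", "of", "a", "an", "and", "in"}

def candidates(sentence, max_len=3):
    """Slide a window of 1..max_len words over the sentence and keep
    phrases that neither start nor end with a stopword."""
    words = re.findall(r"[A-Za-z][A-Za-z-]*", sentence)
    found = []
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            window = words[start:start + length]
            if len(window) < length:
                break  # the window ran off the end of the sentence
            if window[0].lower() in STOPWORDS or window[-1].lower() in STOPWORDS:
                continue
            phrase = " ".join(window)
            if phrase not in found:
                found.append(phrase)
    return found

sentence = ("NEJM usually has the highest impact factor "
            "of the journals of clinical medicine.")
phrases = candidates(sentence)
```

Stopwords inside a phrase are allowed here, which is why "journals of clinical" survives as a candidate.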
How keyword extraction works
 Document         Candidates                                                Keywords




2. Normalize phrases (case folding, stemming etc.)
NEJM usually has the highest impact factor of the journals of clinical medicine.
NEJM                              nejm                          New England J of Med
highest                           high                          -
highest impact factor             high impact factor            -
impact                            impact                        -
impact factor                     impact factor                 Impact Factor
journals                          journal                       Journal
journals of clinical              journal of clinic             -
clinical                          clinic                        Clinic
clinical medicine                 clinic medic                  Medicine
medicine                          medic                         Medicine
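
A minimal sketch of the normalization step, assuming case folding plus a toy suffix stripper. The suffix list below is hand-picked only so that it reproduces the stems shown on this slide; it is not the stemmer used in Kea or Maui, and a real system would use a proper stemming algorithm.

```python
# Toy suffix list (an assumption, chosen to match the slide's examples only).
SUFFIXES = ("est", "ine", "al", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def normalize(phrase):
    """Case-fold a phrase, then stem each of its words."""
    return " ".join(toy_stem(w) for w in phrase.lower().split())

normalized = {p: normalize(p) for p in
              ["NEJM", "highest impact factor", "journals of clinical",
               "clinical medicine"]}
```

Two surface forms that normalize to the same string can then be counted as one candidate.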
How keyword extraction works
Document    Candidates        Properties                         Keywords




           1.   Frequency: number of occurrences (incl. synonyms)
           2.   Position: beginning/end of a document, title, headers
           3.   Phrase length: longer means more specific
           4.   Similarity: semantic relatedness to other candidates
           5.   Corpus statistics: how prominent in this particular text
           6.   Popularity: how often people select this candidate
           7.   Part of speech pattern: some patterns are more common
           …
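
As a rough illustration, three of these properties (frequency, position, phrase length) could be computed like this. The function name `features` and the naive substring matching are assumptions for the sketch; a real system would match normalized tokens rather than raw substrings.

```python
def features(text, phrase):
    """Compute three simple properties of a candidate phrase in a text."""
    text_l, phrase_l = text.lower(), phrase.lower()
    first = text_l.find(phrase_l)
    return {
        # number of (non-overlapping) occurrences
        "frequency": text_l.count(phrase_l),
        # relative position of the first occurrence: 0.0 = start of document
        "first_position": first / len(text_l) if first >= 0 else None,
        # longer phrases tend to be more specific
        "length_in_words": len(phrase_l.split()),
    }

doc = "Impact factor matters. The impact factor of NEJM is high."
props = features(doc, "impact factor")
```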
How keyword extraction works
Document        Candidates            Properties       Scoring          Keywords




Heuristics
A formula that combines the most powerful features
• requires careful crafting
• performs equally well or less well across various domains

Supervised machine learning
Train a model from manually indexed documents
• requires training data
• performs really well on docs that are similar to the training data,
  but poorly on dissimilar ones
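
To make the heuristic route concrete, here is one possible hand-crafted formula: TF x IDF, boosted when the phrase first occurs early in the document. The formula, the function names and the example numbers are all assumptions for illustration, not the scoring used by any particular tool.

```python
import math

def heuristic_score(tf, df, n_docs, first_position):
    """Illustrative formula: TF x IDF x (1 - relative first position).
    tf: occurrences in this document
    df: number of corpus documents containing the phrase
    first_position: 0.0 (start of document) .. 1.0 (end)"""
    idf = math.log((1 + n_docs) / (1 + df))
    return tf * idf * (1.0 - first_position)

def rank(candidates, n_docs, top_n=5):
    """candidates maps phrase -> (tf, df, first_position)."""
    scored = {p: heuristic_score(tf, df, n_docs, pos)
              for p, (tf, df, pos) in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]

# Hypothetical counts for three candidates in a 1000-document corpus.
example = {
    "impact factor": (3, 10, 0.1),
    "journal": (5, 900, 0.0),
    "medicine": (1, 400, 0.9),
}
best = rank(example, n_docs=1000)
```

A supervised approach would replace `heuristic_score` with a model trained on manually indexed documents, using the same properties as input features.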
How accurate is keyword extraction?
• It’s subjective…
• But: the higher the indexing consistency is,
  the better the search effectiveness (findability)

                     A: set of keyphrases 1
                     B: set of keyphrases 2
                     C: set of keyphrases in common

                     (in the formulas, A, B and C stand for the sizes of these sets)

                     Consistency (Rolling) = 2C / (A + B)
                     Consistency (Hooper)  = C / (A + B - C)
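
Both measures are easy to compute from two keyphrase sets. A small sketch, where the two term sets are made up for the example (they borrow Agrovoc-style terms but are not real indexer output):

```python
def rolling(a, b):
    """Rolling's consistency: 2C / (A + B)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def hooper(a, b):
    """Hooper's consistency: C / (A + B - C), i.e. overlap over union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical keyphrase sets from two indexers.
indexer_1 = {"diet", "overweight", "food intake", "taxes"}
indexer_2 = {"diet", "overweight", "prices"}
r = rolling(indexer_1, indexer_2)   # 2*2 / (4 + 3)
h = hooper(indexer_1, indexer_2)    # 2 / (4 + 3 - 2)
```

Rolling's measure is never lower than Hooper's for the same pair of sets, so the two should not be compared with each other directly.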
Professional indexers’ keywords*
Agrovoc terms:
energy value, public health, nutritional disorders, regulations,
weight reduction, nutrient excesses, disease control, developing countries,
nutritional requirements, diet, dietary guidelines, nutrition status,
nutrition programs, developed countries, feeding habits, meal patterns,
nutrition surveillance, overweight, food policies, nutritional physiology,
price formation, food intake, overeating, human nutrition, nutrition policies,
price policies, foods, food consumption, fiscal policies, prices,
direct taxation, urbanization, globalization, taxes

                 * 6 professional FAO indexers assigned terms from the Agrovoc thesaurus
                   to the same document, entitled “The global obesity problem”
Comparison of 2 indexers
[Figure: the Agrovoc term map from the previous slide, with a legend for
Agrovoc terms, Agrovoc relations, Indexer 1 and Indexer 2]
Comparison of 6 indexers & Kea
[Figure: the same Agrovoc term map, with a legend for Agrovoc terms,
Agrovoc relations, Indexers 1-6 and the Kea algorithm]
Terms on this map that do not appear on the earlier slides:
body weight, saturated fat, price fixing, controlled prices
Comparison of CS students* & Maui




  * 15 teams of 2 students each assigned keywords to the same document,
    entitled “A safe, efficient regression test selection technique”
Human vs. algorithm consistency
6 Professional indexers vs. Kea on 30 agricultural documents & Agrovoc thesaurus
 Method                       Min             Avg            Max
 Professionals                26              39             47
 KEA                          24              32             38

15 teams of 2 CS students vs. Maui on 20 CS documents & Wikipedia vocabulary
 Method                       Min             Avg            Max
 Students                     21              31             37
 Maui                         24              32             36

CiteULike taggers vs. Maui (each tagger had ≥ 2 co-taggers) & free indexing
                            With other taggers      With Maui
330 taggers & 180 docs      19                      24
35 taggers & 140 docs       38                      35
Text Analytics on 2 Million Documents:
             A Case Study

                     +

             Collaboration with Gene Golovchinsky
             fxpal.com/?p=gene
The dataset
• CiteSeer: 1.7 Million scientific publications, 110 GB
• Twitter: 490 Million tweets per week, 84 GB
• Wikipedia: 3.6 Million articles, 13 GB
• Britannica: 0.65 Million articles, 0.3 GB
• ICWSM 2011: news, blogs, forums, etc., 2.1 TB (compressed!)

Sources:
slideshare.net/raffikrikorian/twitter-by-the-numbers
en.wikipedia.org/wiki/Wikipedia:Size_comparisons
The task goal
1. Extract all phrases that appear in search results
2. Weigh and suggest the best phrases for query refinement




          Gene’s collaborative search system Querium
Step 1: Get time estimates
A. Take a subset, e.g. 100 documents
B. Run on various machines / settings
C. Extrapolate to the entire dataset, e.g. 1.7M docs


        Our example:

        •   Standard laptop 4 Core, 8GB RAM: 30 days
        •   Similar Rackspace VM: 46 days
        •   Threading reduces time: 24 days
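
The extrapolation in step C is simple arithmetic. A sketch, where the sample timing of 152.5 seconds is a made-up figure chosen so that the result lands near the 30-day laptop estimate above:

```python
SECONDS_PER_DAY = 86_400

def extrapolate_days(sample_seconds, sample_docs, total_docs):
    """Scale the time measured on a sample up to the whole dataset."""
    seconds_per_doc = sample_seconds / sample_docs
    return seconds_per_doc * total_docs / SECONDS_PER_DAY

# Hypothetical measurement: 100 documents processed in 152.5 seconds.
days = extrapolate_days(152.5, 100, 1_700_000)
```

This assumes the sample is representative; a sample with unusually short or long documents will skew the estimate.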
Step 2: Look into your data
Understand the nature of your data:
              look at samples, compute statistics.
Speed up by removing anomalies & targeting the text analytics.


                       Our example:

                       30% of docs exceed 50KB (some ≈600KB)

                       The most important phrases appear in the title,
                       abstract, introduction and conclusions.

                       Only process the top 30% and the last 20% of each document.
                       This reduces the time by 57%!
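
Cropping a document to its first 30% and last 20% could look like this. Cutting by character offset is an assumption of the sketch; cutting at sentence or section boundaries would be cleaner.

```python
def crop(text, head_pct=30, tail_pct=20):
    """Keep the first head_pct% and the last tail_pct% of the characters."""
    if head_pct + tail_pct >= 100:
        return text  # nothing to cut
    n = len(text)
    head_end = n * head_pct // 100
    tail_start = n - n * tail_pct // 100
    # A newline separates the two parts so no phrase spans the cut.
    return text[:head_end] + "\n" + text[tail_start:]

document = "a" * 30 + "b" * 50 + "c" * 20   # stand-in for a 100-character text
cropped = crop(document)
```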
Validate: Can we crop our documents?
   Top N keywords       How many were found
   in original doc      in the cropped doc
        10                    91%
        50                    80%
       100                    75%
       All                    64%

   Top 20 keywords from*…
   …original document           …cropped document
   ontology                     ontology
   knowledge base               knowledge base
   knowledge representation     knowledge engineering
   Semantic Web                 knowledge representation
   WordNet                      WordNet
   knowledge engineering        predicate logic
   predicate logic              artificial intelligence
   artificial intelligence      ontology engineering
   semantic networks            semantic networks
   natural language             Semantic Web
   first-order logic            first-order logic
   ontology engineering         block diagram
   lexicon                      dynamic systems
   conceptual graphs            higher-order logic
   higher-order logic           conceptual graphs
   natural language processing  modeling & simulation
   design rationale             universe of discourse
   block diagram                bond graph
                                lexicon

   * Toward principles for the design of ontologies used for knowledge sharing,
     T. R. Gruber (1993)
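
The "how many were found" column is the share of the original document's top N keywords that also show up among the cropped document's keywords. A sketch of that check, using a shortened, made-up pair of ranked lists:

```python
def found_in_cropped(original_ranked, cropped_ranked, top_n=None):
    """Fraction of the top_n original keywords also extracted from the
    cropped document (top_n=None means all original keywords)."""
    top = original_ranked[:top_n] if top_n else original_ranked
    if not top:
        return 0.0
    cropped = set(cropped_ranked)
    return sum(1 for k in top if k in cropped) / len(top)

# Hypothetical short lists, best keyword first.
original = ["ontology", "knowledge base", "knowledge representation",
            "Semantic Web", "design rationale"]
cropped = ["ontology", "knowledge base", "knowledge engineering",
           "knowledge representation", "bond graph"]
top2 = found_in_cropped(original, cropped, top_n=2)
overall = found_in_cropped(original, cropped)
```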
Step 3: Go cloud

Don’t be afraid to bring out the big guns

   •   Large Elastic Compute instance
       1000 docs x 4 threads = 30 min

   •   High-CPU Extra Large (8 virtual cores)
       1000 docs x 24 threads = 6 min

Also: increase the number of machines

   •   4 machines = 4 times faster,
       i.e. 50 instead of 200 hours (or 1 weekend!)
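
The numbers above can be checked with a back-of-the-envelope calculation. The sketch assumes perfect linear scaling across machines, which real jobs rarely achieve; the function names are made up for the example.

```python
def single_machine_hours(total_docs, minutes_per_1000_docs):
    """Total processing time on one machine at a measured throughput."""
    return total_docs / 1000 * minutes_per_1000_docs / 60

def wall_clock_hours(total_hours, machines):
    """Elapsed time if the work splits evenly across identical machines."""
    return total_hours / machines

# 6 min per 1000 docs, extrapolated to 1.7M docs.
estimate = single_machine_hours(1_700_000, 6)
# The slide's rounder figure of 200 hours, split over 4 machines.
weekend = wall_clock_hours(200, 4)
```

At 6 minutes per 1000 documents, 1.7M documents come to 170 hours on one machine, in the same ballpark as the roughly 200 hours quoted above.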
How long would a human
 need to extract keywords
 from 1.7M docs?

 Min per doc    Minutes      Hours     Days*    Years**
      1         1,700,000    28,333     3,542      14
      2         3,400,000    56,667     7,083      28
      3         5,100,000    85,000    10,625      42

* Assuming 8 hours per working day
** Assuming 250 working days per year (no holidays, no sick days)
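
The table follows from a few divisions, using the 8-hour day and 250-day year stated in the footnotes. A sketch (the function name is made up):

```python
def human_effort(total_docs, minutes_per_doc, hours_per_day=8, days_per_year=250):
    """Return (minutes, hours, working days, working years) of manual effort."""
    minutes = total_docs * minutes_per_doc
    hours = minutes / 60
    days = hours / hours_per_day
    years = days / days_per_year
    return minutes, hours, days, years

minutes, hours, days, years = human_effort(1_700_000, 1)
```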




                                                                   http://www.flickr.com/photos/mararie/2663711551/
Document       Candidates        Properties          Scoring         Keywords



To estimate quality, take a sample and compute
inter-indexer consistency between several people.

CiteSeer: 1.7 Million scientific publications, 110 GB
Can be done in a weekend:
1.   Get time estimates
2.   Look into your data
3.   Go cloud

Don’t do it manually!

Keyword extraction : medelyan.com/files/phd2009.pdf
CiteSeer study: pingar.com/technical-blog/
Pingar API: apidemo.pingar.com

More Related Content

Similar to Text Analytics Case Study on 2 Million Documents

Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibEl Habib NFAOUI
 
A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)UNCResearchHub
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...RajkiranVeluri
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewINFOGAIN PUBLICATION
 
Content analysis
Content analysisContent analysis
Content analysisAtul Thakur
 
Content analysis
Content analysisContent analysis
Content analysisAtul Thakur
 
AI-SDV 2021: Jay ven Eman - implementation-of-new-technology-within-a-big-pha...
AI-SDV 2021: Jay ven Eman - implementation-of-new-technology-within-a-big-pha...AI-SDV 2021: Jay ven Eman - implementation-of-new-technology-within-a-big-pha...
AI-SDV 2021: Jay ven Eman - implementation-of-new-technology-within-a-big-pha...Dr. Haxel Consult
 
A Gentle Introduction to Text Analysis I
A Gentle Introduction to Text Analysis IA Gentle Introduction to Text Analysis I
A Gentle Introduction to Text Analysis IUNCResearchHub
 
SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence
SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence
SearchInFocus: Exploratory Study on Query Logs and Actionable Intelligence Marina Santini
 
It services & research methods
It services & research methodsIt services & research methods
It services & research methodsAkanshShandilya
 
A Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text miningA Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text miningIJSRD
 
COMM1170 - MSE - Level 8 (Alcock and Guo) March 2013
COMM1170 - MSE - Level 8 (Alcock and Guo) March 2013COMM1170 - MSE - Level 8 (Alcock and Guo) March 2013
COMM1170 - MSE - Level 8 (Alcock and Guo) March 2013Melanie Parlette-Stewart
 
Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...Using a keyword extraction pipeline to understand concepts in future work sec...
Using a keyword extraction pipeline to understand concepts in future work sec...Kai Li
 
Beyond Transparency: Success & Lessons From tambisBoston2003
Beyond Transparency: Success & Lessons From tambisBoston2003Beyond Transparency: Success & Lessons From tambisBoston2003
Beyond Transparency: Success & Lessons From tambisBoston2003robertstevens65
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningIOSR Journals
 

Similar to Text Analytics Case Study on 2 Million Documents (20)

Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
 
NLP todo
NLP todoNLP todo
NLP todo
 
A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)A Gentle Introduction to Text Analysis :)
A Gentle Introduction to Text Analysis :)
 
Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...Natural Language Processing, Techniques, Current Trends and Applications in I...
Natural Language Processing, Techniques, Current Trends and Applications in I...
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A Review
 
Content analysis
Content analysisContent analysis
Content analysis
 
Content analysis
Content analysisContent analysis
Content analysis
 
AI-SDV 2021: Jay ven Eman - implementation-of-new-technology-within-a-big-pha...
AI-SDV 2021: Jay ven Eman - implementation-of-new-technology-within-a-big-pha...AI-SDV 2021: Jay ven Eman - implementation-of-new-technology-within-a-big-pha...
AI-SDV 2021: Jay ven Eman - implementation-of-new-technology-within-a-big-pha...
 
Qualitative Data Analysis I: Text Analysis
Qualitative Data Analysis I: Text AnalysisQualitative Data Analysis I: Text Analysis


Text Analytics Case Study on 2 Million Documents

  • 1. Text Analytics World, Boston, October 3-4, 2012 Text Analytics on 2 Million Documents: A Case Study Plus, An Introduction into Keyword Extraction Alyona Medelyan
  • 2. What are these books about? “Because he could” by D. Morris, E. McGann “Still stripping after 25 years” by E. Burns “Glut” by A. Wright Only metadata will tell…
  • 3. What this talk will cover: • Who am I & my relation to the topic • What types of keyword extraction are out there • How does keyword extraction work • How accurate can keywords be • How to analyze 2 million documents efficiently
  • 4. My Background @zelandiya medelyan.com 2005-2009 PhD Thesis on keyword extraction “Human-competitive automatic topic indexing” Maui Multi-purpose automatic topic indexing nzdl.org/kea/ maui-indexer.googlecode.com 2010 co-organized keyword extraction competition SemEval-2 SemEval-2, Track 5 “Automatic keyphrase extraction from scientific articles” 2010-2012 leading the R&D of Pingar’s text analytics API Pingar API features: keyword & named entities extraction, summarization etc.
  • 5. Findability is ensured with the help of metadata Document Easy to extract: Metadata Title, file type & location, creation & modification date, authors, publisher Difficult to extract: Keywords & keyphrases, people & companies mentioned, suppliers & addresses mentioned
  • 6. What can text analytics determine from text? Keywords and tags (the focus of this presentation), sentiment, genre, categories, taxonomy terms, and entities (names, biochemical patterns, …)
  • 7. Types of keyword extraction (or topic indexing) • Controlled indexing (categories, taxonomy terms): subject headings in libraries • general with Library of Congress Subject Headings • domain-specific in PubMed with MeSH • Free indexing (keywords, tags): keyphrases in academic publications; tags in folksonomies • by authors on Technorati • by users on Del.icio.us
  • 8. Free indexing (e.g. keywords, tags): inconsistent, no control, no semantics, ad hoc. Controlled indexing (e.g. LCSH, ACM, MeSH): restricted, centrally controlled, inflexible, not always available.
  • 9. How keyword extraction works Document Candidates Keywords 1. Extract phrases using the sliding window approach (ignore stopwords). Example: “NEJM usually has the highest impact factor of the journals of clinical medicine.” Candidates: NEJM; highest, highest impact, highest impact factor; impact, impact factor… Alternative approach: a) Assign part-of-speech tags b) Extract valid noun phrases (NPs)
  • 10. How keyword extraction works Document Candidates Keywords 2. Normalize phrases (case folding, stemming etc.) Example: “NEJM usually has the highest impact factor of the journals of clinical medicine.”
      NEJM | nejm | New England J of Med
      highest | high | -
      highest impact factor | high impact factor | -
      impact | impact | -
      impact factor | impact factor | Impact Factor
      journals | journal | Journal
      journals of clinical | journal of clinic | -
      clinical | clinic | Clinic
      clinical medicine | clinic medic | Medicine
      medicine | medic | Medicine
  • 11. How keyword extraction works Document Candidates Properties Keywords 1. Frequency: number of occurrences (incl. synonyms) 2. Position: beginning/end of a document, title, headers 3. Phrase length: longer means more specific 4. Similarity: semantic relatedness to other candidates 5. Corpus statistics: how prominent in this particular text 6. Popularity: how often people select this candidate 7. Part of speech pattern: some patterns are more common …
  • 12. How keyword extraction works Document Candidates Properties Scoring Keywords
      Heuristics: a formula that combines the most powerful features • requires accurate crafting • performs equally well or less well across various domains
      Supervised machine learning: train a model from manually indexed documents • requires training data • performs really well on docs that are similar to training data, but poorly on dissimilar ones
  • 13. How accurate is keyword extraction? • It’s subjective… • But: the higher the indexing consistency is, the better the search effectiveness (findability). A – set of keyphrases 1, B – set of keyphrases 2, C – set of keyphrases in common. Consistency (Rolling) = 2C / (A + B). Consistency (Hooper) = C / (A + B – C)
  • 14. Professional indexers’ keywords* Agrovoc terms: energy public value nutritional health disorders regulations weight reduction nutrient disease developing excesses control countries nutritional diet requirements dietary nutrition nutrition developed guidelines feeding status programs countries meal habits patterns nutrition surveillance overweight food nutritional policies price physiology formation food overeating intake human nutrition nutrition policies price foods food fiscal policies consumption policies prices direct urbanization globalization taxation taxes * 6 professional FAO indexers assigned terms from the Agrovoc thesaurus to the same document, entitled “The global obesity problem”
  • 15. Comparison of 2 indexers Agrovoc terms: energy public Agrovoc relation: value nutritional health disorders regulations Indexer 1: weight reduction nutrient disease developing Indexer 2: excesses countries control nutritional diet requirements dietary nutrition nutrition developed guidelines feeding status programs countries meal habits patterns nutrition surveillance overweight food nutritional policies price physiology formation food overeating intake human nutrition nutrition policies price foods food fiscal policies consumption policies prices direct urbanization globalization taxation taxes
  • 16. Comparison of 6 indexers & Kea Agrovoc terms: energy public Agrovoc relation: value nutritional health disorders regulations Indexers: weight reduction nutrient 1 2 3 4 5 6 disease developing excesses control countries nutritional Kea Algorithm: diet requirements dietary nutrition nutrition developed guidelines feeding status programs countries meal habits patterns nutrition body weight overweight surveillance food nutritional policies price physiology formation price fixing saturated fat food overeating intake human nutrition nutrition policies controlled prices foods food price policies consumption fiscal policies policies prices direct urbanization globalization taxation taxes
  • 17. Comparison of CS students* & Maui * 15 teams of 2 students each assigned keywords to the same document, entitled “A safe, efficient regression test selection technique”
  • 18. Human vs. algorithm consistency
      6 professional indexers vs. Kea on 30 agricultural documents & Agrovoc thesaurus
      Method | Min | Avg | Max
      Professionals | 26 | 39 | 47
      KEA | 24 | 32 | 38
      15 teams of 2 CS students vs. Maui on 20 CS documents & Wikipedia vocabulary
      Method | Min | Avg | Max
      Students | 21 | 31 | 37
      Maui | 24 | 32 | 36
      CiteULike taggers vs. Maui (each tagger had ≥ 2 co-taggers) & free indexing
      330 taggers & 180 docs: 19 with other taggers, 24 with Maui
      35 taggers & 140 docs: 38 with other taggers, 35 with Maui
  • 19. Text Analytics on 2 Million Documents: A Case Study + Collaboration with Gene Golovchinsky fxpal.com/?p=gene
  • 20. The dataset
      Twitter: 490 Million tweets per week, 84 GB
      CiteSeer: 1.7 Million scientific publications, 110 GB
      Wikipedia: 3.6 Million articles, 13 GB
      Britannica: 0.65 Million articles, 0.3 GB
      ICWSM 2011: news, blogs, forums, etc., 2.1 TB (compressed!)
      Sources: slideshare.net/raffikrikorian/twitter-by-the-numbers en.wikipedia.org/wiki/Wikipedia:Size_comparisons
  • 21. The task goal 1. Extract all phrases that appear in search results 2. Weigh and suggest the best phrases for query refinement Gene’s collaborative search system Querium
  • 22. Step 1: Get time estimates A. Take a subset, e.g. 100 documents B. Run on various machines / settings C. Extrapolate to the entire dataset, e.g. 1.7M docs Our example: • Standard laptop 4 Core, 8GB RAM: 30 days • Similar Rackspace VM: 46 days • Threading reduces time: 24 days
  • 23. Step 2: Look into your data Understand the nature of your data: look at samples, compute statistics. Speed up by removing anomalies & targeting the text analytics. Our example: 30% of docs exceed 50KB (some ≈600KB). The most important phrases appear in the title, abstract, introduction and conclusions. → Only process the top 30% and last 20%. This reduces the time by 57%!
  • 24. Validate: Can we crop our documents? Top 20 keywords from*… …original document ...cropped document Top N How many were ontology ontology knowledge base knowledge base keywords in found in the knowledge knowledge engineering original doc cropped doc representation knowledge Semantic Web representation 10 91% WordNet WordNet 50 80% knowledge engineering predicate logic predicate logic artificial intelligence 100 75% artificial intelligence ontology engineering semantic networks semantic networks All 64% natural language Semantic Web first-order logic first-order logic ontology engineering block diagram lexicon dynamic systems conceptual graphs higher-order logic higher-order logic conceptual graphs natural language modeling & simulation processing universe of discourse * Toward principles for the design of design rationale bond graph ontologies used for knowledge sharing block diagram lexicon T. R. Gruber (1993)
  • 25. Step 3: Go cloud Don’t be afraid to bring out the big guns • Large Elastic Compute instance 1000 docs x 4 threads = 30 min • High-CPU Extra Large (8 virtual cores) 1000 docs x 24 threads = 6 min Also: increase the number of machines • 4 machines = 4 times faster, i.e. 50 instead of 200 hours (or 1 weekend!)
  • 26. How long would a human need to extract keywords from 1.7M docs?
      Min per doc | Min | Hours | Days* | Years**
      1 | 1,700,000 | 28,333 | 3,542 | 14
      2 | 3,400,000 | 56,666 | 7,083 | 28
      3 | 5,100,000 | 85,000 | 10,625 | 42
      * Taking into account 8h per working day ** Assuming 250 working days per year (no holidays, no sick days) http://www.flickr.com/photos/mararie/2663711551/
  • 27. Document Candidates Properties Scoring Keywords To estimate quality, take a sample and compute inter-indexer consistency between several people. CiteSeer: 1.7 Million scientific publications, 110 GB. 1. Get time estimates 2. Look into your data 3. Go cloud. Can be done in a weekend. Don’t do it manually! Keyword extraction: medelyan.com/files/phd2009.pdf CiteSeer study: pingar.com/technical-blog/ Pingar API: apidemo.pingar.com
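
The closing slide recommends estimating quality by computing inter-indexer consistency on a sample. A minimal sketch of the two measures from slide 13, 2C / (A + B) and C / (A + B – C), might look like the following. The function names and the two example term sets are illustrative assumptions, not data from the study; treating A and B as set sizes after deduplication is also an assumption.

```python
def rolling_consistency(terms_a, terms_b):
    """2C / (A + B): C is the number of shared terms, A and B the set sizes."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # assumption: two empty term sets count as fully consistent
    common = len(a & b)
    return 2 * common / (len(a) + len(b))


def hooper_consistency(terms_a, terms_b):
    """C / (A + B - C): shared terms divided by all distinct terms."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # same assumption as above
    common = len(a & b)
    return common / (len(a) + len(b) - common)


# Hypothetical term assignments for one document (made up for illustration).
indexer_1 = {"overweight", "nutrition policies", "food consumption", "taxes"}
indexer_2 = {"overweight", "nutrition policies", "price policies", "diet"}

print(f"{rolling_consistency(indexer_1, indexer_2):.2f}")  # 0.50
print(f"{hooper_consistency(indexer_1, indexer_2):.2f}")   # 0.33
```

With several indexers, one option is to average the pairwise scores for each person (or for the algorithm) against everyone else, which gives per-method figures comparable in shape to the Min / Avg / Max table on slide 18.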

Editor's Notes

  1. KEA performs better than 8 of the best taggers
  2. Dev machine: 4 Core CPU, 8 GB RAM. Rackspace
  3. So among the top 10 keywords from the full document, 91% appear in the keywords from the cropped document (so basically 9 out of 10 are the same).
  4. Costs ~250 USD