Text Analytics Case Study on 2 Million Documents

Text Analytics World, Boston, October 3-4, 2012

Text Analytics on 2 Million Documents: A Case Study
Plus, an Introduction to Keyword Extraction

Alyona Medelyan

What are these books about?
“Because he could” by D. Morris, E. McGann

“Still stripping after 25 years” by E. Burns

“Glut” by A. Wright

Only metadata will tell…

What this talk will cover:
• Who am I & my relation to the topic
• What types of keyword extraction are out there
• How does keyword extraction work
• How accurate can keywords be
• How to analyze 2 million documents efficiently

My Background

@zelandiya
medelyan.com
2005-2009 PhD Thesis on keyword extraction:
“Human-competitive automatic topic indexing”
Maui: Multi-purpose automatic topic indexing

nzdl.org/kea/
maui-indexer.googlecode.com

2010 co-organized keyword extraction competition
SemEval-2, Track 5: “Automatic keyphrase extraction from scientific articles”

2010-2012 leading the R&D of Pingar’s text analytics API
Pingar API features: keyword & named entities extraction, summarization etc.

Findability is ensured with the help of metadata

Document metadata that is easy to extract:
Title, file type & location,
creation & modification date,
authors, publisher

Difficult to extract:
Keywords & keyphrases,
people & companies mentioned,
suppliers & addresses mentioned

What can text analytics determine from text?
(keywords are the focus of this presentation)

[Diagram: a text surrounded by what text analytics can determine from it: keywords, tags, sentiment, genre, categories, taxonomy terms, and entities (names, patterns, biochemical entities, …)]

Types of keyword extraction (or topic indexing)

Controlled indexing (taxonomy terms):
• Subject headings in libraries
  • general with Library of Congress Subject Headings
  • domain-specific in PubMed with MeSH categories

Free indexing (keywords, tags):
• Keyphrases in academic publications
• Tags in folksonomies
  • by authors on Technorati
  • by users on Del.icio.us

Free indexing          Controlled indexing
E.g. keywords, tags    E.g. LCSH, ACM, MeSH
Inconsistent           Restricted
No control             Centrally controlled
No semantics           Inflexible
Ad hoc                 Not always available

How keyword extraction works
Document → Candidates → Keywords

1. Extract phrases using the sliding window approach
NEJM usually has the highest impact factor of the journals of clinical medicine.

(ignore stopwords)

Alternative approach:
a) Assign part-of-speech tags
b) Extract valid noun phrases (NPs)

NEJM
highest, highest impact, highest impact factor
impact, impact factor…
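
The sliding window step can be sketched in a few lines of Python. This is a minimal illustration, not the code behind Kea or Maui: the stopword list, the maximum window of 3 words and the rule "a candidate may not start or end with a stopword" are all assumptions made for the example.

```python
import re

# Tiny illustrative stopword list (an assumption; real systems use a much longer one).
STOPWORDS = {"a", "an", "the", "of", "has", "usually", "is", "in", "and", "to"}

def candidate_phrases(text, max_len=3):
    """Slide a window of 1..max_len words over the text and keep every
    window that neither starts nor ends with a stopword."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    candidates = []
    for start in range(len(tokens)):
        for length in range(1, max_len + 1):
            window = tokens[start:start + length]
            if len(window) < length:
                break  # the window ran past the end of the text
            if window[0].lower() in STOPWORDS or window[-1].lower() in STOPWORDS:
                continue
            candidates.append(" ".join(window))
    return candidates

sentence = ("NEJM usually has the highest impact factor "
            "of the journals of clinical medicine.")
print(candidate_phrases(sentence))
```

Stopwords are still allowed inside a phrase, which is why "journals of clinical" survives as a candidate.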

Document → Candidates → Keywords

2. Normalize phrases (case folding, stemming etc.)
NEJM usually has the highest impact factor of the journals of clinical medicine.
Phrase                  Normalized           Matching term
NEJM                    nejm                 New England J of Med
highest                 high                 -
highest impact factor   high impact factor   -
impact                  impact               -
impact factor           impact factor        Impact Factor
journals                journal              Journal
journals of clinical    journal of clinic    -
clinical                clinic               Clinic
clinical medicine       clinic medic         Medicine
medicine                medic                Medicine

Document → Candidates → Properties → Keywords

1. Frequency: number of occurrences (incl. synonyms)
2. Position: beginning/end of a document, title, headers
3. Phrase length: longer means more specific
4. Similarity: semantic relatedness to other candidates
5. Corpus statistics: how prominent in this particular text
6. Popularity: how often people select this candidate
7. Part of speech pattern: some patterns are more common
…
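
Three of these properties (frequency, position and phrase length) can be computed directly from the token sequence. The sketch below is an assumed, simplified version: it matches exact lowercase tokens only, ignores synonyms, and the function name and returned keys are hypothetical.

```python
def phrase_features(tokens, phrase):
    """Return frequency, relative first position and length (in words)
    of a phrase within a tokenized document, or None if it never occurs."""
    words = phrase.lower().split()
    n = len(words)
    lowered = [t.lower() for t in tokens]
    positions = [i for i in range(len(lowered) - n + 1)
                 if lowered[i:i + n] == words]
    if not positions:
        return None
    return {
        "frequency": len(positions),                     # 1. Frequency
        "first_position": positions[0] / len(lowered),   # 2. Position (0.0 = start)
        "length": n,                                     # 3. Phrase length
    }

doc = "impact factor matters because impact factor is cited".split()
print(phrase_features(doc, "impact factor"))
```

The remaining properties (similarity, corpus statistics, popularity, part-of-speech patterns) need resources beyond the single document, such as a thesaurus, a reference corpus or a tagger.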

Document → Candidates → Properties → Scoring → Keywords

Heuristics:
• A formula that combines the most powerful features
• requires accurate crafting
• performs equally well (or less well) across various domains

Supervised machine learning:
• Train a model from manually indexed documents
• requires training data
• performs really well on docs that are similar to the training data, but poorly on dissimilar ones
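
As an illustration of the heuristic route, here is one possible hand-crafted formula. It is an assumption for the sake of example, not the formula used in any particular system: a smoothed TF×IDF weight multiplied by a bonus for phrases that first occur early in the document.

```python
import math

def heuristic_score(frequency, doc_length, first_position,
                    docs_with_phrase, total_docs):
    """Combine three features into a single score (illustrative formula only)."""
    tf = frequency / doc_length                              # how prominent in this text
    idf = math.log((1 + total_docs) / (1 + docs_with_phrase))  # how rare in the corpus
    return tf * idf * (1.0 - first_position)                 # earlier first mention scores higher

# A rare phrase introduced at the start vs. a common phrase introduced near the end.
print(heuristic_score(5, 1000, 0.0, 10, 1000))
print(heuristic_score(5, 1000, 0.9, 900, 1000))
```

A supervised approach would replace the fixed formula with a model that learns from manually indexed documents how to weigh the same features.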

How accurate is keyword extraction?
• It’s subjective…
• But: the higher the indexing consistency is,
the better the search effectiveness (findability)

A = number of keyphrases in set 1
B = number of keyphrases in set 2
C = number of keyphrases the two sets have in common

Consistency (Rolling) = 2C / (A + B)
Consistency (Hooper) = C / (A + B - C)
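
Both consistency measures are easy to compute once each indexer's keyphrases are held in a set. A minimal sketch follows; the function names and the example keyphrases are made up for illustration, and returning 0.0 for two empty sets is an assumption.

```python
def rolling_consistency(set1, set2):
    """Rolling's measure: 2C / (A + B)."""
    a, b = set(set1), set(set2)
    if not a and not b:
        return 0.0  # assumption: two empty sets count as no agreement
    return 2 * len(a & b) / (len(a) + len(b))

def hooper_consistency(set1, set2):
    """The second measure on the slide (usually attributed to Hooper): C / (A + B - C)."""
    a, b = set(set1), set(set2)
    if not a and not b:
        return 0.0
    common = len(a & b)
    return common / (len(a) + len(b) - common)

indexer1 = {"obesity", "nutrition", "taxes", "food policy"}
indexer2 = {"obesity", "taxes", "diet"}
print(rolling_consistency(indexer1, indexer2))  # 2*2 / (4+3)
print(hooper_consistency(indexer1, indexer2))   # 2 / (4+3-2)
```

The second measure is never higher than the first, because shared keyphrases are counted only once in its denominator.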

Professional indexers’ keywords*
[Figure: graph of the Agrovoc terms that the indexers assigned to the document, including terms such as overweight, overeating, taxes, urbanization and globalization]

* 6 professional FAO indexers assigned terms from the Agrovoc thesaurus
to the same document, entitled “The global obesity problem”

Comparison of 2 indexers
[Figure: the same term graph, with a legend for Agrovoc relations and with the terms chosen by Indexer 1 and Indexer 2 marked]

Comparison of 6 indexers & Kea
[Figure: the same term graph, with the choices of indexers 1 to 6 and of the Kea algorithm marked; terms that appear only in this version include body weight, price fixing, saturated fat and controlled prices]

Comparison of CS students* & Maui

* 15 teams of 2 students each assigned keywords to the same document,
entitled “A safe, efficient regression test selection technique”

Human vs. algorithm consistency
6 professional indexers vs. Kea on 30 agricultural documents & Agrovoc thesaurus
Method          Min   Avg   Max
Professionals   26    39    47
Kea             24    32    38

15 teams of 2 CS students vs. Maui on 20 CS documents & Wikipedia vocabulary
Method          Min   Avg   Max
Students        21    31    37
Maui            24    32    36

CiteULike taggers vs. Maui (each tagger had ≥ 2 co-taggers) & free indexing
Taggers & docs            With other taggers   With Maui
330 taggers & 180 docs    19                   24
35 taggers & 140 docs     38                   35

Text Analytics on 2 Million Documents:
A Case Study

Collaboration with Gene Golovchinsky
fxpal.com/?p=gene

The dataset
• Twitter: 490 Million tweets per week, 84 GB
• CiteSeer: 1.7 Million scientific publications, 110 GB
• Wikipedia: 3.6 Million articles, 13 GB
• Britannica: 0.65 Million articles, 0.3 GB
• ICWSM 2011: news, blogs, forums, etc., 2.1 TB (compressed!)

Sources:
slideshare.net/raffikrikorian/twitter-by-the-numbers
en.wikipedia.org/wiki/Wikipedia:Size_comparisons

The task goal
1. Extract all phrases that appear in search results
2. Weigh and suggest the best phrases for query refinement

Gene’s collaborative search system Querium

Step 1: Get time estimates
A. Take a subset, e.g. 100 documents
B. Run on various machines / settings
C. Extrapolate to the entire dataset, e.g. 1.7M docs

Our example:

• Standard laptop 4 Core, 8GB RAM: 30 days
• Similar Rackspace VM: 46 days
• Threading reduces time: 24 days
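
The extrapolation in step C is plain arithmetic. In the sketch below the helper names are hypothetical, and the sample timing of 152 seconds for 100 documents is an invented figure, chosen only so that the result lands near the 30-day laptop estimate above.

```python
import time

def time_sample(process, docs):
    """Run `process` on every sample document and return the elapsed seconds."""
    start = time.perf_counter()
    for doc in docs:
        process(doc)
    return time.perf_counter() - start

def extrapolate_days(sample_docs, sample_seconds, total_docs):
    """Scale the measured time per document up to the whole dataset, in days."""
    seconds_per_doc = sample_seconds / sample_docs
    return seconds_per_doc * total_docs / 86_400  # 86,400 seconds per day

# Hypothetical measurement: 100 sample docs took 152 seconds.
print(round(extrapolate_days(100, 152, 1_700_000), 1), "days")
```

Repeating the measurement per machine and per thread setting gives a comparison like the one in the list above.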

Step 2: Look into your data
Understand the nature of your data:
look at samples, compute statistics.
Speed up by removing anomalies & targeting the text analytics.

Our example:

30% of docs exceed 50KB (some ≈600KB)

The most important phrases appear in the title,
abstract, introduction and conclusions.

→ Only process the top 30% and the last 20% of each document
This reduces the time by 57%!
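
Cropping a document to its first 30% and last 20% could look like the sketch below. Measuring the percentages in characters is an assumption; the slides do not say whether the cut was made by characters, words or sections.

```python
def crop(text, head=0.30, tail=0.20):
    """Keep the first `head` and the last `tail` fraction of the text (by characters)."""
    if head + tail >= 1.0:
        return text  # nothing to cut
    n = len(text)
    head_end = int(n * head)
    tail_start = n - int(n * tail)
    # A newline separates the two parts so no phrase is formed across the cut.
    return text[:head_end] + "\n" + text[tail_start:]

document = "a" * 30 + "b" * 50 + "c" * 20
print(len(crop(document)))  # 30 + 1 + 20 characters remain
```

In practice it is probably safer to cut at a sentence or paragraph boundary near those offsets, so that the first and last kept sentences are not truncated.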

Validate: Can we crop our documents?
Top N keywords in original doc   How many were found in the cropped doc
10                               91%
50                               80%
100                              75%
All                              64%

Top 20 keywords from*…
…original document            …cropped document
ontology                      ontology
knowledge base                knowledge base
knowledge representation      knowledge engineering
Semantic Web                  knowledge representation
WordNet                       WordNet
knowledge engineering         predicate logic
predicate logic               artificial intelligence
artificial intelligence       ontology engineering
semantic networks             semantic networks
natural language              Semantic Web
first-order logic             first-order logic
ontology engineering          block diagram
lexicon                       dynamic systems
conceptual graphs             higher-order logic
higher-order logic            conceptual graphs
natural language processing   modeling & simulation
design rationale              universe of discourse
block diagram                 bond graph
                              lexicon

* Toward principles for the design of ontologies used for knowledge sharing,
T. R. Gruber (1993)
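
A validation of this kind boils down to an overlap count: take the top N keywords of the original document and check how many also come out of the cropped one. A minimal sketch, with a hypothetical function name and case-insensitive exact matching as an assumption:

```python
def found_in_cropped(original_ranked, cropped_keywords, top_n=None):
    """Fraction of the top_n keywords from the original document
    that were also extracted from the cropped document."""
    top = original_ranked[:top_n] if top_n else list(original_ranked)
    if not top:
        return 0.0
    cropped = {k.lower() for k in cropped_keywords}
    hits = sum(1 for k in top if k.lower() in cropped)
    return hits / len(top)

original = ["ontology", "knowledge base", "Semantic Web", "lexicon"]
cropped = ["ontology", "semantic web", "bond graph"]
print(found_in_cropped(original, cropped, top_n=2))  # only the best two
print(found_in_cropped(original, cropped))           # all keywords
```

Running this for N = 10, 50, 100 and for all keywords over a sample of documents gives percentages of the kind reported on this slide.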

Step 3: Go cloud

Don’t be afraid to bring out the big guns

• Large Elastic Compute instance
1000 docs x 4 threads = 30 min

• High-CPU Extra Large (8 virtual cores)
1000 docs x 24 threads = 6 min

Also: increase the number of machines

• 4 machines = 4 times faster,
i.e. 50 instead of 200 hours (or 1 weekend!)
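
Running several threads per machine can be sketched with Python's standard library. `extract_keywords` below is a hypothetical stand-in for a real extractor or API call, not the Pingar API. In Python, threads mainly help when the extraction step waits on I/O (for example a remote API); for CPU-bound pure-Python work a process pool would be the analogous choice.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_keywords(text):
    """Hypothetical stand-in for a real extractor:
    returns the three longest distinct words."""
    words = sorted(set(text.lower().split()), key=lambda w: (-len(w), w))
    return words[:3]

def process_all(docs, workers=4):
    """Process documents concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_keywords, docs))

docs = ["text analytics on documents", "keyword extraction"]
print(process_all(docs, workers=2))
```

Splitting the document collection into equal shares and giving one share to each machine applies the same idea one level up, which is where the "4 machines = 4 times faster" estimate comes from.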

How long would a human
need to extract keywords
from 1.7M docs?

Min per doc   Min         Hours    Days*    Years**
1             1,700,000   28,333   3,542    14
2             3,400,000   56,666   7,083    28
3             5,100,000   85,000   10,625   42

* Taking into account 8h per working day
** Assuming 250 working days per year (no holidays, no sick days)
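
The figures in this table follow from simple arithmetic, reproduced below under the same assumptions as the footnotes (8 hours per working day, 250 working days per year). The function name is made up for the example.

```python
DOCS = 1_700_000

def human_effort(minutes_per_doc, docs=DOCS, hours_per_day=8, days_per_year=250):
    """Return (minutes, hours, working days, working years) for manual indexing."""
    minutes = minutes_per_doc * docs
    hours = minutes / 60
    days = hours / hours_per_day
    years = days / days_per_year
    return minutes, hours, days, years

for per_doc in (1, 2, 3):
    minutes, hours, days, years = human_effort(per_doc)
    print(per_doc, minutes, int(hours), round(days), int(years))
```

Even at one minute per document, a single person would need about 14 working years.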

http://www.flickr.com/photos/mararie/2663711551/

Document → Candidates → Properties → Scoring → Keywords

To estimate quality, take a sample and compute
inter-indexer consistency between several people

CiteSeer: 1.7 Million scientific publications, 110 GB

Can be done in a weekend:
1. Get time estimates
2. Look into your data
3. Go cloud

Don’t do it manually!

Keyword extraction: medelyan.com/files/phd2009.pdf
CiteSeer study: pingar.com/technical-blog/
Pingar API: apidemo.pingar.com
