Large scale topic modeling

By Sameer Wadkar
Big Data Architect / Data Scientist
July 7th, 2013 © 2013 Axiomine LLC

What is Topic Modeling
• The technique is called Latent Dirichlet Allocation (LDA)
• An excellent explanation is available in the following blog
article by Edwin Chen from Google
(http://blog.echen.me/2011/06/27/topic-modeling-the-sarah-palin-emails/)
• This presentation borrows heavily from that blog article to
explain the basics of Topic Modeling

Brief Overview of LDA
• What can LDA do?
• LDA extracts key topics and themes from a large corpus of
text
• Each topic is an ordered list of representative words (the order is
based on the importance of each word to the topic)
• LDA describes each document in the corpus based on
allocation to the extracted topics.
• It is an Unsupervised Learning Technique
• No extensive preparation needed to create a training dataset
• Easy to apply for exploratory analysis

LDA – A Quick Example
Given the sentence “I listened to Justin Bieber and Lady Gaga on the
radio while driving around in my car”, an LDA model might
represent it as 75% about music (a topic which contains the
words Bieber, Gaga, radio) and 25% about cars (a topic which
contains the words driving and cars).

Sarah Palin Email Corpus
• Sarah Palin Email Corpus
• In June 2011 several thousand emails from Sarah Palin’s
time as governor of Alaska were released
(http://sunlightfoundation.com/blog/2011/06/15/sarahs-inbox/)
• The emails were not organized in any way
• The Edwin Chen blog article discusses how LDA was used
to organize these emails into categories discovered from the
corpus itself.

LDA Analysis Results
• LDA analysis of Sarah Palin’s emails discovered the
following topics (notice the ordered list of words)
• Wildlife/BP Corrosion: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, fishing, …
• Energy/Fuel/Oil Mining: energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, …
• Trig/Family/Inspiration: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, …
• Gas: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, …
• Education/Waste: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, …
• Presidential Campaign/Elections: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o, …

LDA Sample from Wildlife topic

LDA Sample from multiple topics
LDA classification of above email
Topic | Allocation Percentage
Presidential Campaign/Elections | 10%
Wildlife | 90%

Types of Analysis LDA can perform
• Similarity Analysis
• Which topics are similar?
• Which documents are similar based on Topic Allocations?
• LDA can distinguish between business articles related to “Mergers”
and those related to “Quarterly Earnings”, which leads to more
potent Similarity Analysis
• LDA determines Topic Allocation based on collocation of word
groups. Hence “IBM” and “Microsoft” documents can be discovered
to be similar if they talk about similar computing topics
• Similarity Analysis based on LDA is very accurate because
LDA converts the high-dimensional and noisy space of
Word/Document allocations into a low-dimensional space of
Topic/Document allocations.
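
To make the document similarity idea concrete, here is a minimal sketch that compares documents by the cosine similarity of their topic-allocation vectors. The document names, the three topics and all the percentages are hypothetical values chosen for illustration; they are not output from a real LDA run, and cosine similarity is only one of several reasonable measures.

```python
import math

def cosine_similarity(p, q):
    """Cosine similarity between two topic-allocation vectors of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)

# Hypothetical allocations over three topics: [computing, mergers, earnings]
doc_topics = {
    "ibm_cloud_article": [0.80, 0.15, 0.05],
    "microsoft_cloud_article": [0.75, 0.05, 0.20],
    "bank_merger_article": [0.05, 0.85, 0.10],
}

def most_similar(name, allocations):
    """Return the other document whose topic allocation is closest to `name`."""
    target = allocations[name]
    scored = [(cosine_similarity(target, vec), other)
              for other, vec in allocations.items() if other != name]
    return max(scored)[1]

closest = most_similar("ibm_cloud_article", doc_topics)
```

Because each document is reduced to a handful of topic weights, the comparison runs over a few numbers per document instead of a vocabulary-sized word vector.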

• Topic Co-occurrence
• Do certain topics occur together in documents?
• Analysis of software resumes will reveal that “Object Oriented
Language” skills typically co-occur with “SQL and RDBMS” skills
• Does Topic Co-occurrence change with time?
• A resume corpus would reveal that “Java” skills were highly correlated
with “Flash Development” skills in 2007. In 2013 the correlation has
shifted to “Java” and “HTML5”, but it is not as strong as in 2007, indicating
that HTML5 is a more specialized skill
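
One simple way to measure topic co-occurrence is to count how many documents give two topics a meaningful share at the same time. The sketch below assumes each document is already described by a topic-allocation dictionary; the resume data, the topic names and the 0.2 threshold are hypothetical choices for illustration.

```python
from collections import Counter
from itertools import combinations

def topic_cooccurrence(doc_topics, threshold=0.2):
    """Count, per topic pair, the documents in which both topics have a share of at least `threshold`."""
    counts = Counter()
    for allocation in doc_topics:
        present = sorted(t for t, share in allocation.items() if share >= threshold)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

# Hypothetical topic allocations for three resumes
resumes = [
    {"java": 0.5, "sql": 0.4, "html5": 0.1},
    {"java": 0.6, "sql": 0.3, "html5": 0.1},
    {"java": 0.3, "html5": 0.6, "sql": 0.1},
]
pairs = topic_cooccurrence(resumes)
```

Running the same count separately on the documents of each year would show how the co-occurrence pattern shifts over time.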

• Time based Analysis
• For a corpus which covers documents over time, do certain topics
appear over time?
• How does the appearance of new topics affect the distribution of other topics?
• Analysis of articles from the journal Science (1880-2002)
reveals this process
• http://topics.cs.princeton.edu/Science/
• The browser is at http://topics.cs.princeton.edu/Science/browser/
• 75 topic model
• Demonstrates how Topics gain/lose prominence over time
• Demonstrates how a Topic's composition changes over time
• Demonstrates how new Topics appear
• For example, “laser” made an appearance in its topic only in 1980

Example based on Sarah Palin’s email corpus
• Analyze emails which belong to the Trig/Family/Inspiration
topic
• Spike in April 2008: remarkably (for Topic Modeling) and
unsurprisingly (for common sense), this was exactly the month
Trig was born.
• Topic Modeling can discover such patterns from a large Text
Corpus without requiring a human to read the entire corpus.
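
A spike like this can be found by averaging a topic's allocation per month and looking for the peak. The following is a minimal sketch under the assumption that every email carries a month label and a topic-allocation dictionary; the five emails and their percentages are made up for illustration and are not taken from the actual corpus.

```python
from collections import defaultdict

def monthly_topic_share(docs, topic):
    """Average allocation to `topic` per month; docs are (YYYY-MM, allocation dict) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for month, allocation in docs:
        totals[month] += allocation.get(topic, 0.0)
        counts[month] += 1
    return {month: totals[month] / counts[month] for month in totals}

# Hypothetical emails with their topic allocations
emails = [
    ("2008-03", {"family": 0.10, "gas": 0.90}),
    ("2008-03", {"family": 0.20, "gas": 0.80}),
    ("2008-04", {"family": 0.90, "gas": 0.10}),
    ("2008-04", {"family": 0.70, "gas": 0.30}),
    ("2008-05", {"family": 0.30, "gas": 0.70}),
]
shares = monthly_topic_share(emails, "family")
peak_month = max(shares, key=shares.get)
```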

Topic Modeling Toolkits
• Several Open Source Options exist
Library Name | Description
Mallet | MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
R-based library | R-based library to perform Topic Modeling
Apache Mahout | Big Data solution for Topic Modeling

Why is a Big Data solution needed?
• Topic Modeling is computationally expensive
• Requires large amounts of memory
• Requires considerable computational power
• Memory is the bigger constraint
• Most implementations run out of memory when applied to even a
modest number of documents (50,000 to 100,000 documents)
• If they do not run out of memory, they slow down to a crawl due to
frequent Garbage Collection (in a Java based environment)
• A Big Data based approach is needed!

Mahout for Big LDA
• Apache Mahout
• Hadoop MapReduce based suite of Machine Learning procedures
• Implements several Machine Learning routines which are based on
Bayesian techniques (Ex. Generative Algorithms)
• Generative Algorithms are iterative and iterations converge to a solution
• Each iteration needs the results produced by the previous iteration.
Hence Iterations cannot be executed in parallel
• Several iterations (a few thousand) are needed to converge to a
solution
• Mahout uses Map-Reduce to parallelize a single iteration
• Each iteration is a separate Map-Reduce job
• Inter-iteration communication goes through HDFS, which leads to high I/O
• The high I/O is compounded by the multi-iteration nature of the algorithm
• Mahout based LDA
• Each iteration is slower to accommodate large memory requirements
• Typically 1000 iterations are needed, which takes too long to run
and makes it unsuitable for exploratory analysis
• Fewer iterations lead to a sub-optimal solution
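
The sequential dependency between iterations can be seen in a minimal collapsed Gibbs sampler for LDA. This is a generic textbook-style sketch, not the code used by Mahout or Mallet; the function name, the hyperparameter defaults (`alpha`, `beta`) and the toy documents are assumptions made for illustration.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # tokens of word w in topic k
    topic_total = [0] * num_topics                               # tokens in topic k
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)
    # Every sweep reads the counts left behind by the previous sweep,
    # so the sweeps themselves have to run one after another.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Toy corpus: word ids 0-2 and 3-5 tend to appear in different documents
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 1], [4, 3, 5, 5]]
doc_topic, topic_word = gibbs_lda(toy_docs, num_topics=2, vocab_size=6)
```

Dividing each row of `doc_topic` by the document length gives that document's topic allocation. Any parallel scheme has to split the work inside a single sweep, since the next sweep cannot start until the counts are updated.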

Parallel LDA based on Mallet
• The Parallel LDA in Mallet is based on
• Newman, Asuncion, Smyth and Welling, Distributed
Algorithms for Topic Models, JMLR (2009), with the SparseLDA
sampling scheme and data structure from Yao, Mimno and
McCallum, Efficient Methods for Topic Model Inference on
Streaming Document Collections, KDD (2009)
• Still memory intensive
• A large corpus leads to frequent Garbage Collection
• Running Mallet's ParallelTopicModel on an 8 GB, Intel i7 quad-core
machine over 500,000 US Patent abstracts took 400
minutes of processing for 1000 iterations.
• The application makes no progress for 1 Million Patents and
eventually runs out of memory or stalls due to frequent
Garbage Collection

Axiomine Solution – Big LDA without Hadoop
• Map-Reduce is unsuitable for LDA type Algorithms
• Hadoop is complex and unsuited for ad-hoc analysis
• Large number of sequential iterations only allows Map-Reduce to be
used at Iteration level. Leads to too many short Map-Reduce jobs
• Large scale LDA without Big Data
• LDA is a memory intensive process
• Off-heap memory based on Java NIO allows processes to use
memory without incurring a GC penalty.
• The trade-off is slightly lower performance
• Exploit OS page caching to use off-heap memory
• LDA operates on text data, but storing text is orders of magnitude
more expensive than storing numbers
• Massive off-heap memory based indexes which map words to
numbers significantly lower memory usage
• Reorganizing the Mallet implementation steps achieved significant
performance gains and memory savings
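
The two ideas above, mapping words to numbers and keeping the data outside the managed heap, can be sketched in a few lines. This example uses Python's `mmap` module as a stand-in for Java NIO memory-mapped buffers; it is not the Axiomine implementation, and the helper names and the file layout (one little-endian 4-byte id per token) are assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile

def build_vocabulary(token_docs):
    """Assign each distinct word a small integer id, in order of first appearance."""
    vocab = {}
    for doc in token_docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab

def write_token_ids(token_docs, vocab, path):
    """Write every token as a 4-byte unsigned id; return the number of tokens written."""
    count = 0
    with open(path, "wb") as f:
        for doc in token_docs:
            for word in doc:
                f.write(struct.pack("<I", vocab[word]))
                count += 1
    return count

def read_token_id(mapped, index):
    """Read the id of token `index` directly from the memory-mapped file."""
    return struct.unpack_from("<I", mapped, index * 4)[0]

docs = [["gas", "pipeline", "gas"], ["moose", "wildlife", "gas"]]
vocab = build_vocabulary(docs)
path = os.path.join(tempfile.mkdtemp(), "tokens.bin")
n_tokens = write_token_ids(docs, vocab, path)

# The OS page cache backs the mapping, so the ids are not held as Python objects.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
    decoded = [read_token_id(mapped, i) for i in range(n_tokens)]
```

Each token costs a fixed 4 bytes on disk and in the page cache regardless of word length, and the sampler only ever needs the integer ids.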

Axiomine Solution – Performance Numbers
Machine Type | Corpus | Performance
Single 8 GB, Intel i7 quad-core machine | 500,000 US Patent Abstracts, 600 | 1000 iterations completed in 2 hours
Amazon AWS hs1.8xlarge machine (http://aws.amazon.com/ec2/instance-types/) | 2.1 Million US Patent Abstracts, 600 topics using 5 CPU threads | 1000 iterations completed in approximately 5 hours
• High Points
• Scaling is practically linear unlike other implementations
• Each iteration takes between 7-15 seconds
• We contemplated Apache HAMA to achieve parallelism without
incurring the disk I/O cost of Hadoop Map-Reduce
• However, network I/O would make intra-iteration performance worse
than what we could achieve on a single machine!
• Big Topic Modeling without Big Data!!
• At Axiomine we intend to port more such popular algorithms based
on lessons learned while porting LDA
• We want to bring Large Scale Exploratory Analysis at low complexity

Conclusion – Large Scale Analysis without Big Data
• The Axiomine LDA implementation has the following
benefits
• Scaling is practically linear unlike other implementations
• Each iteration takes between 7-15 seconds
• We contemplated Apache HAMA to achieve parallelism without
incurring the disk I/O cost of Hadoop Map-Reduce
• However, network I/O would make intra-iteration performance worse
than what we could achieve on a single machine!
• Big Topic Modeling without Big Data!!
• At Axiomine we intend to port more such popular algorithms based
on lessons learned while porting LDA
• We want to bring Large Scale Exploratory Analysis at low complexity
