Large Scale Topic Modeling
By Sameer Wadkar, Big Data Architect / Data Scientist
July 7th, 2013 © 2013 Axiomine LLC
What is Topic Modeling?
• The technique is called Latent Dirichlet Allocation (LDA)
• An excellent explanation is available in the blog article by Edwin Chen from Google: http://blog.echen.me/2011/06/27/topic-modeling-the-sarah-palin-emails/
• This presentation borrows heavily from that blog article to explain the basics of topic modeling
Brief Overview of LDA
• What can LDA do?
  • LDA extracts key topics and themes from a large corpus of text
  • Each topic is an ordered list of representative words (ordered by each word's importance to the topic)
  • LDA describes each document in the corpus by its allocation across the extracted topics
• LDA is an unsupervised learning technique
  • No extensive preparation is needed to create a training dataset
  • It is easy to apply for exploratory analysis
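For reference, the generative process behind LDA can be stated compactly in standard notation: each topic k is a distribution over words, each document d draws its own topic proportions, and each word position n in document d first draws a topic and then a word from that topic.

    \theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
    \phi_k \sim \mathrm{Dirichlet}(\beta), \qquad
    z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad
    w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})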
LDA – A Quick Example
Given the sentence "I listened to Justin Bieber and Lady Gaga on the radio while driving around in my car", an LDA model might represent it as 75% about music (a topic which contains the words Bieber, Gaga, and radio) and 25% about cars (a topic which contains the words driving and car).
Sarah Palin Email Corpus
• In June 2011, several thousand emails from Sarah Palin's time as governor of Alaska were released (http://sunlightfoundation.com/blog/2011/06/15/sarahs-inbox/)
• The emails were not organized in any form
• The Edwin Chen blog article discusses how LDA was used to organize these emails into categories discovered from the corpus itself
LDA Analysis Results
• LDA analysis of Sarah Palin's emails discovered the following topics (notice the ordered list of words for each topic):
• Wildlife / BP Corrosion: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, fishing, ...
• Energy / Fuel / Oil / Mining: energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, ...
• Trig / Family / Inspiration: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, ...
• Gas: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, ...
• Education / Waste: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, ...
• Presidential Campaign / Elections: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o., ...
LDA Sample from the Wildlife Topic
(Image slide: a sample email from the corpus that LDA assigned to the Wildlife topic.)
LDA Sample from Multiple Topics
(Image slide: a sample email spanning multiple topics.) LDA classification of the above email:

Topic                                Allocation Percentage
Presidential Campaign / Elections    10%
Wildlife                             90%
Types of Analysis LDA can perform
• Similarity Analysis
  • Which topics are similar?
  • Which documents are similar, based on their topic allocations?
  • LDA can distinguish business articles about "Mergers" from those about "Quarterly Earnings", which leads to more potent similarity analysis
  • LDA determines topic allocation based on the collocation of word groups, so "IBM" and "Microsoft" documents can be discovered to be similar if they discuss similar computing topics
  • Similarity analysis based on LDA is very accurate because LDA converts the high-dimensional, noisy space of word/document allocations into a low-dimensional space of topic/document allocations
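As a concrete illustration, document similarity in the low-dimensional topic space reduces to comparing topic-proportion vectors; a minimal sketch follows (the vectors below are toy values, not taken from the Palin corpus):

    // Minimal sketch: cosine similarity between two documents' topic-proportion vectors.
    public final class TopicSimilarity {
        public static double cosine(double[] topicsA, double[] topicsB) {
            double dot = 0.0, normA = 0.0, normB = 0.0;
            for (int k = 0; k < topicsA.length; k++) {
                dot   += topicsA[k] * topicsB[k];
                normA += topicsA[k] * topicsA[k];
                normB += topicsB[k] * topicsB[k];
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        public static void main(String[] args) {
            double[] docA = {0.75, 0.20, 0.05};  // toy allocation over 3 topics
            double[] docB = {0.70, 0.25, 0.05};
            System.out.println(cosine(docA, docB));  // close to 1.0, i.e. similar documents
        }
    }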
Brief Overview of LDA
• Topic Co-occurrence
  • Do certain topics occur together in documents?
  • Analysis of software resumes will reveal that "Object Oriented Language" skills typically co-occur with "SQL and RDBMS" skills
• Does topic co-occurrence change over time?
  • A resume corpus would reveal that "Java" skills were highly correlated with "Flash Development" skills in 2007. In 2013 the correlation has shifted to "Java" and "HTML5", but it is not as strong as the 2007 correlation, indicating that HTML5 is a more specialized skill
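One simple way to quantify topic co-occurrence (an illustrative approach, not necessarily the one used in the analysis above) is to mark a topic as present in a document when its proportion exceeds a threshold and then count how often pairs of topics are both present:

    // Illustrative sketch: count how often pairs of topics co-occur across documents,
    // where a topic "occurs" in a document if its proportion exceeds a threshold.
    public final class TopicCooccurrence {
        public static int[][] cooccurrenceCounts(double[][] docTopic, double threshold) {
            int numTopics = docTopic[0].length;
            int[][] counts = new int[numTopics][numTopics];
            for (double[] doc : docTopic) {
                for (int i = 0; i < numTopics; i++) {
                    if (doc[i] < threshold) continue;
                    for (int j = i + 1; j < numTopics; j++) {
                        if (doc[j] >= threshold) {
                            counts[i][j]++;
                            counts[j][i]++;
                        }
                    }
                }
            }
            return counts;
        }
    }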
Brief Overview of LDA
• Time-based Analysis
  • For a corpus that covers documents over time, do certain topics appear over time?
  • How does the appearance of new topics affect the distribution of other topics?
• Analysis of articles from the journal Science (1880-2002) reveals this process
  • http://topics.cs.princeton.edu/Science/
  • The topic browser is at http://topics.cs.princeton.edu/Science/browser/
  • A 75-topic model
  • Demonstrates how topics gain/lose prominence over time
  • Demonstrates how a topic's composition changes over time
  • Demonstrates how new topics appear
    • Ex.: "laser" made an appearance in its topic only in 1980
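A time-based view of this kind can be sketched with a simple post-hoc aggregation (illustrative only): average each document's topic proportions by publication year to track how a topic's prominence changes over time.

    // Illustrative sketch: average per-document topic proportions by year.
    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    public final class TopicTrends {
        // docYears[d] = publication year of document d; docTopic[d][k] = proportion of topic k in document d
        public static Map<Integer, double[]> averageByYear(int[] docYears, double[][] docTopic) {
            int numTopics = docTopic[0].length;
            Map<Integer, double[]> sums = new TreeMap<>();
            Map<Integer, Integer> counts = new HashMap<>();
            for (int d = 0; d < docYears.length; d++) {
                double[] sum = sums.computeIfAbsent(docYears[d], y -> new double[numTopics]);
                for (int k = 0; k < numTopics; k++) {
                    sum[k] += docTopic[d][k];
                }
                counts.merge(docYears[d], 1, Integer::sum);
            }
            for (Map.Entry<Integer, double[]> entry : sums.entrySet()) {
                int n = counts.get(entry.getKey());
                for (int k = 0; k < numTopics; k++) {
                    entry.getValue()[k] /= n;
                }
            }
            return sums;
        }
    }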
Example based on Sarah Palin's email corpus
• Analyze the emails which belong to the Trig/Family/Inspiration topic
• Spike in April 2008 – remarkably (for topic modeling) and unsurprisingly (for common sense), this was exactly the month Trig was born
• Topic modeling can discover such patterns in a large text corpus without requiring a human to read the entire corpus
Topic Modeling Toolkits
• Several open source options exist:

Library Name       Description
Mallet             MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
R Based Library    R-based library to perform topic modeling.
Apache Mahout      Big Data solution for topic modeling.

Why is a Big Data solution needed?
• Topic modeling is computationally expensive
  • It requires large amounts of memory
  • It requires considerable computational power
  • Memory is the bigger constraint
• Most implementations run out of memory when applied to even a modest number of documents (50,000 to 100,000 documents)
• If they do not run out of memory, they slow to a crawl due to frequent garbage collection (in Java-based environments)
• A Big Data based approach is needed!
Mahout for Big LDA
• Apache Mahout
  • Hadoop MapReduce based suite of machine learning procedures
  • Implements several machine learning routines based on Bayesian techniques (e.g., generative algorithms)
• Generative algorithms are iterative, and the iterations converge to a solution
  • Each iteration needs the results produced by the previous iteration, so iterations cannot be executed in parallel
  • Many iterations (a few thousand) are needed to converge to a solution
• Mahout uses MapReduce to parallelize a single iteration
  • Each iteration is a separate MapReduce job
  • Inter-iteration communication goes through HDFS, which leads to high I/O
  • The high I/O is compounded by the multi-iteration nature of the algorithm
• Mahout-based LDA
  • Each iteration is slower in order to accommodate the large memory requirements
  • Typically 1000 iterations are needed, so runs take too long; this is unsuitable for exploratory analysis
  • Fewer iterations lead to a sub-optimal solution
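To make the per-iteration overhead concrete, here is a schematic of the pattern being criticized (not Mahout's actual driver code): each iteration is launched as its own Hadoop job, and successive iterations communicate only through HDFS.

    // Schematic only: one Hadoop MapReduce job per iteration, with model state
    // written to and re-read from HDFS between iterations.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IterativeDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path state = new Path("hdfs:///lda/state-0");
            for (int i = 1; i <= 1000; i++) {               // ~1000 sequential iterations
                Job job = Job.getInstance(conf, "lda-iteration-" + i);
                job.setJarByClass(IterativeDriver.class);
                // Mapper/Reducer classes for a single sampling sweep would be configured here.
                FileInputFormat.addInputPath(job, state);   // read the previous state from HDFS
                Path next = new Path("hdfs:///lda/state-" + i);
                FileOutputFormat.setOutputPath(job, next);  // write the new state back to HDFS
                if (!job.waitForCompletion(true)) System.exit(1);
                state = next;                               // job startup and I/O cost paid on every iteration
            }
        }
    }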
Parallel LDA based on Mallet
• Parallel LDA in Mallet is based on:
  • Newman, Asuncion, Smyth and Welling, "Distributed Algorithms for Topic Models", JMLR (2009), with the SparseLDA sampling scheme and data structures from Yao, Mimno and McCallum, "Efficient Methods for Topic Model Inference on Streaming Document Collections", KDD (2009)
• Still memory intensive
  • A large corpus leads to frequent garbage collection
• Executing Mallet's ParallelTopicModel on an 8 GB, Intel i7 quad-core machine over 500,000 US patent abstracts takes 400 minutes of processing for 1000 iterations
• The application makes no progress for 1 million patents and eventually runs out of memory or stalls due to frequent garbage collection
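As a reference point for the numbers above, a minimal sketch of driving Mallet's ParallelTopicModel, following Mallet's documented import pipeline; the input file name, topic count, and thread count below are illustrative placeholders:

    import cc.mallet.pipe.*;
    import cc.mallet.pipe.iterator.CsvIterator;
    import cc.mallet.topics.ParallelTopicModel;
    import cc.mallet.types.InstanceList;

    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.regex.Pattern;

    public class MalletLdaExample {
        public static void main(String[] args) throws Exception {
            // Standard Mallet import pipeline: lowercase, tokenize, map tokens to feature indices
            ArrayList<Pipe> pipes = new ArrayList<>();
            pipes.add(new CharSequenceLowercase());
            pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
            pipes.add(new TokenSequence2FeatureSequence());

            InstanceList instances = new InstanceList(new SerialPipes(pipes));
            // Input file: one document per line as "id<TAB>label<TAB>text" (illustrative path)
            instances.addThruPipe(new CsvIterator(new FileReader("patent-abstracts.tsv"),
                    Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"), 3, 2, 1));

            ParallelTopicModel model = new ParallelTopicModel(600, 1.0, 0.01); // 600 topics, alphaSum, beta
            model.addInstances(instances);
            model.setNumThreads(4);
            model.setNumIterations(1000);
            model.estimate();

            double[] topicsForDoc0 = model.getTopicProbabilities(0); // per-document topic allocations
            System.out.println("Number of topics: " + topicsForDoc0.length);
        }
    }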
Axiomine Solution – Big LDA without Hadoop
• MapReduce is unsuitable for LDA-type algorithms
  • Hadoop is complex and unsuited for ad-hoc analysis
  • The large number of sequential iterations only allows MapReduce to be used at the iteration level, which leads to too many short MapReduce jobs
• Large scale LDA without Big Data
  • LDA is a memory intensive process
  • Off-heap memory based on Java NIO allows processes to use memory without incurring a GC penalty
    • The trade-off is slightly lower performance
  • Exploit OS page-caching to use off-heap memory
  • LDA operates on text data, but storing text is orders of magnitude more expensive than storing numbers
    • Massive off-heap indexes which map words to numbers allow a significant lowering of memory usage
  • Reorganizing the Mallet implementation steps achieved significant performance gains and memory savings
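A minimal illustration of the off-heap idea (a sketch of the general technique, not Axiomine's implementation): keep the bulk numeric state in a direct NIO buffer outside the Java heap, with a small on-heap dictionary mapping words to integer ids.

    // Minimal illustration (not Axiomine's code): word-to-id dictionary on the heap,
    // bulk numeric state (e.g., the token stream as integer ids) in off-heap memory
    // allocated via Java NIO, so it never contributes to GC pressure.
    import java.nio.ByteBuffer;
    import java.nio.IntBuffer;
    import java.util.HashMap;
    import java.util.Map;

    public class OffHeapTokenStore {
        private final Map<String, Integer> wordIds = new HashMap<>();
        private final IntBuffer tokens;   // backed by off-heap (direct) memory

        public OffHeapTokenStore(int maxTokens) {
            // Direct buffers are allocated outside the Java heap
            this.tokens = ByteBuffer.allocateDirect(maxTokens * Integer.BYTES).asIntBuffer();
        }

        /** Intern a word and append its integer id to the off-heap token stream. */
        public void addToken(String word) {
            int id = wordIds.computeIfAbsent(word, w -> wordIds.size());
            tokens.put(id);
        }

        public int tokenAt(int position) {
            return tokens.get(position);
        }
    }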
Axiomine Solution – Performance Numbers

Machine Type                                   Corpus                                                       Performance
Single 8 GB, Intel i7 quad-core machine        500,000 US patent abstracts, 600 topics                      1000 iterations completed in 2 hours
Amazon AWS hs1.8xlarge machine                 2.1 million US patent abstracts, 600 topics, 5 CPU threads   1000 iterations completed in approximately 5 hours
(http://aws.amazon.com/ec2/instance-types/)
Conclusion – Large Scale Analysis without Big Data
• The Axiomine LDA implementation has the following benefits:
  • Scaling is practically linear, unlike other implementations
  • Each iteration takes between 7 and 15 seconds
  • We contemplated Apache Hama to achieve parallelism without incurring the disk I/O cost of Hadoop MapReduce, but network I/O would ensure worse intra-iteration performance than we could achieve on a single machine!
• Big Topic Modeling without Big Data!
• At Axiomine we intend to port more such popular algorithms, based on the lessons learned while porting LDA
• We want to bring large scale exploratory analysis at low complexity
July 7th, 2013 © 2013 Axiomine LLC