Large Scale Topic Modeling
By - Sameer Wadkar
Big Data Architect / Data Scientist
July 7th, 2013 © 2013 Axiomine LLC
What is Topic Modeling
• Technique is called Latent Dirichlet Allocation (LDA)
• An excellent explanation is available in the following blog
article by Edwin Chen from Google
(http://blog.echen.me/2011/06/27/topic-modeling-the-sarah-palin-emails/)
• This presentation borrows heavily from the blog article to
explain the basics of Topic Modeling
Brief Overview of LDA
• What can LDA do?
• LDA extracts key topics and themes from a large corpus of
text
• Each topic is an ordered list of representative words (ordered by
the importance of each word to the topic)
• LDA describes each document in the corpus based on
allocation to the extracted topics.
• It is an Unsupervised Learning Technique
• No extensive preparation needed to create a training dataset
• Easy to apply for exploratory analysis
LDA – A Quick Example
Given the sentence “I listened to Justin Bieber and Lady Gaga on the
radio while driving around in my car”, an LDA model might
represent it as 75% about music (a topic which
contains the words Bieber, Gaga, radio) and 25% about cars (a
topic which contains the words driving and cars).
Sarah Palin Email Corpus
• Sarah Palin Email Corpus
• In June 2011 several thousand emails from Sarah Palin’s
time as governor of Alaska were released
(http://sunlightfoundation.com/blog/2011/06/15/sarahs-inbox/)
• Emails were not organized in any form
• The Edwin Chen blog article discusses how LDA was used
to organize these emails into categories discovered from the
Email Corpus.
LDA Analysis Results
• LDA Analysis of Sarah Palin’s emails discovered the
following topics (notice the ordered list of words)
• Wildlife/BP Corrosion: game, fish, moose, wildlife, hunting, bears, polar, bear, subsistence, management, area, board, hunt, wolves, control, department, year, use, wolf, habitat, hunters, caribou, program, fishing, …
• Energy/Fuel/Oil Mining: energy, fuel, costs, oil, alaskans, prices, cost, nome, now, high, being, home, public, power, mine, crisis, price, resource, need, community, fairbanks, rebate, use, mining, villages, …
• Trig/Family/Inspiration: family, web, mail, god, son, from, congratulations, children, life, child, down, trig, baby, birth, love, you, syndrome, very, special, bless, old, husband, years, thank, best, …
• Gas: gas, oil, pipeline, agia, project, natural, north, producers, companies, tax, company, energy, development, slope, production, resources, line, gasline, transcanada, said, billion, plan, administration, million, industry, …
• Education/Waste: school, waste, education, students, schools, million, read, email, market, policy, student, year, high, news, states, program, first, report, business, management, bulletin, information, reports, 2008, quarter, …
• Presidential Campaign/Elections: mail, web, from, thank, you, box, mccain, sarah, very, good, great, john, hope, president, sincerely, wasilla, work, keep, make, add, family, republican, support, doing, p.o, …
LDA Sample from Wildlife topic
LDA Sample from multiple topics
LDA classification of above email
Topic | Allocation Percentage
Presidential Campaign/Elections | 10%
Wildlife | 90%
Types of Analysis LDA can perform
• Similarity Analysis
• Which topics are similar?
• Which documents are similar based on Topic Allocations?
• LDA can distinguish business articles related to “Mergers”
from those related to “Quarterly Earnings”, which leads to more
potent Similarity Analysis
• LDA determines Topic Allocation based on co-occurrence of word
groups. Hence “IBM” and “Microsoft” documents can be discovered
to be similar if they talk about similar computing topics
• Similarity Analysis based on LDA is very accurate because
LDA converts the high-dimensional and noisy space of
Word/Document allocations into a low-dimensional space of
Topic/Document allocations.
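A minimal sketch of document similarity computed on Topic Allocations rather than on raw words. The Hellinger distance is one common choice for comparing probability distributions; the slides do not say which measure was used, and the document names and allocation values below are made up for illustration.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two topic distributions.

    0.0 means identical allocations, 1.0 means no topic in common.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)

# Hypothetical allocations over three topics:
# [cloud computing, quarterly earnings, mergers]
docs = {
    "ibm_cloud": [0.70, 0.20, 0.10],
    "microsoft_cloud": [0.65, 0.25, 0.10],
    "merger_news": [0.05, 0.15, 0.80],
}

def most_similar(name, docs):
    """Return the name of the document whose allocation is closest to `name`."""
    target = docs[name]
    candidates = [
        (hellinger(target, vec), other)
        for other, vec in docs.items()
        if other != name
    ]
    return min(candidates)[1]

print(most_similar("ibm_cloud", docs))
```

Each document is compared on a handful of topic weights instead of thousands of word counts, which is the dimensionality reduction the last bullet refers to.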
Brief Overview of LDA
• Topic Co-occurrence
• Do certain topics occur together in documents?
• Analysis of software resumes will reveal that “Object Oriented
Language” skills typically co-occur with “SQL and RDBMS skills”
• Does Topic Co-occurrence change with time?
• Resume corpus would reveal that “Java” skills were highly correlated
with “Flash Development” skills in 2007. In 2013 the correlation has
shifted to “Java” and “HTML5”, but it is weaker than the 2007
correlation, indicating that HTML5 is a more specialized skill
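A minimal sketch of how Topic Co-occurrence could be counted once every document has a Topic Allocation. The threshold of 0.2, the topic labels and the resume data are all assumptions made for illustration; they are not taken from the presentation.

```python
from collections import Counter
from itertools import combinations

def topic_cooccurrence(doc_topics, threshold=0.2):
    """Count how often two topics are both prominent in the same document.

    `doc_topics` is a list of {topic: weight} allocations. A topic counts
    as present when its weight is at least `threshold` (an assumed cutoff).
    """
    counts = Counter()
    for allocation in doc_topics:
        present = sorted(t for t, w in allocation.items() if w >= threshold)
        counts.update(combinations(present, 2))
    return counts

# Hypothetical resume allocations.
resumes = [
    {"oop": 0.50, "sql": 0.40, "web": 0.10},
    {"oop": 0.60, "sql": 0.30, "web": 0.10},
    {"oop": 0.10, "sql": 0.10, "web": 0.80},
    {"oop": 0.45, "web": 0.55},
]

pairs = topic_cooccurrence(resumes)
print(pairs.most_common())
```

Running the same count separately per year (for example on 2007 and 2013 resumes) is one way to see whether co-occurrence shifts over time.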
Brief Overview of LDA
• Time based Analysis
• For a corpus which covers documents over time, do certain topics
appear over time?
• How does the appearance of new topics affect the distribution of other topics?
• Analysis of articles from the journal Science (1880-2002)
reveals this process
• http://topics.cs.princeton.edu/Science/
• The Browser is at http://topics.cs.princeton.edu/Science/browser/
• 75 topic model
• Demonstrates how Topics gain/lose prominence over time
• Demonstrates how a Topic's composition changes over time
• Demonstrates how new Topics appear
• Ex. Laser made an appearance in its topic only in 1980
Example based on Sarah Palin’s email corpus
• Analyze emails which belong to the Trig/Family/Inspiration
topic
• Spike in April 2008 – Remarkably (for Topic Modeling) and
Unsurprisingly (for common sense), this was exactly the month
Trig was born.
• Topic Modeling can discover such patterns from a large Text
Corpus without requiring a human to read the entire corpus.
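A minimal sketch of the kind of aggregation that surfaces such a spike: average the allocation of one topic per month and pick the month with the highest share. The month labels and weights below are toy values invented for illustration; they are not figures from the Palin corpus.

```python
from collections import defaultdict

def monthly_topic_share(emails, topic):
    """Average allocation of `topic` per month.

    `emails` is a list of (month, {topic: weight}) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for month, allocation in emails:
        totals[month] += allocation.get(topic, 0.0)
        counts[month] += 1
    return {month: totals[month] / counts[month] for month in totals}

# Toy allocations (hypothetical).
emails = [
    ("2008-03", {"family": 0.10, "gas": 0.90}),
    ("2008-03", {"family": 0.20, "gas": 0.80}),
    ("2008-04", {"family": 0.80, "gas": 0.20}),
    ("2008-04", {"family": 0.60, "gas": 0.40}),
    ("2008-05", {"family": 0.30, "gas": 0.70}),
]

share = monthly_topic_share(emails, "family")
peak_month = max(share, key=share.get)
print(peak_month, round(share[peak_month], 2))
```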
Topic Modeling Toolkits
• Several Open Source Options exist
Library Name | Description
Mallet | MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
R Based Library | R based library to perform Topic Modeling
Apache Mahout | Big Data solution for Topic Modeling
• Why is it needed?
• Topic Modeling is computationally expensive
• Requires large amounts of memory
• Requires considerable computational power
• Memory is the bigger constraint
• Most implementations run out of memory when applied to even a
modest number of documents (50,000 to 100,000 documents)
• If they do not run out of memory they slow down to a crawl due to
frequent Garbage Collection (in Java based environment)
• A Big Data based approach is needed!
Mahout for Big LDA
• Apache Mahout
• Hadoop MapReduce based suite of Machine Learning procedures
• Implements several Machine Learning routines which are based on
Bayesian techniques (Ex. Generative Algorithms)
• Generative Algorithms are iterative and iterations converge to a solution
• Each iteration needs the results produced by the previous iteration.
Hence Iterations cannot be executed in parallel
• Several iterations (a few thousand) are needed to converge to a
solution
• Mahout uses Map-Reduce to parallelize a single iteration
• Each iteration is a separate Map-Reduce job
• Inter-Iteration communication using HDFS. Leads to high I/O
• High I/O compounded by multi-iteration nature
• Mahout based LDA
• Each iteration is slower to accommodate large memory requirements
• Typically 1000 iterations are needed, so a run takes too long and
is unsuitable for exploratory analysis
• Fewer iterations lead to a sub-optimal solution
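To illustrate why iterations cannot run in parallel, here is a minimal sketch of a textbook-style collapsed Gibbs sampler for LDA. It is a generic illustration, not Mahout's or Mallet's implementation; the function name, the hyperparameters `alpha` and `beta`, and the toy corpus are assumptions. Every sweep reads and updates the counts left behind by the previous sweep, which is the sequential dependency described above.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, iterations=200,
              alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]            # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                          # tokens per topic
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_d)

    # Each sweep depends on the counts produced by the previous sweep.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word

# Toy corpus of word ids (hypothetical): vocabulary of 5 words, 2 topics.
toy_docs = [[0, 1, 0, 2], [3, 4, 3], [0, 1, 3, 4]]
doc_topic, topic_word = gibbs_lda(toy_docs, num_topics=2, vocab_size=5)
print(doc_topic)
```

The work inside one sweep can be split across workers with some approximation, but sweep n+1 cannot start before sweep n has finished, so a MapReduce design ends up with one job per iteration.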
Parallel LDA based on Mallet
• A Parallel LDA in Mallet is based on
• Newman, Asuncion, Smyth and Welling, Distributed
Algorithms for Topic Models JMLR (2009), with SparseLDA
sampling scheme and data structure from Yao, Mimno and
McCallum, Efficient Methods for Topic Model Inference on
Streaming Document Collections, KDD (2009)
• Still memory intensive
• Large corpus leads to frequent Garbage Collection
• Executing Mallet ParallelTopicModel on an 8 GB, Intel i7 quad-core
machine on 500,000 US Patent abstracts took 400
minutes of processing for 1000 iterations.
• The application makes no progress for 1 Million Patents and
eventually runs out of memory or stalls due to frequent
Garbage Collection
Axiomine Solution – Big LDA without Hadoop
• Map-Reduce is unsuitable for LDA type Algorithms
• Hadoop is complex and unsuited for ad-hoc analysis
• Large number of sequential iterations only allows Map-Reduce to be
used at Iteration level. Leads to too many short Map-Reduce jobs
• Large scale LDA without Big Data
• LDA is a memory intensive process
• Off-Heap memory based on Java NIO allows processes to use
memory without incurring GC penalty.
• Trade-off is slightly lower performance
• Exploit the OS page-caching to use off-heap memory
• LDA operates on Text data, but storing text is orders of magnitude
more expensive than storing numbers
• Massive off-heap memory based indexes which map words to
numbers significantly lower memory usage
• Reorganizing the Mallet implementation steps achieved significant
performance gains and memory savings
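A minimal sketch of the two ideas above, written in Python as an analog rather than as the Axiomine implementation (which, per the slides, uses Java NIO). Words are mapped to integer ids, and the ids are stored as fixed-width 32-bit values in an anonymous memory map, which lives outside the interpreter's object heap in the same spirit as a Java direct or mapped buffer. The class names `Vocabulary` and `TokenBuffer` are hypothetical, and in this sketch the word-to-id dictionary itself stays on the ordinary heap.

```python
import mmap
import struct

class Vocabulary:
    """Maps each distinct word to a small integer id."""

    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = []

    def encode(self, word):
        wid = self.word_to_id.get(word)
        if wid is None:
            wid = len(self.id_to_word)
            self.word_to_id[word] = wid
            self.id_to_word.append(word)
        return wid

class TokenBuffer:
    """Fixed-capacity array of 32-bit ints backed by an anonymous memory map."""

    ITEM = struct.Struct("<i")  # 4 bytes per token id

    def __init__(self, capacity):
        self.capacity = capacity
        self.size = 0
        self.buf = mmap.mmap(-1, capacity * self.ITEM.size)

    def append(self, value):
        if self.size >= self.capacity:
            raise IndexError("buffer full")
        self.ITEM.pack_into(self.buf, self.size * self.ITEM.size, value)
        self.size += 1

    def __getitem__(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        return self.ITEM.unpack_from(self.buf, index * self.ITEM.size)[0]

    def close(self):
        self.buf.close()

vocab = Vocabulary()
tokens = "oil gas oil pipeline".split()
buffer = TokenBuffer(len(tokens))
for token in tokens:
    buffer.append(vocab.encode(token))

ids = [buffer[i] for i in range(buffer.size)]
print(ids, [vocab.id_to_word[i] for i in ids])
```

Each token costs exactly 4 bytes in the buffer regardless of word length, and the buffer is not scanned by the garbage collector, which is the trade described in the bullets.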
Axiomine Solution – Performance Numbers
Machine Type | Corpus | Performance
Single 8 GB, Intel i7 quad-core machine | 500,000 US Patent Abstracts, 600 topics | 1000 iterations completed in 2 hours
Amazon AWS hs1.8xlarge machine (http://aws.amazon.com/ec2/instance-types/) | 2.1 Million US Patent Abstracts, 600 topics using 5 CPU threads | 1000 iterations completed in approximately 5 hours
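As a quick arithmetic check on the table, the reported run times can be converted to seconds per iteration and compared with the 400 minutes reported earlier for Mallet ParallelTopicModel on the same 500,000-abstract corpus. The helper name is hypothetical, the 5-hour figure is approximate in the source, and the slides do not state the topic count of the Mallet run, so the speedup is indicative only.

```python
def seconds_per_iteration(total_hours, iterations):
    """Average wall-clock seconds spent on one iteration."""
    return total_hours * 3600.0 / iterations

# 500,000 abstracts on the 8 GB quad-core machine: 1000 iterations in 2 hours.
single_machine = seconds_per_iteration(2, 1000)

# 2.1 million abstracts on the hs1.8xlarge: 1000 iterations in about 5 hours.
aws_machine = seconds_per_iteration(5, 1000)

# Mallet on the 500,000-abstract corpus: 400 minutes versus 120 minutes here.
speedup_vs_mallet = 400 / (2 * 60)

print(single_machine, aws_machine, round(speedup_vs_mallet, 2))
```

This gives about 7.2 seconds per iteration on the single machine, about 18 seconds per iteration on the larger corpus, and a speedup of roughly 3.3x over the Mallet run. The two rows use different hardware, so they are not a like-for-like scaling measurement.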
• High Points
• Scaling is practically linear unlike other implementations
• Each iteration takes between 7 and 15 seconds
• We contemplated Apache HAMA to achieve parallelism without
incurring the disk I/O cost of Hadoop Map-Reduce
• But Network I/O will ensure worse intra-iteration performance than
we could achieve on a single machine!
• Big Topic Modeling without Big Data!!
• At Axiomine we intend to port more such popular Algorithms based
on lessons learned while porting LDA
• We want to bring Large Scale Exploratory Analysis at low complexity
Conclusion – Large Scale Analysis without Big Data
• The Axiomine LDA implementation has the following
benefits
• Scaling is practically linear unlike other implementations
• Each iteration takes between 7 and 15 seconds
• We contemplated Apache HAMA to achieve parallelism without
incurring the disk I/O cost of Hadoop Map-Reduce
• But Network I/O will ensure worse intra-iteration performance than
we could achieve on a single machine!
• Big Topic Modeling without Big Data!!
• At Axiomine we intend to port more such popular Algorithms based
on lessons learned while porting LDA
• We want to bring Large Scale Exploratory Analysis at low complexity