SlideShare a Scribd company logo
VerticalScope
Scalable Online Learning of
Topic Models with Spark
Alex Minnaar, Software Engineer
VerticalScope
VerticalScope Overview
Technology
Hobbies & Collectibles
Health & Wellness
Home & Garden
1,250 Community Sites
15+ Content Portals
42 Million Registered Members
84 Million Monthly Visitors
540 Million Monthly Pageviews
1.2 Billion Posts
2 Billion Monthly Ad Impressions
VerticalScope
What are People Talking About?
• There are thousands of conversations and discussions happening
everyday on a variety of topics.
• Goal: Discover what these topics of conversation are.
• Allows us to understand what community members are
interested in.
• Allows us to classify forum threads by topic.
• Applications in question answering, named entity recognition,
trend analysis etc.
• But how do we define topics?
!3
VerticalScope
Topic Modelling
• Most common Topic Modelling algorithm is Latent Dirichlet Allocation (LDA).
• Given a corpus of documents,
• Topics are defined as distributions over words.
• Documents are defined as distributions over topics.
!4
Probabilistic Topic Models, David Blei
VerticalScope
Topic Modelling (continued)
• If done correctly, high probability words in a topic will be
thematically coherent. e.g. a topic about “batteries”
!
!
!
• Similarly, high probability topics in a document indicate themes
that are present in the document.
!5
battery charge voltage charger current power draw ...
VerticalScope
The LDA Algorithm
• Compute an approximation to the intractable probability distribution of interest
by optimizing a loss function (variational inference). Optimization is done via
EM-style coordinate ascent.
!6
make a pass through
all documents in corpus
for each global topic update
global topic update
VerticalScope
LDA Algorithm (continued)
!7
pw1,t1 pw1,t2 … pw1,tk
pw2,t1 pw2,t2 …
!
pw2,tk
… … … …
pwv,t1
!
pwv,t2 … pwv,tk
t1 t2 … tK
w1
w2
wV
…
Words
Topics probability of
w1 in topic k
VerticalScope
Advantages and Disadvantages of
LDA
!
Completely unsupervised
• Can learn topics without the
need for annotated training data.
Intuitive
• Forum threads can intuitively be
thought of a documents in a
corpus.
Built-in classification
• Documents are distributions
over topics. Can classify
documents by high probability
topics.
!8
!
Not scalable
• Each global topic update
requires one full pass over the
corpus. Entire corpus must fit in
main memory. Not feasible for
us.
Inefficient to update a model
• New threads are constantly
being created. Want to update
model, not re-run it on entire
corpus.
Advantages Disadvantages
VerticalScope
New Strategy: Online LDA
!9
Full Corpus
sample minibatch
of documents Update LDA
Model
E-Step + M-Step
minibatch
repeat until convergence
Based on Online Learning for Latent Dirichlet Allocation by Hoffman et al.
same E-Step as before but applied to minibatch
rather than entire corpus
sequentially merge minibatch topics
with a global topic matrix
VerticalScope
Online LDA (continued)
• Scalable in terms of memory.
• Only one minibatch needs to be in memory at each update
step (rather than entire corpus).
• Can scale to massive corpora on a single machine!
• Updating the model is easy.
• Just treat new documents as another minibatch.
• But still not as scalable as we would like…
!10
VerticalScope
Scaling Online LDA
• Online LDA is scalable in terms of memory but it is still very
slow for large corpora.
• We want fast model prototyping. We don’t want to wait hours to
learn a model.
• New threads are constantly being created. We would like the
ability to update our model quickly.
• We can speed things up if we process each minibatch in a
distributed fashion.
!11
VerticalScope
Distributed Online LDA
!12
update LDA
model
Distributed E-Step +
M-Step
Full Corpus
sample minibatch
of documents
minibatch
repeat until convergence
E-Step now performed in
parallel for each document
in the minibatch then aggregated
i.e. Map then Reduce
VerticalScope
Spark Basics
• All distributed computations are done on RDDs (Resilient
Distributed Datasets).
• Main operations are
• Transformations: create a new RDD from an old one e.g.
map, filter, flatMap etc.
• Actions: return an RDD computation to driver e.g. reduce,
count, collect etc.
!13
VerticalScope
Spark Basics (continued)
• RDDs are kept in memory as much as possible which minimizes
disk serialization which makes distributed iterative algorithms
much faster than Spark’s predecessors.
!14
Hadoop
Spark
VerticalScope
Distributed Online LDA with Spark
• Each minibatch is represented by an RDD of documents.
• The LDA model that is being updated is represented by an MLlib
indexedRowMatrix (an RDD of matrix rows).
• The E-Step (getting the “intermediate” topics for the minibatch) is
performed in parallel with a map followed by a reduce.
• The M-Step is performed by a join between the E-Step’s
“intermediate” topics and the current LDA model.
!15
VerticalScope
Distributed Online LDA with Spark
(continued)
!16
doc 1
doc 2
…
doc n
(w1,1, freq),…
(w2,1, freq),…
…
(wn,1, freq),…
(w1,1, freq, row1),…
(w2,1, freq, row1),…
…
(wn,1, freq, row1),…
(w1,1, row_update1),…
(w2,1, row_update1),…
…
(wn,1, row_update1),…
(w1, update_sum),…
…
(wS, update_sum),…
(w1, new_row)
(w2, new_row)
…
(wV, new_row)
minibatch
RDD “Intermediate”
topics
updated
model
(indexedRowMatrix)
put docs in
bag-of-words
form
join words with
rows from current
model
perform E-Step
Aggregate row updates
(flatMap +
reduceByKey)
merge updates
with current
rows (M-Step)
word 1 in
document 1
row of LDA
model for w1
S = unique words
in minibatch
VerticalScope
Results: Topic Discovery
!17
Lights lights, bulbs, headlights, light, fog, bright, led, …
Batteries battery, charge, voltage, alternator, charging, current, volt, …
Tires tires, set, tread, good, traction, sidewall, rims, grip,…
Wiring wire, wiring, harness, ground, connector, circuit, pin,…
Wipers windshield, washer, wiper, wipers, blades, rain, windscreen,..
Gas gas, tank, gallon, station, gallons, tanks, filled, filler, …
Fuel Pumps fuel, pump, pumps, regulator, rail, injection, tank,…
Doors door, doors, open, lock, handle, latch, locks, closed, hatch,…
Towing trailer, tow, towing, hitch, boat, load, camper, pull, loaded,…
Mileage mileage, mpg, miles, highway, average, fuel, economy,…
Pricing price, buy, deal, good, cost, sell, worth, buying, prices,…
Auto-related Topics
VerticalScope
Topic Discovery (continued)
!18
General Problems issue, problem, issues, fixed, solved, common,…
Sports play, team, game, win, playing, played, games,…
Politics vote, speech, party, media, country, news, obama,…
Law Enforcement law, police, cops, legal, laws, illegal, court, officer,…
Movies/TV movie, series, movies, favorite,tv, watch, episodes,…
Non-Automotive Topics
Spam Topics
Spam/Documents real, fake, passports, documents, licenses, driver,…
Spam/Health treat, skin, infections, cure, ear, urinary, allergic,…
Spam/
Pharmaceuticals
prescription, online, buy ,rx, generic, delivery,…
VerticalScope
Results: Topic Detection
!19
2000XJ with (120k original) 9k on a reman engine. I've heard a belt squeak sound develop over the past
couple of weeks. I previously replaced the belt, water pump, power steering pump when i had the
engine replaced back in March. I checked the tension on the belt and it was at 150 lbs (I checked from 3
different places - all consistent). I did get the warranty on the reman so if this is warranty work, my shop
will take care of it, or most of it. I'm just unsure what this problem could be. It sounds more like
scrapping than belt chirping and it sounds like it's in the middle of the engine - not the PS Pulley or AC on
the ends. Should I first try to tighten up from the 150 pounds on the belt? I'd normally attempt this
before posting, but it is within spec for used belt (140-160) and it may be warranty work and i'd rather
have the shop take a look since I have a hectic work schedule. Any thoughts?
i had to replace the belt, alternator, and the tensioner a few times because of squeaks on my old xj. i
don't know why but after a few swaps it got a set it liked and stopped making noise
Yeah, i don't get it. Could this cause any damage? The belt looks fine and the noise developed over the
past 2 weeks. It sounds like its coming out of the center pulleys.
I have the same issue. I've replaced the idler pulley, power steering pump and two belts. Still have a
slight squeak. Sounds like it's coming from the water pump. Could it be the water pump squeaking? Is this
common?
I replaced that with the reman engine. Very odd. I'm just worried i'm doing additional damage. i didn't
have a chance to fully check it out this weekend but i'll take a look next weekend.
Just a word of caution. Last august I was on my way home from town 25 mile trip one way. I had the
power steering pulley explode and destroy the electric fan blade. Apparently some of the newer ones are
made of plastic. Atleast it looks like plastic. I fought mine with the squeaks as well even with a new belt.
Dayco makes a tension gauge. I just kept putting tension on till it wouldn't squeal with the a.c. on and
revving the engine. Just a final note. I noticed a few days before the pulley broke that there were cracks
in the center of the pulley where it meets the hub. Broke before I could replace it.
So not only is this the second belt in a week to squeak but now this is the second one the shred while
driving.
belt, pulley, belts, water,
pulleys, pump,…
noise, sound, hear, sounds,
vibration, loud, rattle,…
replace, replacement, mechanic,
repair, rebuild,…
seal, bearings, seals, rubber,
rear, assembly,…
clean, water, dry, cleaning,
wash, cleaner, dirty,…
Topics Detected in Thread
Example Thread
VerticalScope
Other Uses for Topics
• Obviously topic discovery and thread classification through topic
detection but also…
• Spam detection.
• Threads with high probability of spam topics probably contain
spam!
• Topic modelling + NER.
• Which automotive makes/models are mentioned in which
topics?
• Topic modelling + NER + anomaly detection.
• detecting sudden spikes in automotive make/model mentions in
a particular topic.
!20
VerticalScope
Alex Minnaar
Software Engineer
!
Email : aminnaar@verticalscope.com
Contact Us

More Related Content

Similar to VS_TextByTheBay

Tesla switch
Tesla switchTesla switch
Tesla switch
ipascu
 
Chalmers microprocessor sept 2010
Chalmers microprocessor sept 2010Chalmers microprocessor sept 2010
Chalmers microprocessor sept 2010
parallellabs
 

Similar to VS_TextByTheBay (20)

Quality code 2019
Quality code 2019Quality code 2019
Quality code 2019
 
Agile Engineering for Managers Workshop
Agile Engineering for Managers WorkshopAgile Engineering for Managers Workshop
Agile Engineering for Managers Workshop
 
PLAN Tech Day 2016
PLAN Tech Day 2016PLAN Tech Day 2016
PLAN Tech Day 2016
 
Re-inventing the Database: What to Keep and What to Throw Away
Re-inventing the Database: What to Keep and What to Throw AwayRe-inventing the Database: What to Keep and What to Throw Away
Re-inventing the Database: What to Keep and What to Throw Away
 
John P. Uyemura - Cmos Logic Circuit Design (1999, Springer) - libgen.lc.pdf
John P. Uyemura - Cmos Logic Circuit Design (1999, Springer) - libgen.lc.pdfJohn P. Uyemura - Cmos Logic Circuit Design (1999, Springer) - libgen.lc.pdf
John P. Uyemura - Cmos Logic Circuit Design (1999, Springer) - libgen.lc.pdf
 
Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, EuropePatterns of the Lambda Architecture -- 2015 April - Hadoop Summit, Europe
Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, Europe
 
Tesla switch
Tesla switchTesla switch
Tesla switch
 
Solar pv design1012
Solar pv design1012Solar pv design1012
Solar pv design1012
 
Solar pv design1012
Solar pv design1012Solar pv design1012
Solar pv design1012
 
Pitfalls of Object Oriented Programming
Pitfalls of Object Oriented ProgrammingPitfalls of Object Oriented Programming
Pitfalls of Object Oriented Programming
 
A friendly introduction to differential equations
A friendly introduction to differential equationsA friendly introduction to differential equations
A friendly introduction to differential equations
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
Chalmers microprocessor sept 2010
Chalmers microprocessor sept 2010Chalmers microprocessor sept 2010
Chalmers microprocessor sept 2010
 
GRAMMAR GRADE 3.pdf
GRAMMAR GRADE 3.pdfGRAMMAR GRADE 3.pdf
GRAMMAR GRADE 3.pdf
 
Scholastic_Success_Grammer_Grade_3.pdf
Scholastic_Success_Grammer_Grade_3.pdfScholastic_Success_Grammer_Grade_3.pdf
Scholastic_Success_Grammer_Grade_3.pdf
 
Agile in a Nutshell
Agile in a NutshellAgile in a Nutshell
Agile in a Nutshell
 
25
2525
25
 
Distributed processing of large graphs in python
Distributed processing of large graphs in pythonDistributed processing of large graphs in python
Distributed processing of large graphs in python
 
Large Components in the Rearview Mirror
Large Components in the Rearview MirrorLarge Components in the Rearview Mirror
Large Components in the Rearview Mirror
 
Coaching teams in creative problem solving
Coaching teams in creative problem solvingCoaching teams in creative problem solving
Coaching teams in creative problem solving
 

VS_TextByTheBay

  • 1. VerticalScope Scalable Online Learning of Topic Models with Spark Alex Minnaar, Software Engineer
  • 2. VerticalScope VerticalScope Overview Technology Hobbies & Collectibles Health & Wellness Home & Garden 1,250 Community Sites 15+ Content Portals 42 Million Registered Members 84 Million Monthly Visitors 540 Million Monthly Pageviews 1.2 Billion Posts 2 Billion Monthly Ad Impressions
  • 3. VerticalScope What are People Talking About? • There are thousands of conversations and discussions happening everyday on a variety of topics. • Goal: Discover what these topics of conversation are. • Allows us to understand what community members are interested in. • Allows us to classify forum threads by topic. • Applications in question answering, named entity recognition, trend analysis etc. • But how do we define topics? !3
  • 4. VerticalScope Topic Modelling • Most common Topic Modelling algorithm is Latent Dirichlet Allocation (LDA). • Given a corpus of documents, • Topics are defined as distributions over words. • Documents are defined as distributions over topics. !4 Probabilistic Topic Models, David Blei
  • 5. VerticalScope Topic Modelling (continued) • If done correctly, high probability words in a topic will be thematically coherent. e.g. a topic about “batteries” ! ! ! • Similarly, high probability topics in a document indicate themes that are present in the document. !5 battery charge voltage charger current power draw ...
  • 6. VerticalScope The LDA Algorithm • Compute an approximation to the intractable probability distribution of interest by optimizing a loss function (variational inference). Optimization is done via EM-style coordinate ascent. !6 make a pass through all documents in corpus for each global topic update global topic update
  • 7. VerticalScope LDA Algorithm (continued) !7 pw1,t1 pw1,t2 … pw1,tk pw2,t1 pw2,t2 … ! pw2,tk … … … … pwv,t1 ! pwv,t2 … pwv,tk t1 t2 … tK w1 w2 wV … Words Topics probability of w1 in topic k
  • 8. VerticalScope Advantages and Disadvantages of LDA ! Completely unsupervised • Can learn topics without the need for annotated training data. Intuitive • Forum threads can intuitively be thought of a documents in a corpus. Built-in classification • Documents are distributions over topics. Can classify documents by high probability topics. !8 ! Not scalable • Each global topic update requires one full pass over the corpus. Entire corpus must fit in main memory. Not feasible for us. Inefficient to update a model • New threads are constantly being created. Want to update model, not re-run it on entire corpus. Advantages Disadvantages
  • 9. VerticalScope New Strategy: Online LDA !9 Full Corpus sample minibatch of documents Update LDA Model E-Step + M-Step minibatch repeat until convergence Based on Online Learning for Latent Dirichlet Allocation by Hoffman et al. same E-Step as before but applied to minibatch rather than entire corpus sequentially merge minibatch topics with a global topic matrix
  • 10. VerticalScope Online LDA (continued) • Scalable in terms of memory. • Only one minibatch needs to be in memory at each update step (rather than entire corpus). • Can scale to massive corpora on a single machine! • Updating the model is easy. • Just treat new documents as another minibatch. • But still not as scalable as we would like… !10
  • 11. VerticalScope Scaling Online LDA • Online LDA is scalable in terms of memory but it is still very slow for large corpora. • We want fast model prototyping. We don’t want to wait hours to learn a model. • New threads are constantly being created. We would like the ability to update our model quickly. • We can speed things up if we process each minibatch in a distributed fashion. !11
  • 12. VerticalScope Distributed Online LDA !12 update LDA model Distributed E-Step + M-Step Full Corpus sample minibatch of documents minibatch repeat until convergence E-Step now performed in parallel for each document in the minibatch then aggregated i.e. Map then Reduce
  • 13. VerticalScope Spark Basics • All distributed computations are done on RDDs (Resilient Distributed Datasets). • Main operations are • Transformations: create a new RDD from an old one e.g. map, filter, flatMap etc. • Actions: return an RDD computation to driver e.g. reduce, count, collect etc. !13
  • 14. VerticalScope Spark Basics (continued) • RDDs are kept in memory as much as possible which minimizes disk serialization which makes distributed iterative algorithms much faster than Spark’s predecessors. !14 Hadoop Spark
  • 15. VerticalScope Distributed Online LDA with Spark • Each minibatch is represented by an RDD of documents. • The LDA model that is being updated is represented by an MLlib indexedRowMatrix (an RDD of matrix rows). • The E-Step (getting the “intermediate” topics for the minibatch) is performed in parallel with a map followed by a reduce. • The M-Step is performed by a join between the E-Step’s “intermediate” topics and the current LDA model. !15
  • 16. VerticalScope Distributed Online LDA with Spark (continued) !16 doc 1 doc 2 … doc n (w1,1, freq),… (w2,1, freq),… … (wn,1, freq),… (w1,1, freq, row1),… (w2,1, freq, row1),… … (wn,1, freq, row1),… (w1,1, row_update1),… (w2,1, row_update1),… … (wn,1, row_update1),… (w1, update_sum),… … (wS, update_sum),… (w1, new_row) (w2, new_row) … (wV, new_row) minibatch RDD “Intermediate” topics updated model (indexedRowMatrix) put docs in bag-of-words form join words with rows from current model perform E-Step Aggregate row updates (flatMap + reduceByKey) merge updates with current rows (M-Step) word 1 in document 1 row of LDA model for w1 S = unique words in minibatch
  • 17. VerticalScope Results: Topic Discovery !17 Lights lights, bulbs, headlights, light, fog, bright, led, … Batteries battery, charge, voltage, alternator, charging, current, volt, … Tires tires, set, tread, good, traction, sidewall, rims, grip,… Wiring wire, wiring, harness, ground, connector, circuit, pin,… Wipers windshield, washer, wiper, wipers, blades, rain, windscreen,.. Gas gas, tank, gallon, station, gallons, tanks, filled, filler, … Fuel Pumps fuel, pump, pumps, regulator, rail, injection, tank,… Doors door, doors, open, lock, handle, latch, locks, closed, hatch,… Towing trailer, tow, towing, hitch, boat, load, camper, pull, loaded,… Mileage mileage, mpg, miles, highway, average, fuel, economy,… Pricing price, buy, deal, good, cost, sell, worth, buying, prices,… Auto-related Topics
  • 18. VerticalScope Topic Discovery (continued) !18 General Problems issue, problem, issues, fixed, solved, common,… Sports play, team, game, win, playing, played, games,… Politics vote, speech, party, media, country, news, obama,… Law Enforcement law, police, cops, legal, laws, illegal, court, officer,… Movies/TV movie, series, movies, favorite,tv, watch, episodes,… Non-Automotive Topics Spam Topics Spam/Documents real, fake, passports, documents, licenses, driver,… Spam/Health treat, skin, infections, cure, ear, urinary, allergic,… Spam/ Pharmaceuticals prescription, online, buy ,rx, generic, delivery,…
  • 19. VerticalScope Results: Topic Detection !19 2000XJ with (120k original) 9k on a reman engine. I've heard a belt squeak sound develop over the past couple of weeks. I previously replaced the belt, water pump, power steering pump when i had the engine replaced back in March. I checked the tension on the belt and it was at 150 lbs (I checked from 3 different places - all consistent). I did get the warranty on the reman so if this is warranty work, my shop will take care of it, or most of it. I'm just unsure what this problem could be. It sounds more like scrapping than belt chirping and it sounds like it's in the middle of the engine - not the PS Pulley or AC on the ends. Should I first try to tighten up from the 150 pounds on the belt? I'd normally attempt this before posting, but it is within spec for used belt (140-160) and it may be warranty work and i'd rather have the shop take a look since I have a hectic work schedule. Any thoughts? i had to replace the belt, alternator, and the tensioner a few times because of squeaks on my old xj. i don't know why but after a few swaps it got a set it liked and stopped making noise Yeah, i don't get it. Could this cause any damage? The belt looks fine and the noise developed over the past 2 weeks. It sounds like its coming out of the center pulleys. I have the same issue. I've replaced the idler pulley, power steering pump and two belts. Still have a slight squeak. Sounds like it's coming from the water pump. Could it be the water pump squeaking? Is this common? I replaced that with the reman engine. Very odd. I'm just worried i'm doing additional damage. i didn't have a chance to fully check it out this weekend but i'll take a look next weekend. Just a word of caution. Last august I was on my way home from town 25 mile trip one way. I had the power steering pulley explode and destroy the electric fan blade. Apparently some of the newer ones are made of plastic. Atleast it looks like plastic. I fought mine with the squeaks as well even with a new belt. Dayco makes a tension gauge. I just kept putting tension on till it wouldn't squeal with the a.c. on and revving the engine. Just a final note. I noticed a few days before the pulley broke that there were cracks in the center of the pulley where it meets the hub. Broke before I could replace it. So not only is this the second belt in a week to squeak but now this is the second one the shred while driving. belt, pulley, belts, water, pulleys, pump,… noise, sound, hear, sounds, vibration, loud, rattle,… replace, replacement, mechanic, repair, rebuild,… seal, bearings, seals, rubber, rear, assembly,… clean, water, dry, cleaning, wash, cleaner, dirty,… Topics Detected in Thread Example Thread
  • 20. VerticalScope Other Uses for Topics • Obviously topic discovery and thread classification through topic detection but also… • Spam detection. • Threads with high probability of spam topics probably contain spam! • Topic modelling + NER. • Which automotive makes/models are mentioned in which topics? • Topic modelling + NER + anomaly detection. • detecting sudden spikes in automotive make/model mentions in a particular topic. !20
  • 21. VerticalScope Alex Minnaar Software Engineer ! Email : aminnaar@verticalscope.com Contact Us