Real-Time Top-R Topic Detection on
Twitter with Topic Hijack Filtering
Kohei Hayashi (National Institute of Informatics)
August 24, 2016
Joint work with
• Takanori Maehara (Shizuoka Univ)
• Masashi Toyoda (Univ Tokyo)
• Ken-ichi Kawarabayashi (National Institute of
Informatics)
Twitter: The Rapid Stream of Texts
An SNS built on short messages (tweets)
Volume: 41M users
Diversity: covers any topic (news, politics, TV, ...)
Speed: 270K tweets/min
A promising data source for topic detection
• May surface breaking news and events even faster
than news media
Two Challenges
1. Topic detection in real time
2. Noise filtering
Noise Filtering
Many spam tweets are generated by non-human sources
• e.g. “tweet buttons”
These exaggerate word co-occurrence and “hijack” important topics
Contributions
A streaming topic detection algorithm based on
non-negative matrix factorization (NMF)
1. Highly scalable:
Able to deal with a 20M×1M sparse matrix/sec
2. Automatic topic hijacking detection & elimination
Technical Points
1. Reformulate NMF in a stochastic manner
• Stochastic gradient descent (SGD) updates in O(NNZ) time
2. Use of statistical testing
• Assume normal topics follow a power law
Streaming NMF
Topic Detection by NMF
[Figure: tweets → user-word co-occurrence frequency table (user × word, mostly zeros) → topics (user and word factors)]
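Problem
$$\min_{U \in \mathbb{R}_+^{I \times R},\; V \in \mathbb{R}_+^{J \times R}} f_\lambda(X; U, V),$$
$$f_\lambda(X; U, V) = \frac{1}{2}\|X - UV^\top\|_F^2 + \frac{\lambda}{2}\left(\|U\|_F^2 + \|V\|_F^2\right)$$
Batch Algorithm
Repeat until convergence:
1. $U \leftarrow [U - \eta \nabla_U f_\lambda(X; U, V)]_+$
2. $V \leftarrow [V - \eta \nabla_V f_\lambda(X; U, V)]_+$
Guaranteed to converge to a stationary point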
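The batch updates above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' code: the dense toy matrix, the fixed step size `eta`, the iteration count and the function names are all assumptions chosen for readability.

```python
import numpy as np

def f_lambda(X, U, V, lam):
    """Regularized squared loss f_lambda(X; U, V)."""
    resid = X - U @ V.T
    return 0.5 * np.sum(resid ** 2) + 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))

def batch_nmf(X, R, lam=0.1, eta=1e-3, n_iter=300, seed=0):
    """Projected gradient descent: gradient step, then [.]_+ clipping at zero."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    U = rng.random((I, R))
    V = rng.random((J, R))
    losses = [f_lambda(X, U, V, lam)]
    for _ in range(n_iter):
        grad_U = -(X - U @ V.T) @ V + lam * U      # gradient of f_lambda w.r.t. U
        U = np.maximum(U - eta * grad_U, 0.0)
        grad_V = -(X - U @ V.T).T @ U + lam * V    # gradient of f_lambda w.r.t. V
        V = np.maximum(V - eta * grad_V, 0.0)
        losses.append(f_lambda(X, U, V, lam))
    return U, V, losses

# Toy count matrix standing in for the user-word table (assumed data).
X = np.random.default_rng(1).poisson(1.0, size=(30, 20)).astype(float)
U, V, losses = batch_nmf(X, R=5)
```

Every iteration touches the full matrix X, which is the cost the streaming formulation is designed to avoid.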
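On Twitter, we observe $X^{(t)}$ at each time $t$
• Consider keeping track of $\bar{X}^{(t)} = \frac{1}{t}\sum_{s=1}^{t} X^{(s)}$
⇒ Constructing $\bar{X}^{(t)}$ and solving $f_\lambda(\bar{X}^{(t)}; U, V)$ is time-consuming
Key idea: decompose $f_\lambda$ over the time steps
$$\|\bar{X}^{(t)} - UV^\top\|_F^2 = \frac{1}{t}\sum_{s}\|X^{(s)} - UV^\top\|_F^2 + \mathrm{const}$$
Stochastic Formulation
If the $X^{(t)}$ are i.i.d. random variables,
$$\|\mathrm{E}[X^{(t)}] - UV^\top\|_F^2 = \mathrm{E}\|X^{(t)} - UV^\top\|_F^2 - \mathrm{var}[X^{(t)}]$$
where $\mathrm{var}[X^{(t)}] = \sum_{i,j}\mathrm{var}[X^{(t)}_{ij}]$ does not depend on $U$ or $V$
Now we can use SGD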
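A quick numerical check of the decomposition may help: the loss on the running mean equals the average per-step loss plus a term that does not involve U or V. The sizes and random data below are arbitrary assumptions; the constant is computed explicitly as minus the empirical total variance of the X^(s).

```python
import numpy as np

rng = np.random.default_rng(0)
t, I, J, R = 8, 6, 5, 2
Xs = rng.poisson(0.5, size=(t, I, J)).astype(float)   # X^(1), ..., X^(t)
U, V = rng.random((I, R)), rng.random((J, R))
M = U @ V.T

X_bar = Xs.mean(axis=0)
lhs = np.sum((X_bar - M) ** 2)                             # loss on the mean matrix
mean_loss = np.mean([np.sum((X - M) ** 2) for X in Xs])    # average per-step loss
const = -np.mean([np.sum((X - X_bar) ** 2) for X in Xs])   # independent of U and V
```

Because the constant is the same for every (U, V), minimizing the average per-step loss is equivalent to minimizing the loss on the mean matrix.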
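Streaming Algorithm
For $t = 1, \ldots, T$:
1. $A_U \leftarrow \nabla_U \nabla_U f_\lambda$, $A_V \leftarrow \nabla_V \nabla_V f_\lambda$ (metrics)
2. $U^{(t)} \leftarrow [U^{(t-1)} - \eta_t \nabla_U f_\lambda(X^{(t)}; U, V^{(t-1)}) A_U^{-1}]_+$
3. $V^{(t)} \leftarrow [V^{(t-1)} - \eta_t \nabla_V f_\lambda(X^{(t)}; U^{(t)}, V) A_V^{-1}]_+$
⇒ Guaranteed to converge to a stationary point of $f_\lambda(\bar{X}^{(t)}; U, V)$ for suitable $\{\eta_t\}$ and i.i.d. $\{X^{(t)}\}$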
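The loop above might look roughly like this in NumPy. Several choices here are assumptions rather than details from the slides: the step-size schedule `1 / (1 + t)`, dense mini-batches (a sparse matrix type would be needed to get NNZ-proportional cost), and forming the metric for V from the freshly updated U rather than alongside the metric for U in step 1.

```python
import numpy as np

def streaming_nmf(stream, I, J, R, lam=0.1, seed=0):
    """One scaled, projected SGD step on U and on V per incoming X_t."""
    rng = np.random.default_rng(seed)
    U = rng.random((I, R))
    V = rng.random((J, R))
    eye = np.eye(R)
    for t, X_t in enumerate(stream, start=1):
        eta = 1.0 / (1.0 + t)                      # assumed step-size schedule
        # Metric for U: Hessian of f_lambda w.r.t. U, an R x R matrix.
        A_U = V.T @ V + lam * eye
        grad_U = U @ A_U - X_t @ V                 # gradient w.r.t. U
        # grad_U @ inv(A_U), computed with a linear solve (A_U is symmetric).
        U = np.maximum(U - eta * np.linalg.solve(A_U, grad_U.T).T, 0.0)
        # Metric for V, formed from the updated U (an assumption of this sketch).
        A_V = U.T @ U + lam * eye
        grad_V = V @ A_V - X_t.T @ U               # gradient w.r.t. V
        V = np.maximum(V - eta * np.linalg.solve(A_V, grad_V.T).T, 0.0)
    return U, V

# Toy stream: Poisson counts around a fixed low-rank mean (assumed data).
rng = np.random.default_rng(1)
M = 0.2 * rng.random((40, 3)) @ rng.random((25, 3)).T
stream = (rng.poisson(M).astype(float) for _ in range(200))
U, V = streaming_nmf(stream, I=40, J=25, R=3)
```

Each X_t is consumed once and can be discarded afterwards, so memory use does not grow with t.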
Comparing with the Batch NMF ...
Much faster
• Updates U and V only once per $X^{(t)}$!
• Update cost depends on $\mathrm{NNZ}(X^{(t)}) \ll \mathrm{NNZ}(\bar{X}^{(t)})$
Memory efficient
• Able to discard $X^{(1)}, \ldots, X^{(t-1)}$
Smoothing effect
$$\begin{aligned} U^{(t)} &= [(1 - \eta_t) U^{(t-1)} + \eta_t X^{(t)} V^{(t-1)} A_U^{-1}]_+ \\ &= [(1 - \eta_t) U^{(t-1)} + \eta_t \operatorname{argmin}_U f_\lambda(X^{(t)}; U, V^{(t-1)})]_+ \end{aligned}$$
• The $\eta_t$-weighted average of the previous solution and the one-step NMF solution for $X^{(t)}$
• Mitigates the sparsity of $X^{(t)}$
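The smoothing identity can be verified numerically. The sketch below assumes the metric is A_U = VᵀV + λI (the Hessian of f_λ with respect to U) and uses arbitrary toy sizes; it checks that the scaled gradient step equals the weighted average, and that X V A_U⁻¹ is the unconstrained minimizer over U.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, R, lam, eta = 6, 5, 2, 0.1, 0.3
X = rng.poisson(0.5, size=(I, J)).astype(float)
U, V = rng.random((I, R)), rng.random((J, R))

A_U = V.T @ V + lam * np.eye(R)            # assumed metric (Hessian w.r.t. U)
A_U_inv = np.linalg.inv(A_U)
grad_U = U @ A_U - X @ V                   # gradient of f_lambda w.r.t. U
U_star = X @ V @ A_U_inv                   # unconstrained argmin over U

sgd_form = np.maximum(U - eta * grad_U @ A_U_inv, 0.0)
avg_form = np.maximum((1 - eta) * U + eta * U_star, 0.0)
grad_at_U_star = U_star @ A_U - X @ V      # should vanish at the minimizer
```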
Topic Hijacking Detection
Problem Setting
Goal: Find hijacked topics
Idea: Check the distributions of all topics
Hypothesis
The word dist of a hijacked topic should be different
from the word dist of a normal topic
• NMF estimates topic-specific word dists as V
$$X_{ij} \propto p(\mathrm{user}_i, \mathrm{word}_j) \propto \sum_r p(\mathrm{topic}_r)\, p(\mathrm{user}_i \mid \mathrm{topic}_r)\, p(\mathrm{word}_j \mid \mathrm{topic}_r) \propto \sum_r u_{ir} v_{jr}$$
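To make "word dists as V" concrete, one plausible reading is that each column of V, once normalized to sum to one, acts as p(word | topic). The toy matrix and the column normalization below are assumptions for illustration.

```python
import numpy as np

# Toy word factor: J = 4 words (rows), R = 2 topics (columns).
V = np.array([[4.0, 0.0],
              [3.0, 1.0],
              [1.0, 1.0],
              [0.0, 2.0]])

word_dist = V / V.sum(axis=0, keepdims=True)    # column r ~ p(word_j | topic_r)
ranked = np.sort(word_dist, axis=0)[::-1]       # per-topic probabilities by rank
```

The rank-ordered probabilities in `ranked` are the quantities whose shape (heavy tail versus step) the hijacking test examines.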
Defining Normal/Hijacked Topics
Normal topics:
• Many users are involved
⇒ Many different vocabularies are mixed
⇒ Heavy-tailed (Zipf’s law)
Definition: A Normal Topic
Topic r is normal if
$p(\mathrm{rank}(\mathrm{word}) \mid \mathrm{topic}_r) = \mathrm{power}(\alpha)$
Defining Normal/Hijacked Topics (Cont’d)
Hijacked topics:
• The same word set is used repeatedly
⇒ Uniform probabilities and (almost) no tail
Definition: A Hijacked Topic
Topic r is hijacked if
$p(\mathrm{rank}(\mathrm{word}) \mid \mathrm{topic}_r) = \mathrm{step}(\mathrm{rank}_{\min})$
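Log-likelihood Ratio Test
$$L(\mathrm{rank}_{\min}) = \sum_j \log \frac{\mathrm{step}(\mathrm{rank}_j \mid \mathrm{rank}_{\min})}{\mathrm{power}(\mathrm{rank}_j \mid \hat{\alpha})}$$
Theorem (Asymptotic normality [Vuong’89])
Let N be the number of observed words. Then $L(\mathrm{rank}_{\min})/\sqrt{N}$ converges in distribution to $\mathcal{N}(0, \sigma^2)$, where
$$\sigma^2 = \frac{1}{N}\sum_{j=1}^{N}\left(\log \frac{\mathrm{step}(\mathrm{rank}_j \mid \mathrm{rank}_{\min})}{\mathrm{power}(\mathrm{rank}_j \mid \hat{\alpha})}\right)^2 - \left(\frac{1}{N} L(\mathrm{rank}_{\min})\right)^2.$$
For $r = 1, \ldots, R$:
• Estimate $\hat{\alpha} = \operatorname{argmax}_\alpha \log \mathrm{power}(\mathrm{rank}(u_k) \mid \alpha)$
• For $\mathrm{rank}_{\min} = 1, \ldots, 140$:
  • Compute $L(\mathrm{rank}_{\min})$
  • Topic r is hijacked if p-val < 0.05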
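The test procedure can be sketched as below. The slides do not give the exact forms of `power` and `step`, so this sketch assumes a truncated power law on ranks 1..J and a step law that puts a small smoothing mass `eps` on ranks above `rank_min`; the grid search for α, the one-sided p-value, and the rule "hijacked if any rank_min rejects" are likewise assumptions.

```python
import math
import numpy as np

def log_power(ranks, alpha, J):
    """Log-pmf of a truncated power law p(rank) ∝ rank^(-alpha) on 1..J."""
    log_z = math.log(np.sum(np.arange(1, J + 1, dtype=float) ** -alpha))
    return -alpha * np.log(ranks) - log_z

def log_step(ranks, rank_min, J, eps=1e-3):
    """Log-pmf of a step law: mass 1-eps uniform on ranks <= rank_min, eps on the rest."""
    head = math.log((1.0 - eps) / rank_min)
    tail = math.log(eps / (J - rank_min))
    return np.where(ranks <= rank_min, head, tail)

def fit_alpha(ranks, J, grid=np.linspace(0.1, 3.0, 59)):
    """Maximum-likelihood alpha by grid search."""
    return max(grid, key=lambda a: log_power(ranks, a, J).sum())

def vuong_p_value(ranks, rank_min, alpha_hat, J):
    """One-sided p-value: small when the step law fits significantly better."""
    ell = log_step(ranks, rank_min, J) - log_power(ranks, alpha_hat, J)
    sigma = math.sqrt(max(np.mean(ell ** 2) - np.mean(ell) ** 2, 1e-12))
    z = ell.sum() / (math.sqrt(len(ranks)) * sigma)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def is_hijacked(ranks, J, max_rank_min=140, level=0.05):
    ranks = np.asarray(ranks, dtype=float)
    alpha_hat = fit_alpha(ranks, J)
    candidates = range(1, min(max_rank_min, J - 1) + 1)
    return any(vuong_p_value(ranks, m, alpha_hat, J) < level for m in candidates)

# Toy data: ranks of 500 observed words under each regime (assumed).
rng = np.random.default_rng(0)
J = 1000
support = np.arange(1, J + 1, dtype=float)
pmf = support ** -1.2
normal_ranks = rng.choice(support, size=500, p=pmf / pmf.sum())
hijacked_ranks = rng.integers(1, 11, size=500)   # same 10 words, used uniformly
```

On this toy data the heavy-tailed sample is kept and the uniform 10-word sample is flagged.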
Experiments
Data
Japanese Twitter stream
• April 15–16, 2013
• 417K users
• 1.98M words
• 15.3M tweets
• 69.4M co-occurrences
• Generated $X^{(t)}$ for every 10K co-occurrences
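Cutting the stream into mini-batches of a fixed number of co-occurrences could be done as follows. The representation of a co-occurrence as a `(user, word)` pair and of each X^(t) as a sparse count table are assumptions of this sketch, as is the helper name `minibatches`.

```python
from collections import Counter
from itertools import islice

def minibatches(pairs, batch_size=10_000):
    """Yield one sparse count table {(user, word): count} per batch_size pairs."""
    it = iter(pairs)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield Counter(chunk)

# Tiny example with batch_size=2 instead of 10K.
pairs = [("u1", "a"), ("u1", "a"), ("u2", "b"), ("u1", "c"), ("u2", "b")]
batches = list(minibatches(pairs, batch_size=2))
```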
Runtime
Runtime by fraction of the data used (m = minutes, h = hours)
NMF                    1%      10%     100%
Batch                  27.0m   1.9h    16.4h
Online [Cao+ 07]       5.6h    9.2h    17.8h
Dynamic [Saha+ 12]     16.7h   24h     24h
Streaming [this work]  4.0m    21.7m   3.6h
w/ Filter [this work]  5.3m    24.1m   3.8h
⇒ Streaming NMF: ×5–250 faster!
• 67K tweets/min
• Matches the real-time rate of all JP tweets! (Japanese is the 2nd most popular language on Twitter)
⇒ Filtering cost is negligible
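The headline numbers can be cross-checked against the Data slide and the runtime table. The assumption here is that the 67K tweets/min figure comes from the 15.3M tweets and the 3.8h "w/ Filter" runtime on 100% of the data; the slides do not say so explicitly.

```python
tweets = 15.3e6                 # tweets in the data set (Data slide)
hours_with_filter = 3.8         # "w/ Filter" runtime on 100% of the data
tweets_per_min = tweets / (hours_with_filter * 60)

# Extremes of the speed-up range, taken from the runtime table.
max_speedup = (16.7 * 60) / 4.0   # Dynamic vs. Streaming at 1% (hours -> minutes)
min_speedup = 16.4 / 3.6          # Batch vs. Streaming at 100%
```

Under that assumption the throughput comes out at about 67K tweets/min, and the speed-ups span roughly 4.6× to 250×, in line with the "×5–250" claim.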
Perplexity
• Similarity between a topic dist and a target dist
• Used Y! headlines (http://news.yahoo.co.jp/list) for the target term dist
• Lower is better
NMF                    1%        10%       100%
Batch                  9.01E+09  6.32E+06  4.51E+04
Online [Cao+ 07]       1.06E+04  3.25E+04  1.83E+04
Dynamic [Saha+ 12]     6.53E+04  N/A       N/A
Streaming [this work]  5.65E+07  3.25E+04  8.71E+03
w/ Filter [this work]  5.25E+09  2.40E+04  7.90E+03
⇒ Streaming NMF: best at 10% and 100% data
⇒ Topic hijacking filtering improves perplexity
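The slides do not spell out the perplexity formula. As an illustration only, the sketch below uses one common definition, the exponential of the cross-entropy between the target term distribution and the topic's word distribution; the smoothing constant `eps` and the function name are assumptions.

```python
import numpy as np

def perplexity(topic_dist, target_dist, eps=1e-12):
    """exp(cross-entropy) of topic_dist measured against target_dist."""
    p = np.asarray(topic_dist, dtype=float) + eps   # smooth to avoid log(0)
    p = p / p.sum()
    q = np.asarray(target_dist, dtype=float)
    q = q / q.sum()
    return float(np.exp(-np.sum(q * np.log(p))))

target = [0.5, 0.3, 0.15, 0.05]          # toy target term distribution
uniform_topic = [0.25, 0.25, 0.25, 0.25]
mismatched_topic = [0.05, 0.15, 0.3, 0.5]
```

With this definition a uniform topic over four words scores exactly 4, and a topic that matches the target scores lower than one that does not.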
Summary
Proposed a streaming algorithm for Twitter topic detection
• Works in real time
(could handle all JP tweets in theory)
• Filters spam topics automatically
Code: github.com/hayasick/streamingNMF
Thank you!
Integrated Twitter Topic Detection System
For $t = 1, \ldots, T$:
1. Generate $X^{(t)}$ from tweets $\notin$ Blacklist: $O(N_t)$
2. Update U, V by SGD: $O(N_t R^2 + R^3)$
3. At some interval, detect topic hijacking and update Blacklist: $O(J)$

More Related Content

What's hot

Sparse fourier transform
Sparse fourier transformSparse fourier transform
Sparse fourier transform
Aarthi Raghavendra
 
Tutorial: The Role of Event-Time Analysis Order in Data Streaming
Tutorial: The Role of Event-Time Analysis Order in Data StreamingTutorial: The Role of Event-Time Analysis Order in Data Streaming
Tutorial: The Role of Event-Time Analysis Order in Data Streaming
Vincenzo Gulisano
 
The data streaming processing paradigm and its use in modern fog architectures
The data streaming processing paradigm and its use in modern fog architecturesThe data streaming processing paradigm and its use in modern fog architectures
The data streaming processing paradigm and its use in modern fog architectures
Vincenzo Gulisano
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
MLconf
 
ODSC 2019: Sessionisation via stochastic periods for root event identification
ODSC 2019: Sessionisation via stochastic periods for root event identificationODSC 2019: Sessionisation via stochastic periods for root event identification
ODSC 2019: Sessionisation via stochastic periods for root event identification
Kuldeep Jiwani
 
Control System toolbox in Matlab
Control System toolbox in MatlabControl System toolbox in Matlab
Control System toolbox in Matlab
Abdul Sami
 
Europy17_dibernardo
Europy17_dibernardoEuropy17_dibernardo
Europy17_dibernardo
GIUSEPPE DI BERNARDO
 
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
AIST
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)
Vincenzo Gulisano
 
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEEuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
HONGJOO LEE
 
TIM: Large-scale Energy Forecasting in Julia
TIM: Large-scale Energy Forecasting in JuliaTIM: Large-scale Energy Forecasting in Julia
TIM: Large-scale Energy Forecasting in Julia
GapData Institute
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
Gael Varoquaux
 
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
MLconf
 
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian ApproachAutomatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
Spark Summit
 
Algorithm: Quick-Sort
Algorithm: Quick-SortAlgorithm: Quick-Sort
Algorithm: Quick-Sort
Tareq Hasan
 
Self-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policiesSelf-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policies
NECST Lab @ Politecnico di Milano
 
Distributed systems scheduling
Distributed systems schedulingDistributed systems scheduling
Distributed systems scheduling
Pragati Startup Presentation Designer firm
 
Implementation of dijsktra’s algorithm in parallel
Implementation of dijsktra’s algorithm in parallelImplementation of dijsktra’s algorithm in parallel
Implementation of dijsktra’s algorithm in parallel
Meenakshi Muthuraman
 
Apache Storm Basics
Apache Storm BasicsApache Storm Basics

What's hot (20)

Sparse fourier transform
Sparse fourier transformSparse fourier transform
Sparse fourier transform
 
Tutorial: The Role of Event-Time Analysis Order in Data Streaming
Tutorial: The Role of Event-Time Analysis Order in Data StreamingTutorial: The Role of Event-Time Analysis Order in Data Streaming
Tutorial: The Role of Event-Time Analysis Order in Data Streaming
 
The data streaming processing paradigm and its use in modern fog architectures
The data streaming processing paradigm and its use in modern fog architecturesThe data streaming processing paradigm and its use in modern fog architectures
The data streaming processing paradigm and its use in modern fog architectures
 
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16
 
ODSC 2019: Sessionisation via stochastic periods for root event identification
ODSC 2019: Sessionisation via stochastic periods for root event identificationODSC 2019: Sessionisation via stochastic periods for root event identification
ODSC 2019: Sessionisation via stochastic periods for root event identification
 
Control System toolbox in Matlab
Control System toolbox in MatlabControl System toolbox in Matlab
Control System toolbox in Matlab
 
Europy17_dibernardo
Europy17_dibernardoEuropy17_dibernardo
Europy17_dibernardo
 
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
Oleksandr Frei and Murat Apishev - Parallel Non-blocking Deterministic Algori...
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)
 
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEEuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
 
TIM: Large-scale Energy Forecasting in Julia
TIM: Large-scale Energy Forecasting in JuliaTIM: Large-scale Energy Forecasting in Julia
TIM: Large-scale Energy Forecasting in Julia
 
Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities Simple representations for learning: factorizations and similarities
Simple representations for learning: factorizations and similarities
 
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15
 
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian ApproachAutomatic Features Generation And Model Training On Spark: A Bayesian Approach
Automatic Features Generation And Model Training On Spark: A Bayesian Approach
 
Algorithm: Quick-Sort
Algorithm: Quick-SortAlgorithm: Quick-Sort
Algorithm: Quick-Sort
 
Self-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policiesSelf-adaptive container monitoring with performance-aware Load-Shedding policies
Self-adaptive container monitoring with performance-aware Load-Shedding policies
 
Distributed systems scheduling
Distributed systems schedulingDistributed systems scheduling
Distributed systems scheduling
 
Implementation of dijsktra’s algorithm in parallel
Implementation of dijsktra’s algorithm in parallelImplementation of dijsktra’s algorithm in parallel
Implementation of dijsktra’s algorithm in parallel
 
Ph.D. Defense
Ph.D. DefensePh.D. Defense
Ph.D. Defense
 
Apache Storm Basics
Apache Storm BasicsApache Storm Basics
Apache Storm Basics
 

Viewers also liked

Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...
Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...
Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...
Kohei Hayashi
 
Expected tensor decomposition with stochastic gradient descent
Expected tensor decomposition with stochastic gradient descentExpected tensor decomposition with stochastic gradient descent
Expected tensor decomposition with stochastic gradient descent
Kohei Hayashi
 
Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tr...
Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tr...Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tr...
Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tr...
Division of Biomedical Informatics, UC San Diego
 
Personalized Filtering of Twitter Stream
Personalized Filtering of Twitter StreamPersonalized Filtering of Twitter Stream
Personalized Filtering of Twitter Stream
Pavan Kapanipathi
 
Twitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
Twitter, Twinder, Twitcident: Filtering and Search in Social Web StreamsTwitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
Twitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
Web Information Systems, TU Delft
 
Several Qualitative Observations on Independent Blind Shopping
Several Qualitative Observations on Independent Blind ShoppingSeveral Qualitative Observations on Independent Blind Shopping
Several Qualitative Observations on Independent Blind Shopping
Vladimir Kulyukin
 
Vision-Based Localization & Text Chunking of Nutrition Fact Tables on Android...
Vision-Based Localization & Text Chunking of Nutrition Fact Tables on Android...Vision-Based Localization & Text Chunking of Nutrition Fact Tables on Android...
Vision-Based Localization & Text Chunking of Nutrition Fact Tables on Android...
Vladimir Kulyukin
 
PizzaRaptor
PizzaRaptorPizzaRaptor
PizzaRaptor
pizzaraptor
 
RFID in Robot-Assisted Indoor Navigation for the Visually Impaired
RFID in Robot-Assisted Indoor Navigation for the Visually ImpairedRFID in Robot-Assisted Indoor Navigation for the Visually Impaired
RFID in Robot-Assisted Indoor Navigation for the Visually Impaired
Vladimir Kulyukin
 
An Algorithm for In-Place Vision-Based Skewed 1D Barcode Scanning in the Cloud
An Algorithm for In-Place Vision-Based Skewed 1D Barcode Scanning in the CloudAn Algorithm for In-Place Vision-Based Skewed 1D Barcode Scanning in the Cloud
An Algorithm for In-Place Vision-Based Skewed 1D Barcode Scanning in the Cloud
Vladimir Kulyukin
 
ShopTalk: Independent Blind Shopping Through Verbal Route Directions and Barc...
ShopTalk: Independent Blind Shopping Through Verbal Route Directions and Barc...ShopTalk: Independent Blind Shopping Through Verbal Route Directions and Barc...
ShopTalk: Independent Blind Shopping Through Verbal Route Directions and Barc...
Vladimir Kulyukin
 
Presentatie Gastcollege Gamification voor Hogeschool Utrecht - ICT opleiding
Presentatie Gastcollege Gamification voor Hogeschool Utrecht - ICT opleidingPresentatie Gastcollege Gamification voor Hogeschool Utrecht - ICT opleiding
Presentatie Gastcollege Gamification voor Hogeschool Utrecht - ICT opleiding
Sasja Beerendonk
 
Stadsarchief Ieper zkt. publiek
Stadsarchief Ieper zkt. publiekStadsarchief Ieper zkt. publiek
Stadsarchief Ieper zkt. publiek
Archief 2.0
 
Op weg naar het zuidland
Op weg naar het zuidlandOp weg naar het zuidland
Op weg naar het zuidlandArchief 2.0
 
Jaslin's 21 day OBS journey
Jaslin's 21 day OBS journeyJaslin's 21 day OBS journey
Jaslin's 21 day OBS journeyTerence Lee
 
A Software Tool for Rapid Acquisition of Streetwise Geo-Referenced Maps
A Software Tool for Rapid Acquisition of Streetwise Geo-Referenced MapsA Software Tool for Rapid Acquisition of Streetwise Geo-Referenced Maps
A Software Tool for Rapid Acquisition of Streetwise Geo-Referenced Maps
Vladimir Kulyukin
 
Effective Nutrition Label Use on Smartphones
Effective Nutrition Label Use on SmartphonesEffective Nutrition Label Use on Smartphones
Effective Nutrition Label Use on Smartphones
Vladimir Kulyukin
 

Viewers also liked (20)

Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...
Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...
Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal...
 
Expected tensor decomposition with stochastic gradient descent
Expected tensor decomposition with stochastic gradient descentExpected tensor decomposition with stochastic gradient descent
Expected tensor decomposition with stochastic gradient descent
 
Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tr...
Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tr...Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tr...
Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tr...
 
Personalized Filtering of Twitter Stream
Personalized Filtering of Twitter StreamPersonalized Filtering of Twitter Stream
Personalized Filtering of Twitter Stream
 
Twitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
Twitter, Twinder, Twitcident: Filtering and Search in Social Web StreamsTwitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
Twitter, Twinder, Twitcident: Filtering and Search in Social Web Streams
 
Several Qualitative Observations on Independent Blind Shopping
Several Qualitative Observations on Independent Blind ShoppingSeveral Qualitative Observations on Independent Blind Shopping
Several Qualitative Observations on Independent Blind Shopping
 
Vision-Based Localization & Text Chunking of Nutrition Fact Tables on Android...
Vision-Based Localization & Text Chunking of Nutrition Fact Tables on Android...Vision-Based Localization & Text Chunking of Nutrition Fact Tables on Android...
Vision-Based Localization & Text Chunking of Nutrition Fact Tables on Android...
 
PizzaRaptor
PizzaRaptorPizzaRaptor
PizzaRaptor
 
Mengenal sunnah
Mengenal sunnahMengenal sunnah
Mengenal sunnah
 
RFID in Robot-Assisted Indoor Navigation for the Visually Impaired
RFID in Robot-Assisted Indoor Navigation for the Visually ImpairedRFID in Robot-Assisted Indoor Navigation for the Visually Impaired
RFID in Robot-Assisted Indoor Navigation for the Visually Impaired
 
An Algorithm for In-Place Vision-Based Skewed 1D Barcode Scanning in the Cloud
An Algorithm for In-Place Vision-Based Skewed 1D Barcode Scanning in the CloudAn Algorithm for In-Place Vision-Based Skewed 1D Barcode Scanning in the Cloud
An Algorithm for In-Place Vision-Based Skewed 1D Barcode Scanning in the Cloud
 
ShopTalk: Independent Blind Shopping Through Verbal Route Directions and Barc...
ShopTalk: Independent Blind Shopping Through Verbal Route Directions and Barc...ShopTalk: Independent Blind Shopping Through Verbal Route Directions and Barc...
ShopTalk: Independent Blind Shopping Through Verbal Route Directions and Barc...
 
Mengenal sunnah
Mengenal sunnahMengenal sunnah
Mengenal sunnah
 
테스트용
테스트용테스트용
테스트용
 
Presentatie Gastcollege Gamification voor Hogeschool Utrecht - ICT opleiding
Presentatie Gastcollege Gamification voor Hogeschool Utrecht - ICT opleidingPresentatie Gastcollege Gamification voor Hogeschool Utrecht - ICT opleiding
Presentatie Gastcollege Gamification voor Hogeschool Utrecht - ICT opleiding
 
Stadsarchief Ieper zkt. publiek
Stadsarchief Ieper zkt. publiekStadsarchief Ieper zkt. publiek
Stadsarchief Ieper zkt. publiek
 
Op weg naar het zuidland
Op weg naar het zuidlandOp weg naar het zuidland
Op weg naar het zuidland
 
Jaslin's 21 day OBS journey
Jaslin's 21 day OBS journeyJaslin's 21 day OBS journey
Jaslin's 21 day OBS journey
 
A Software Tool for Rapid Acquisition of Streetwise Geo-Referenced Maps
A Software Tool for Rapid Acquisition of Streetwise Geo-Referenced MapsA Software Tool for Rapid Acquisition of Streetwise Geo-Referenced Maps
A Software Tool for Rapid Acquisition of Streetwise Geo-Referenced Maps
 
Effective Nutrition Label Use on Smartphones
Effective Nutrition Label Use on SmartphonesEffective Nutrition Label Use on Smartphones
Effective Nutrition Label Use on Smartphones
 

Similar to Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering

Sampling and Reconstruction (Online Learning).pptx
Sampling and Reconstruction (Online Learning).pptxSampling and Reconstruction (Online Learning).pptx
Sampling and Reconstruction (Online Learning).pptx
HamzaJaved306957
 
5.3 dyn algo-i
5.3 dyn algo-i5.3 dyn algo-i
5.3 dyn algo-i
Krish_ver2
 
SP_BEE2143_C1.pptx
SP_BEE2143_C1.pptxSP_BEE2143_C1.pptx
SP_BEE2143_C1.pptx
IffahSkmd
 
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
Two Sigma
 
Dsp 1recordprophess-140720055832-phpapp01
Dsp 1recordprophess-140720055832-phpapp01Dsp 1recordprophess-140720055832-phpapp01
Dsp 1recordprophess-140720055832-phpapp01
Sagar Gore
 
Digital Signal Processing Lab Manual ECE students
Digital Signal Processing Lab Manual ECE studentsDigital Signal Processing Lab Manual ECE students
Digital Signal Processing Lab Manual ECE students
UR11EC098
 
Advanced_DSP_J_G_Proakis.pdf
Advanced_DSP_J_G_Proakis.pdfAdvanced_DSP_J_G_Proakis.pdf
Advanced_DSP_J_G_Proakis.pdf
HariPrasad314745
 
Course-Notes__Advanced-DSP.pdf
Course-Notes__Advanced-DSP.pdfCourse-Notes__Advanced-DSP.pdf
Course-Notes__Advanced-DSP.pdf
ShreeDevi42
 
Basic concepts
Basic conceptsBasic concepts
Basic concepts
Syed Zaid Irshad
 
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
Yamagishi Laboratory, National Institute of Informatics, Japan
 
Module v sp
Module v spModule v sp
Module v sp
Vijaya79
 
012675925c0f652bb179b6a33cd3d13b_MIT6_003F11_lec01.pdf
012675925c0f652bb179b6a33cd3d13b_MIT6_003F11_lec01.pdf012675925c0f652bb179b6a33cd3d13b_MIT6_003F11_lec01.pdf
012675925c0f652bb179b6a33cd3d13b_MIT6_003F11_lec01.pdf
BasavaRajeshwari2
 
Introduction of Algorithm.pdf
Introduction of Algorithm.pdfIntroduction of Algorithm.pdf
Introduction of Algorithm.pdf
LaxmiMobile1
 
Ct signal operations
Ct signal operationsCt signal operations
Ct signal operations
mihir jain
 
SIGNAL OPERATIONS
SIGNAL OPERATIONSSIGNAL OPERATIONS
SIGNAL OPERATIONS
mihir jain
 
l1.ppt
l1.pptl1.ppt
l1.ppt
ImXaib
 
Welcome to Introduction to Algorithms, Spring 2004
Welcome to Introduction to Algorithms, Spring 2004Welcome to Introduction to Algorithms, Spring 2004
Welcome to Introduction to Algorithms, Spring 2004
jeronimored
 
l1.ppt
l1.pptl1.ppt
l1.ppt
ssuser1a62e1
 
Digital Signal Processing[ECEG-3171]-Ch1_L03
Digital Signal Processing[ECEG-3171]-Ch1_L03Digital Signal Processing[ECEG-3171]-Ch1_L03
Digital Signal Processing[ECEG-3171]-Ch1_L03
Rediet Moges
 
Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...
Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...
Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...
Gota Morota
 

Similar to Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering (20)

Sampling and Reconstruction (Online Learning).pptx
Sampling and Reconstruction (Online Learning).pptxSampling and Reconstruction (Online Learning).pptx
Sampling and Reconstruction (Online Learning).pptx
 
5.3 dyn algo-i
5.3 dyn algo-i5.3 dyn algo-i
5.3 dyn algo-i
 
SP_BEE2143_C1.pptx
SP_BEE2143_C1.pptxSP_BEE2143_C1.pptx
SP_BEE2143_C1.pptx
 
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
TRIEST: Counting Local and Global Triangles in Fully-Dynamic Streams with Fix...
 
Dsp 1recordprophess-140720055832-phpapp01
Dsp 1recordprophess-140720055832-phpapp01Dsp 1recordprophess-140720055832-phpapp01
Dsp 1recordprophess-140720055832-phpapp01
 
Digital Signal Processing Lab Manual ECE students
Digital Signal Processing Lab Manual ECE studentsDigital Signal Processing Lab Manual ECE students
Digital Signal Processing Lab Manual ECE students
 
Advanced_DSP_J_G_Proakis.pdf
Advanced_DSP_J_G_Proakis.pdfAdvanced_DSP_J_G_Proakis.pdf
Advanced_DSP_J_G_Proakis.pdf
 
Course-Notes__Advanced-DSP.pdf
Course-Notes__Advanced-DSP.pdfCourse-Notes__Advanced-DSP.pdf
Course-Notes__Advanced-DSP.pdf
 
Basic concepts
Basic conceptsBasic concepts
Basic concepts
 
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
エンドツーエンド音声合成に向けたNIIにおけるソフトウェア群 ~ TacotronとWaveNetのチュートリアル (Part 1)~
 
Module v sp
Module v spModule v sp
Module v sp
 
012675925c0f652bb179b6a33cd3d13b_MIT6_003F11_lec01.pdf
012675925c0f652bb179b6a33cd3d13b_MIT6_003F11_lec01.pdf012675925c0f652bb179b6a33cd3d13b_MIT6_003F11_lec01.pdf
012675925c0f652bb179b6a33cd3d13b_MIT6_003F11_lec01.pdf
 
Introduction of Algorithm.pdf
Introduction of Algorithm.pdfIntroduction of Algorithm.pdf
Introduction of Algorithm.pdf
 
Ct signal operations
Ct signal operationsCt signal operations
Ct signal operations
 
SIGNAL OPERATIONS
SIGNAL OPERATIONSSIGNAL OPERATIONS
SIGNAL OPERATIONS
 
l1.ppt
l1.pptl1.ppt
l1.ppt
 
Welcome to Introduction to Algorithms, Spring 2004
Welcome to Introduction to Algorithms, Spring 2004Welcome to Introduction to Algorithms, Spring 2004
Welcome to Introduction to Algorithms, Spring 2004
 
l1.ppt
l1.pptl1.ppt
l1.ppt
 
Digital Signal Processing[ECEG-3171]-Ch1_L03
Digital Signal Processing[ECEG-3171]-Ch1_L03Digital Signal Processing[ECEG-3171]-Ch1_L03
Digital Signal Processing[ECEG-3171]-Ch1_L03
 
Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...
Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...
Allele Frequencies as Stochastic Processes: Mathematical & Statistical Approa...
 

Recently uploaded

一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Linda486226
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...

Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering

  • 1. Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering
    Kohei Hayashi (National Institute of Informatics), August 24, 2016
    Joint work with:
      • Takanori Maehara (Shizuoka Univ)
      • Masashi Toyoda (Univ Tokyo)
      • Ken-ichi Kawarabayashi (National Institute of Informatics)
  • 2. Twitter: The Rapid Stream of Texts
    SNS with short messages (tweets)
      • Volume: 41M users
      • Diversity: covers any topic (news, politics, TV, ...)
      • Speed: 270K tweets/min
    A promising data source for topic detection
      • May discover breaking news and events even faster than news media
  • 3. Two Challenges
      1. Topic detection in real time
      2. Noise filtering
  • 4. Noise Filtering
    Many spam tweets are not generated by humans
      • e.g. “tweet buttons”
    These tweets exaggerate co-occurrence counts and “hijack” important topics
  • 5. Contributions
    A streaming topic detection algorithm based on non-negative matrix factorization (NMF)
      1. Highly scalable: able to process a 20M×1M sparse matrix per second
      2. Automatic topic hijacking detection and elimination
    Technical Points
      1. Reformulate NMF in a stochastic manner
          • Stochastic gradient descent (SGD) updates in O(NNZ) time
      2. Use of statistical testing
          • Assume normal topics follow a power law
  • 7. Topic Detection by NMF
    [Figure: tweets → user-word co-occurrence frequency table (user × word, mostly zeros) → topics, expressed as user and word factors]
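  • 8. Problem
    $\min_{U \in \mathbb{R}_+^{I \times R},\, V \in \mathbb{R}_+^{J \times R}} f_\lambda(X; U, V)$,
    $f_\lambda(X; U, V) = \frac{1}{2} \|X - UV^\top\|_F^2 + \frac{\lambda}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right)$
    Batch Algorithm
    Repeat until convergence:
      1. $U \leftarrow [U - \eta \nabla_U f_\lambda(X; U, V)]_+$
      2. $V \leftarrow [V - \eta \nabla_V f_\lambda(X; U, V)]_+$
    Guaranteed to converge to a stationary point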
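The batch update on this slide can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names, the random initialization, and in particular the per-block step size (one over the largest eigenvalue of the block Hessian, chosen here so that the objective cannot increase) are assumptions; the slide only specifies a generic step size $\eta$.

```python
import numpy as np

def f_lambda(X, U, V, lam):
    """Regularized squared-error objective f_lambda(X; U, V)."""
    resid = X - U @ V.T
    return 0.5 * np.sum(resid**2) + 0.5 * lam * (np.sum(U**2) + np.sum(V**2))

def batch_nmf(X, R, lam=0.1, n_iter=100, seed=0):
    """Alternating projected gradient steps on U and V (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    I, J = X.shape
    U = rng.random((I, R))
    V = rng.random((J, R))
    eye = np.eye(R)
    history = [f_lambda(X, U, V, lam)]
    for _ in range(n_iter):
        # Assumed step size: 1 / spectral norm of the Hessian in U.
        eta_U = 1.0 / np.linalg.norm(V.T @ V + lam * eye, 2)
        grad_U = (U @ V.T - X) @ V + lam * U
        U = np.maximum(U - eta_U * grad_U, 0.0)   # [.]_+ projection
        # Same choice for the V block, using the updated U.
        eta_V = 1.0 / np.linalg.norm(U.T @ U + lam * eye, 2)
        grad_V = (U @ V.T - X).T @ U + lam * V
        V = np.maximum(V - eta_V * grad_V, 0.0)
        history.append(f_lambda(X, U, V, lam))
    return U, V, history
```

Every iteration touches the full matrix X, which is the cost the streaming formulation on the following slides is designed to avoid.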
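  • 9. In Twitter, we observe $X^{(t)}$ at each time $t$
      • Consider keeping track of $\bar{X}^{(t)} = \frac{1}{t} \sum_{s=1}^{t} X^{(s)}$
      • Constructing $\bar{X}^{(t)}$ and minimizing $f_\lambda(\bar{X}^{(t)}; U, V)$ is time consuming
    Key idea: decompose $f_\lambda$ for each $t$
    $\|\bar{X}^{(t)} - UV^\top\|_F^2 = \frac{1}{t} \sum_s \|X^{(s)} - UV^\top\|_F^2 + \mathrm{const}$
    Stochastic Formulation
    If $X^{(t)}$ is an i.i.d. random variable,
    $\|\mathbb{E}[X^{(t)}] - UV^\top\|_F^2 = \mathbb{E}\|X^{(t)} - UV^\top\|_F^2 - \mathrm{var}[X^{(t)}]$,
    where the variance term does not depend on $U$ or $V$. Now we can use SGD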
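One way to see where the constant in the key idea comes from is to write $M = UV^\top$ and expand around the running mean. This is a worked step added for illustration, not taken from the slides:

```latex
\begin{align*}
\frac{1}{t}\sum_{s=1}^{t} \|X^{(s)} - M\|_F^2
 &= \frac{1}{t}\sum_{s=1}^{t} \|(X^{(s)} - \bar{X}^{(t)}) + (\bar{X}^{(t)} - M)\|_F^2 \\
 &= \frac{1}{t}\sum_{s=1}^{t} \|X^{(s)} - \bar{X}^{(t)}\|_F^2
    + \|\bar{X}^{(t)} - M\|_F^2
    + \frac{2}{t}\sum_{s=1}^{t} \langle X^{(s)} - \bar{X}^{(t)},\ \bar{X}^{(t)} - M \rangle \\
 &= \frac{1}{t}\sum_{s=1}^{t} \|X^{(s)} - \bar{X}^{(t)}\|_F^2
    + \|\bar{X}^{(t)} - M\|_F^2 .
\end{align*}
```

The cross term vanishes because $\sum_s (X^{(s)} - \bar{X}^{(t)}) = 0$. The constant is therefore $-\frac{1}{t}\sum_s \|X^{(s)} - \bar{X}^{(t)}\|_F^2$, which does not involve $U$ or $V$, so minimizing the per-step average is equivalent to minimizing the objective on $\bar{X}^{(t)}$.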
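  • 10. Streaming Algorithm
    For $t = 1, \ldots, T$:
      1. $A_U \leftarrow \nabla_U \nabla_U f_\lambda$, $A_V \leftarrow \nabla_V \nabla_V f_\lambda$ (metrics)
      2. $U^{(t)} \leftarrow [U^{(t-1)} - \eta_t \nabla_U f_\lambda(X^{(t)}; U, V^{(t-1)}) A_U^{-1}]_+$
      3. $V^{(t)} \leftarrow [V^{(t-1)} - \eta_t \nabla_V f_\lambda(X^{(t)}; U^{(t)}, V) A_V^{-1}]_+$
    Guaranteed to converge to the stationary points of $f_\lambda(\bar{X}^{(t)}; U, V)$ for suitable $\{\eta_t\}$ and i.i.d. $\{X^{(t)}\}$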
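A single iteration of this loop might look like the following NumPy sketch. The function name and dense arrays are illustrative assumptions (a real stream would use sparse matrices). The explicit metrics $A_U = V^\top V + \lambda I$ and $A_V = U^\top U + \lambda I$ are derived here from the objective $f_\lambda$ rather than quoted from the slides, and evaluating both at the previous iterates in step 1 is one reading of the slide.

```python
import numpy as np

def streaming_step(X_t, U, V, lam, eta_t):
    """One streaming update of (U, V) from a single observation X_t (sketch)."""
    R = U.shape[1]
    eye = np.eye(R)
    # Step 1: metrics (Hessians of f_lambda), evaluated at the previous iterates.
    A_U = V.T @ V + lam * eye
    A_V = U.T @ U + lam * eye
    # Step 2: scaled gradient step on U, then projection onto U >= 0.
    grad_U = (U @ V.T - X_t) @ V + lam * U
    # A_U is symmetric, so solve(A_U, G.T).T equals G @ inv(A_U).
    U_new = np.maximum(U - eta_t * np.linalg.solve(A_U, grad_U.T).T, 0.0)
    # Step 3: same for V, using the freshly updated U in the gradient.
    grad_V = (U_new @ V.T - X_t).T @ U_new + lam * V
    V_new = np.maximum(V - eta_t * np.linalg.solve(A_V, grad_V.T).T, 0.0)
    return U_new, V_new
```

Each observation is used once and can then be discarded, so the per-step cost depends on the size of `X_t` rather than on the accumulated history.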
  • 11. Comparison with Batch NMF
    Much faster
      • Updates U and V only once per observation
      • Update cost depends on $\mathrm{NNZ}(X^{(t)}) \ll \mathrm{NNZ}(\bar{X}^{(t)})$
    Memory efficient
      • Able to discard $X^{(1)}, \ldots, X^{(t-1)}$
    Smoothing effect
    $U^{(t)} = [(1 - \eta_t) U^{(t-1)} + \eta_t X^{(t)} V^{(t-1)} A_U^{-1}]_+ = [(1 - \eta_t) U^{(t-1)} + \eta_t \operatorname{argmin}_U f_\lambda(X^{(t)}; U, V^{(t-1)})]_+$
      • The $\eta_t$-weighted average of the previous solution and the 1-step NMF solution for $X^{(t)}$
      • Mitigates the sparsity of $X^{(t)}$
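The smoothing identity follows in two lines once the metric is written out. The explicit form $A_U = V^\top V + \lambda I$ below is derived from $f_\lambda$ and is not stated on the slides; $V$ stands for $V^{(t-1)}$ and $U$ for $U^{(t-1)}$:

```latex
\begin{align*}
\nabla_U f_\lambda(X^{(t)}; U, V)
 &= (UV^\top - X^{(t)})V + \lambda U
  = U A_U - X^{(t)} V, \qquad A_U = V^\top V + \lambda I, \\
U - \eta_t \nabla_U f_\lambda(X^{(t)}; U, V)\, A_U^{-1}
 &= U - \eta_t \left( U - X^{(t)} V A_U^{-1} \right)
  = (1 - \eta_t)\, U + \eta_t\, X^{(t)} V A_U^{-1} .
\end{align*}
```

Setting the gradient to zero gives $U = X^{(t)} V A_U^{-1}$, the unconstrained minimizer of $f_\lambda(X^{(t)}; U, V)$ in $U$, which is the "1-step NMF solution" in the second form of the identity.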
  • 13. Problem Setting
    Goal: find hijacked topics
    Idea: check the distributions of all topics
    Hypothesis: the word distribution of a hijacked topic should differ from the word distribution of a normal topic
      • NMF estimates topic-specific word distributions as V:
        $X_{ij} \propto p(\mathrm{user}_i, \mathrm{word}_j) \propto \sum_r p(\mathrm{topic}_r)\, p(\mathrm{user}_i \mid \mathrm{topic}_r)\, p(\mathrm{word}_j \mid \mathrm{topic}_r) \propto \sum_r u_{ir} v_{jr}$
  • 14. Defining Normal/Hijacked Topics
    Normal topics:
      • Many users are involved ⇒ many different vocabularies are mixed ⇒ heavy-tailed (Zipf’s law)
    Definition (normal topic): topic $r$ is normal if $p(\mathrm{rank}(\mathrm{word}) \mid \mathrm{topic}_r) = \mathrm{power}(\alpha)$
  • 15. Defining Normal/Hijacked Topics (Cont’d)
    Hijacked topics:
      • The same word set is used repeatedly ⇒ uniform probabilities, (almost) no tail
    Definition (hijacked topic): topic $r$ is hijacked if $p(\mathrm{rank}(\mathrm{word}) \mid \mathrm{topic}_r) = \mathrm{step}(\mathrm{rank}_{\min})$
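  • 16. Log-likelihood Ratio Test
    $L(\mathrm{rank}_{\min}) = \sum_j \log \frac{\mathrm{step}(\mathrm{rank}_j \mid \mathrm{rank}_{\min})}{\mathrm{power}(\mathrm{rank}_j \mid \hat{\alpha})}$
    Theorem (asymptotic normality [Vuong ’89]): let $N$ be the number of observed words. Then $L(\mathrm{rank}_{\min}) / \sqrt{N}$ converges in distribution to $\mathcal{N}(0, \sigma^2)$, where
    $\sigma^2 = \frac{1}{N} \sum_{j=1}^{N} \left( \log \frac{\mathrm{step}(\mathrm{rank}_j \mid \mathrm{rank}_{\min})}{\mathrm{power}(\mathrm{rank}_j \mid \hat{\alpha})} \right)^2 - \left( \frac{1}{N} L(\mathrm{rank}_{\min}) \right)^2$.
    For $r = 1, \ldots, R$:
      • Estimate $\hat{\alpha} = \operatorname{argmax}_\alpha \log \mathrm{power}(\mathrm{rank}(u_k) \mid \alpha)$
      • For $\mathrm{rank}_{\min} = 1, \ldots, 140$:
          • Compute $L(\mathrm{rank}_{\min})$
          • Topic $r$ is hijacked if p-val < 0.05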
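The test loop above can be sketched with the Python standard library alone. The slides do not give the exact form of the two distributions, so several choices here are assumptions made only for illustration: a truncated power law over ranks 1..K, a step distribution that spreads a small mass `eps` over the tail so that log-likelihoods stay finite, a grid search for $\hat{\alpha}$, and a one-sided p-value in favor of the step model. All function names are hypothetical.

```python
import math
from statistics import NormalDist

def power_logpmf(alpha, K):
    """Log-probabilities of ranks 1..K under p(k) proportional to k^(-alpha)."""
    log_z = math.log(sum(k ** -alpha for k in range(1, K + 1)))
    return [-alpha * math.log(k) - log_z for k in range(1, K + 1)]

def step_logpmf(rank_min, K, eps=1e-3):
    """Assumed step model: uniform on 1..rank_min, mass eps spread over the tail."""
    if rank_min >= K:
        return [-math.log(K)] * K
    head = math.log((1.0 - eps) / rank_min)
    tail = math.log(eps / (K - rank_min))
    return [head if k <= rank_min else tail for k in range(1, K + 1)]

def fit_alpha(ranks, K):
    """Maximum-likelihood alpha over a coarse grid (assumed estimator)."""
    grid = [0.1 * i for i in range(1, 51)]
    def loglik(alpha):
        table = power_logpmf(alpha, K)
        return sum(table[k - 1] for k in ranks)
    return max(grid, key=loglik)

def vuong_pvalue(ranks, step_table, power_table):
    """One-sided p-value for 'step fits better than power' via L / sqrt(N)."""
    d = [step_table[k - 1] - power_table[k - 1] for k in ranks]
    n = len(d)
    L = sum(d)                                   # log-likelihood ratio
    sigma2 = sum(x * x for x in d) / n - (L / n) ** 2
    if sigma2 <= 0.0:
        return 1.0                               # degenerate case: no evidence
    z = L / math.sqrt(n * sigma2)
    return 1.0 - NormalDist().cdf(z)

def is_hijacked(ranks, K, max_rank_min=140, level=0.05):
    """Flag a topic if any rank_min in 1..max_rank_min rejects the power law."""
    power_table = power_logpmf(fit_alpha(ranks, K), K)
    return any(
        vuong_pvalue(ranks, step_logpmf(m, K), power_table) < level
        for m in range(1, min(max_rank_min, K) + 1)
    )
```

Here `ranks` is the list of observed word ranks for one topic and `K` is the vocabulary size. Scanning 140 values of `rank_min` without a multiple-testing correction mirrors the loop on the slide.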
  • 18. Data
    Japanese Twitter stream
      • April 15–16, 2013
      • 417K users
      • 1.98M words
      • 15.3M tweets
      • 69.4M co-occurrences
      • Generated one $X^{(t)}$ per 10K co-occurrences
  • 19. Runtime
      NMF method              1% data   10% data   100% data
      Batch                   27.0m     1.9h       16.4h
      Online [Cao+ 07]        5.6h      9.2h       17.8h
      Dynamic [Saha+ 12]      16.7h     24h        24h
      Streaming [this work]   4.0m      21.7m      3.6h
      w/ Filter [this work]   5.3m      24.1m      3.8h
    Streaming NMF: ×5–250 faster!
      • 67K tweets/min
      • Fast enough to process all Japanese tweets in real time (Japanese is the 2nd most popular language on Twitter)
    Filtering cost is negligible
  • 20. Perplexity
      • Similarity between a topic distribution and a target distribution
      • Used Y! headlines [1] as the target term distribution
      • Lower is better
      NMF method              1% data    10% data   100% data
      Batch                   9.01E+09   6.32E+06   4.51E+04
      Online [Cao+ 07]        1.06E+04   3.25E+04   1.83E+04
      Dynamic [Saha+ 12]      6.53E+04   N/A        N/A
      Streaming [this work]   5.65E+07   3.25E+04   8.71E+03
      w/ Filter [this work]   5.25E+09   2.40E+04   7.90E+03
    Streaming NMF: best at 10% and 100% data
    Topic hijacking filtering improves perplexity
    [1] http://news.yahoo.co.jp/list
  • 21. Summary
    Proposed a streaming algorithm for Twitter topic detection
      • Works in real time (would handle all Japanese tweets in theory)
      • Filters spam topics automatically
    Code: github.com/hayasick/streamingNMF
    Thank you!
  • 22. Integrated Twitter Topic Detection System
    For $t = 1, \ldots, T$:
      1. Generate $X^{(t)}$ from tweets ∉ Blacklist: $O(N_t)$
      2. Update U, V by SGD: $O(N_t R^2 + R^3)$
      3. At some interval, detect topic hijacking and update Blacklist: $O(J)$