3. Text extraction
Input: raw user activities logs in JSON
Output: extracted text and metadata
In-between:
Unified data collection pipeline: Kafka + Hadoop + Samza
Different types of objects: posts, photos, videos, comments.
Large volumes: 50 GB of raw data daily, 20 GB after extraction
Initial filtering applied: overly short documents are removed
4. Language detection
Input: single extracted text
Output: text labeled with language
In-between:
Based on an open-source library:
https://github.com/optimaize/language-detector
Math is built on top of trigram distributions; 70+ languages supported
Custom language profiles added for:
Azerbaijani, Armenian, Georgian, Kazakh, Kyrgyz, Tajik, Turkmen, Uzbek
https://github.com/denniean/language_profiles
Language distribution priors are important!
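A minimal sketch of the detector setup, following the optimaize library's documented builder API. The custom profiles from the second repo load through the same LanguageProfileReader (loading them is omitted here); the sample text is arbitrary:

```java
import com.google.common.base.Optional;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObject;
import com.optimaize.langdetect.text.TextObjectFactory;

import java.util.List;

public class LangDetectExample {
    public static void main(String[] args) throws Exception {
        // Load the built-in trigram profiles (70+ languages).
        List<LanguageProfile> profiles = new LanguageProfileReader().readAllBuiltIn();

        LanguageDetector detector = LanguageDetectorBuilder
                .create(NgramExtractors.standard())
                .withProfiles(profiles)
                .build();

        TextObjectFactory factory = CommonTextObjectFactories.forDetectingOnLargeText();
        TextObject text = factory.forText("Это пример текста на русском языке");

        Optional<LdLocale> lang = detector.detect(text);
        System.out.println(lang.isPresent() ? lang.get().getLanguage() : "unknown");
    }
}
```

As the slide notes, raw detection quality depends heavily on realistic language distribution priors for the incoming traffic, which this basic setup does not encode.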
5. Tokenization and canonization
Input: text with language label
Output: tokens stream
In-between:
Apache Lucene Analyzers (tokenization, stop words removal, stemming)
Profiles for 23 languages available, including Russian, Armenian, and Latvian.
Most ex-USSR languages still missing: Azerbaijani, Belarusian, Georgian,
Kazakh, Kyrgyz, Tajik, Turkmen, Ukrainian, Uzbek etc.
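A short sketch of how a Lucene analyzer turns text into a token stream; the field name "body" is arbitrary, and RussianAnalyzer is one of the 23 language profiles mentioned above (it bundles tokenization, stop-word removal, and stemming):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.util.ArrayList;
import java.util.List;

public class TokenizeExample {
    // Turn raw text into a stream of stemmed tokens with stop words removed.
    static List<String> tokenize(Analyzer analyzer, String text) throws Exception {
        List<String> tokens = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                tokens.add(term.toString());
            }
            ts.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(tokenize(new RussianAnalyzer(), "Кошки любят молоко"));
    }
}
```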
6. Dictionary extraction
Input: corpus as a set of token streams
Output: words index (dictionary)
In-between:
Term frequency limits for inclusion
Previous day's dictionary analyzed to keep indices for common tokens the same
Large enough to capture multiple languages (1M+)
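A minimal sketch of the index-stability idea, assuming term counts and yesterday's dictionary are plain in-memory maps (names are hypothetical, and index recycling for dropped terms is ignored):

```java
import java.util.HashMap;
import java.util.Map;

public class DictionaryBuilder {
    /**
     * Build today's dictionary: terms passing the frequency limits keep
     * yesterday's index when they had one, new terms get fresh indices.
     */
    static Map<String, Integer> build(Map<String, Long> termFreqs,
                                      Map<String, Integer> previousDay,
                                      long minFreq, long maxFreq) {
        Map<String, Integer> dict = new HashMap<>();
        int next = previousDay.values().stream().max(Integer::compare).orElse(-1) + 1;
        for (Map.Entry<String, Long> e : termFreqs.entrySet()) {
            long f = e.getValue();
            if (f < minFreq || f > maxFreq) continue; // frequency limits for inclusion
            Integer old = previousDay.get(e.getKey());
            dict.put(e.getKey(), old != null ? old : next++);
        }
        return dict;
    }
}
```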
7. Vectorization
Input: tokens stream and dictionary
Output: sparse vector
In-between:
Raw term frequency vectorization
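Raw term-frequency vectorization is a straightforward counting pass; a sketch, assuming the sparse vector is represented as an index-to-count map over the dictionary indices:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Vectorizer {
    /** Raw term-frequency vector: dictionary index -> count. */
    static Map<Integer, Double> vectorize(List<String> tokens, Map<String, Integer> dictionary) {
        Map<Integer, Double> vector = new HashMap<>();
        for (String token : tokens) {
            Integer idx = dictionary.get(token); // out-of-dictionary tokens are dropped
            if (idx != null) vector.merge(idx, 1.0, Double::sum);
        }
        return vector;
    }
}
```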
8. Deduplication
Input: corpus as a set of vectors
Output: corpus with duplicates removed
In-between:
Cosine as similarity measure (>0.9 => duplicates)
Random projection hashing to speed up calculation
18-bit hash, 50% basis sparsity
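A sketch of the random projection hash under the stated parameters (18 bits, 50% of basis entries zeroed); the dense basis array is for clarity only, since a real 1M+-dimensional dictionary would call for a sparse or seed-derived representation:

```java
import java.util.Map;
import java.util.Random;

public class RandomProjectionHash {
    static final int BITS = 18;          // 2^18 buckets
    static final double SPARSITY = 0.5;  // 50% of basis entries are zero

    final double[][] basis; // BITS random hyperplanes over the dictionary space

    RandomProjectionHash(int dim, long seed) {
        Random rnd = new Random(seed);
        basis = new double[BITS][dim];
        for (int b = 0; b < BITS; b++)
            for (int d = 0; d < dim; d++)
                basis[b][d] = rnd.nextDouble() < SPARSITY
                        ? 0.0
                        : (rnd.nextBoolean() ? 1.0 : -1.0);
    }

    /** The sign of the projection on each hyperplane gives one bit of the hash. */
    int hash(Map<Integer, Double> vector) {
        int h = 0;
        for (int b = 0; b < BITS; b++) {
            double dot = 0.0;
            for (Map.Entry<Integer, Double> e : vector.entrySet())
                dot += basis[b][e.getKey()] * e.getValue();
            if (dot > 0) h |= 1 << b;
        }
        return h;
    }
}
```

Only vectors whose hashes collide become candidate pairs, which are then verified with the exact cosine similarity (>0.9 => duplicates).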
9. Current day statistics
Input: filtered corpus as a set of token streams
Output: for each term and 2-gram, the % of documents it was used in
In-between:
2-gram addition
Aggregation
Filtration by absolute thresholds
Different limits for terms and 2-grams
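A sketch of the aggregation, assuming documents arrive as token lists; 2-grams are formed from adjacent tokens, and the separate absolute thresholds for terms vs 2-grams would be applied to the result afterwards (not shown):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DayStats {
    /** Fraction of documents each term / 2-gram occurs in. */
    static Map<String, Double> docFrequencies(List<List<String>> corpus) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : corpus) {
            Set<String> seen = new HashSet<>(doc);                  // terms
            for (int i = 0; i + 1 < doc.size(); i++)
                seen.add(doc.get(i) + " " + doc.get(i + 1));        // 2-gram addition
            for (String gram : seen) counts.merge(gram, 1, Integer::sum);
        }
        Map<String, Double> freqs = new HashMap<>();
        int n = corpus.size();
        counts.forEach((gram, c) -> freqs.put(gram, (double) c / n));
        return freqs;
    }
}
```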
10. Accumulated state aggregation
Input: current day statistics, previous day accumulated state
Output: Exponentially weighted moving average and variance
for terms and 2-grams (new accumulated state)
In-between:
Inclusion limit > exclusion limit (hysteresis: terms do not flicker in and out of the state)
Different limits for terms and 2-grams
ewma_i = (1 − a)·ewma_{i−1} + a·freq_i
ewmv_i = (1 − a)·(ewmv_{i−1} + a·(freq_i − ewma_{i−1})²)
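A direct transcription of the two update formulas; the only subtlety is that the variance must be updated before the average, because it uses ewma_{i−1}:

```java
public class AccumulatedState {
    double ewma; // exponentially weighted moving average
    double ewmv; // exponentially weighted moving variance

    /** One day's update with learning rate a (0 < a < 1). */
    void update(double freq, double a) {
        double delta = freq - ewma;              // freq_i - ewma_{i-1}
        ewmv = (1 - a) * (ewmv + a * delta * delta);
        ewma = (1 - a) * ewma + a * freq;
    }
}
```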
11. Trending terms identification
Input: Exponentially weighted moving average and
variance for terms and 2-grams
Output: Trending terms and 2-grams with significance
In-between:
sig_i = max(0, (freq_i − max(b, ewma_i)) / (ewmv_i + b))
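The significance score transcribed directly; b is the bias term that damps scores for rare terms with near-zero averages and variances:

```java
public class Significance {
    /** Significance of today's frequency against the accumulated state. */
    static double sig(double freq, double ewma, double ewmv, double b) {
        return Math.max(0.0, (freq - Math.max(b, ewma)) / (ewmv + b));
    }
}
```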
13. Trending terms clustering
Input: list of trending terms, corpus as a set of token streams
Output: trending terms grouped into clusters with a high level of co-occurrence
In-between:
Term-term matrix of normalized pointwise mutual information
DBSCAN clustering (ELKI implementation) with cosine distance
npmi(x; y) = log( p(x, y) / (p(x)·p(y)) ) / (−log p(x, y))
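NPMI ranges from −1 (never together) through 0 (independent) to 1 (always together), which makes the term-term matrix directly usable with a cosine-style distance in DBSCAN. A transcription of the formula, assuming probabilities are estimated from document frequencies:

```java
public class Npmi {
    /** Normalized pointwise mutual information; px = p(x), py = p(y), pxy = p(x, y). */
    static double npmi(double px, double py, double pxy) {
        if (pxy == 0.0) return -1.0;  // never co-occur: minimum NPMI by convention
        // assumes 0 < pxy < 1, so the denominator is strictly positive
        return Math.log(pxy / (px * py)) / -Math.log(pxy);
    }
}
```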
15. Relevant documents extraction
Input: identified trending term clusters, corpus as a set of
token streams
Output: set of relevant documents and a “spamminess” level
for each cluster
In-between:
For each document find the most relevant cluster by counting terms
For each cluster select top liked documents
Count unique users/groups/IPs relative to overall count
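A sketch of the term-counting assignment step (the names and the tie-breaking are hypothetical); selecting top-liked documents and computing the unique-source ratio for spamminess would follow as aggregations per cluster:

```java
import java.util.List;
import java.util.Set;

public class ClusterAssignment {
    /** Pick the cluster whose trending terms the document mentions most often. */
    static int mostRelevantCluster(List<String> docTokens, List<Set<String>> clusters) {
        int best = -1, bestHits = 0;
        for (int c = 0; c < clusters.size(); c++) {
            int hits = 0;
            for (String token : docTokens)
                if (clusters.get(c).contains(token)) hits++;
            if (hits > bestHits) { bestHits = hits; best = c; }
        }
        return best; // -1 if the document matches no cluster
    }
}
```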
16. Results visualization
Input: trending terms clusters with relevant documents
Output: Nice interactive visualization
In-between:
Add navigation for dates and clusters
Extract geo location for each document
Plot on an interactive map
Display details on hover
19. Need for speed!
Trends are valuable only while they are still trending
Daily batch processing is inherently lagging
Alternatives:
Mini-batch
Streaming!
Lambda architecture
21. Not yet there!
Visualizing just trending terms is not informative
Clustering required
Relevant documents extraction required
Mini-batch model is more appropriate here
24. Technologies used
Apache Kafka for data collection
Apache YARN for resource negotiation
Apache Spark for batch and mini-batch processing
Apache Samza for streaming processing
Apache Lucene for text preprocessing
Optimaize language-detector
ELKI for clustering
25. More links
Language-detector https://github.com/optimaize/language-detector
Extra profiles https://github.com/denniean/language_profiles
Trends math http://www.dbs.ifi.lmu.de/Publikationen/Papers/KDD14-SigniTrend-preprint.pdf
NPMI
https://en.wikipedia.org/wiki/Pointwise_mutual_information
DBSCAN https://en.wikipedia.org/wiki/DBSCAN