3. Text extraction
Input: raw user activities logs in JSON
Output: extracted text and metadata
In-between:
Unified data collection pipeline: Kafka + Hadoop + Samza
Different types of objects: posts, photos, videos, comments.
Large volumes: 50 GB of raw data daily, 20 GB after extraction
Initial filtering applied: overly short documents are removed
4. Language detection
Input: single extracted text
Output: text labeled with language
In-between:
Based on an open-source library:
https://github.com/optimaize/language-detector
Math is built on top of trigram distributions; 70+ languages supported
Custom language profiles added for:
Azerbaijani, Armenian, Georgian, Kazakh, Kyrgyz, Tajik, Turkmen, Uzbek
https://github.com/denniean/language_profiles
Language distribution priors are important!
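A minimal sketch of the detector setup, following the optimaize library's documented builder API. The custom profiles from the second repo load through the same LanguageProfileReader (loading them is omitted here); the sample text is arbitrary:

```java
import com.google.common.base.Optional;
import com.optimaize.langdetect.LanguageDetector;
import com.optimaize.langdetect.LanguageDetectorBuilder;
import com.optimaize.langdetect.i18n.LdLocale;
import com.optimaize.langdetect.ngram.NgramExtractors;
import com.optimaize.langdetect.profiles.LanguageProfile;
import com.optimaize.langdetect.profiles.LanguageProfileReader;
import com.optimaize.langdetect.text.CommonTextObjectFactories;
import com.optimaize.langdetect.text.TextObject;
import com.optimaize.langdetect.text.TextObjectFactory;

import java.util.List;

public class LangDetectExample {
    public static void main(String[] args) throws Exception {
        // Load the built-in trigram profiles (70+ languages).
        List<LanguageProfile> profiles = new LanguageProfileReader().readAllBuiltIn();

        LanguageDetector detector = LanguageDetectorBuilder
                .create(NgramExtractors.standard())
                .withProfiles(profiles)
                .build();

        TextObjectFactory factory = CommonTextObjectFactories.forDetectingOnLargeText();
        TextObject text = factory.forText("Это пример текста на русском языке");

        Optional<LdLocale> lang = detector.detect(text);
        System.out.println(lang.isPresent() ? lang.get().getLanguage() : "unknown");
    }
}
```

As the slide notes, raw detection quality depends heavily on realistic language distribution priors for the incoming traffic, which this basic setup does not encode.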
5. Tokenization and canonization
Input: text with language label
Output: tokens stream
In-between:
Apache Lucene Analyzers (tokenization, stop words removal, stemming)
Profiles for 23 languages available, including Russian, Armenian, and Latvian.
Most ex-USSR languages still missing: Azerbaijani, Belarusian, Georgian,
Kazakh, Kyrgyz, Tajik, Turkmen, Ukrainian, Uzbek etc.
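A short sketch of how a Lucene analyzer turns text into a token stream; the field name "body" is arbitrary, and RussianAnalyzer is one of the 23 language profiles mentioned above (it bundles tokenization, stop-word removal, and stemming):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.util.ArrayList;
import java.util.List;

public class TokenizeExample {
    // Turn raw text into a stream of stemmed tokens with stop words removed.
    static List<String> tokenize(Analyzer analyzer, String text) throws Exception {
        List<String> tokens = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                tokens.add(term.toString());
            }
            ts.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(tokenize(new RussianAnalyzer(), "Кошки любят молоко"));
    }
}
```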
6. Dictionary extraction
Input: corpus as a set of token streams
Output: words index (dictionary)
In-between:
Term frequency limits for inclusion
Previous day's dictionary analyzed to keep indices for common tokens the same
Large enough to capture multiple languages (1M+)
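A minimal sketch of the index-stability idea, assuming term counts and yesterday's dictionary are plain in-memory maps (names are hypothetical, and index recycling for dropped terms is ignored):

```java
import java.util.HashMap;
import java.util.Map;

public class DictionaryBuilder {
    /**
     * Build today's dictionary: terms passing the frequency limits keep
     * yesterday's index when they had one, new terms get fresh indices.
     */
    static Map<String, Integer> build(Map<String, Long> termFreqs,
                                      Map<String, Integer> previousDay,
                                      long minFreq, long maxFreq) {
        Map<String, Integer> dict = new HashMap<>();
        int next = previousDay.values().stream().max(Integer::compare).orElse(-1) + 1;
        for (Map.Entry<String, Long> e : termFreqs.entrySet()) {
            long f = e.getValue();
            if (f < minFreq || f > maxFreq) continue; // frequency limits for inclusion
            Integer old = previousDay.get(e.getKey());
            dict.put(e.getKey(), old != null ? old : next++);
        }
        return dict;
    }
}
```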
7. Vectorization
Input: tokens stream and dictionary
Output: sparse vector
In-between:
Raw term frequency vectorization
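Raw term-frequency vectorization is a straightforward counting pass; a sketch, assuming the sparse vector is represented as an index-to-count map over the dictionary indices:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Vectorizer {
    /** Raw term-frequency vector: dictionary index -> count. */
    static Map<Integer, Double> vectorize(List<String> tokens, Map<String, Integer> dictionary) {
        Map<Integer, Double> vector = new HashMap<>();
        for (String token : tokens) {
            Integer idx = dictionary.get(token); // out-of-dictionary tokens are dropped
            if (idx != null) vector.merge(idx, 1.0, Double::sum);
        }
        return vector;
    }
}
```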
8. Deduplication
Input: corpus as a set of vectors
Output: corpus with duplicates removed
In-between:
Cosine as similarity measure (>0.9 => duplicates)
Random projection hashing to speed up calculation
18-bit hash, 50% basis sparsity
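A sketch of the random projection hash under the stated parameters (18 bits, 50% of basis entries zeroed); the dense basis array is for clarity only, since a real 1M+-dimensional dictionary would call for a sparse or seed-derived representation:

```java
import java.util.Map;
import java.util.Random;

public class RandomProjectionHash {
    static final int BITS = 18;          // 2^18 buckets
    static final double SPARSITY = 0.5;  // 50% of basis entries are zero

    final double[][] basis; // BITS random hyperplanes over the dictionary space

    RandomProjectionHash(int dim, long seed) {
        Random rnd = new Random(seed);
        basis = new double[BITS][dim];
        for (int b = 0; b < BITS; b++)
            for (int d = 0; d < dim; d++)
                basis[b][d] = rnd.nextDouble() < SPARSITY
                        ? 0.0
                        : (rnd.nextBoolean() ? 1.0 : -1.0);
    }

    /** The sign of the projection on each hyperplane gives one bit of the hash. */
    int hash(Map<Integer, Double> vector) {
        int h = 0;
        for (int b = 0; b < BITS; b++) {
            double dot = 0.0;
            for (Map.Entry<Integer, Double> e : vector.entrySet())
                dot += basis[b][e.getKey()] * e.getValue();
            if (dot > 0) h |= 1 << b;
        }
        return h;
    }
}
```

Only vectors whose hashes collide become candidate pairs, which are then verified with the exact cosine similarity (>0.9 => duplicates).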
9. Current day statistics
Input: filtered corpus as a set of token streams
Output: for each term and 2-gram, the % of documents it was used in
In-between:
2-gram addition
Aggregation
Filtration by absolute thresholds
Different limits for terms and 2-grams
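A sketch of the aggregation, assuming documents arrive as token lists; 2-grams are formed from adjacent tokens, and the separate absolute thresholds for terms vs 2-grams would be applied to the result afterwards (not shown):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DayStats {
    /** Fraction of documents each term / 2-gram occurs in. */
    static Map<String, Double> docFrequencies(List<List<String>> corpus) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : corpus) {
            Set<String> seen = new HashSet<>(doc);                  // terms
            for (int i = 0; i + 1 < doc.size(); i++)
                seen.add(doc.get(i) + " " + doc.get(i + 1));        // 2-gram addition
            for (String gram : seen) counts.merge(gram, 1, Integer::sum);
        }
        Map<String, Double> freqs = new HashMap<>();
        int n = corpus.size();
        counts.forEach((gram, c) -> freqs.put(gram, (double) c / n));
        return freqs;
    }
}
```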
10. Accumulated state aggregation
Input: current day statistics, previous day accumulated state
Output: Exponentially weighted moving average and variance
for terms and 2-grams (new accumulated state)
In-between:
Inclusion limit > exclusion limit (hysteresis: terms do not flicker in and out of the state)
Different limits for terms and 2-grams
ewma_i = (1 − a)·ewma_{i−1} + a·freq_i
ewmv_i = (1 − a)·(ewmv_{i−1} + a·(freq_i − ewma_{i−1})²)
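A direct transcription of the two update formulas; the only subtlety is that the variance must be updated before the average, because it uses ewma_{i−1}:

```java
public class AccumulatedState {
    double ewma; // exponentially weighted moving average
    double ewmv; // exponentially weighted moving variance

    /** One day's update with learning rate a (0 < a < 1). */
    void update(double freq, double a) {
        double delta = freq - ewma;              // freq_i - ewma_{i-1}
        ewmv = (1 - a) * (ewmv + a * delta * delta);
        ewma = (1 - a) * ewma + a * freq;
    }
}
```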
11. Trending terms identification
Input: Exponentially weighted moving average and
variance for terms and 2-grams
Output: Trending terms and 2-grams with significance
In-between:
sig_i = max(0, (freq_i − max(b, ewma_i)) / (ewmv_i + b))
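The significance score transcribed directly; b is the bias term that damps scores for rare terms with near-zero averages and variances:

```java
public class Significance {
    /** Significance of today's frequency against the accumulated state. */
    static double sig(double freq, double ewma, double ewmv, double b) {
        return Math.max(0.0, (freq - Math.max(b, ewma)) / (ewmv + b));
    }
}
```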
13. Trending terms clustering
Input: list of trending terms, corpus as a set of token streams
Output: trending terms grouped into clusters with a high level of co-occurrence
In-between:
Term-term matrix of normalized pointwise mutual information
DBSCAN clustering (ELKI implementation) with cosine distance
npmi(x; y) = log( p(x, y) / (p(x)·p(y)) ) / (−log p(x, y))
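NPMI ranges from −1 (never together) through 0 (independent) to 1 (always together), which makes the term-term matrix directly usable with a cosine-style distance in DBSCAN. A transcription of the formula, assuming probabilities are estimated from document frequencies:

```java
public class Npmi {
    /** Normalized pointwise mutual information; px = p(x), py = p(y), pxy = p(x, y). */
    static double npmi(double px, double py, double pxy) {
        if (pxy == 0.0) return -1.0;  // never co-occur: minimum NPMI by convention
        // assumes 0 < pxy < 1, so the denominator is strictly positive
        return Math.log(pxy / (px * py)) / -Math.log(pxy);
    }
}
```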
15. Relevant documents extraction
Input: identified trending term clusters, corpus as a set of
token streams
Output: set of relevant documents and a “spamminess” level
for each cluster
In-between:
For each document find the most relevant cluster by counting terms
For each cluster select top liked documents
Count unique users/groups/IPs relative to overall count
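A sketch of the term-counting assignment step (the names and the tie-breaking are hypothetical); selecting top-liked documents and computing the unique-source ratio for spamminess would follow as aggregations per cluster:

```java
import java.util.List;
import java.util.Set;

public class ClusterAssignment {
    /** Pick the cluster whose trending terms the document mentions most often. */
    static int mostRelevantCluster(List<String> docTokens, List<Set<String>> clusters) {
        int best = -1, bestHits = 0;
        for (int c = 0; c < clusters.size(); c++) {
            int hits = 0;
            for (String token : docTokens)
                if (clusters.get(c).contains(token)) hits++;
            if (hits > bestHits) { bestHits = hits; best = c; }
        }
        return best; // -1 if the document matches no cluster
    }
}
```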
16. Results visualization
Input: trending terms clusters with relevant documents
Output: Nice interactive visualization
In-between:
Add navigation for dates and clusters
Extract geo location for each document
Plot on an interactive map
Display details on hover
19. Need for speed!
Trends are valuable only while they are still trending
Daily batch processing is inherently lagging
Alternatives:
Mini-batch
Streaming!
Lambda architecture
21. Not yet there!
Visualizing just trending terms is not informative
Clustering required
Relevant documents extraction required
Mini-batch model is more appropriate here
24. Technologies used
Apache Kafka for data collection
Apache YARN for resource negotiation
Apache Spark for batch and mini-batch processing
Apache Samza for streaming processing
Apache Lucene for text preprocessing
Optimaize language-detector
ELKI for clustering
25. More links
Language-detector https://github.com/optimaize/language-detector
Extra profiles https://github.com/denniean/language_profiles
Trends math http://www.dbs.ifi.lmu.de/Publikationen/Papers/KDD14-SigniTrend-preprint.pdf
NPMI
https://en.wikipedia.org/wiki/Pointwise_mutual_information
DBSCAN https://en.wikipedia.org/wiki/DBSCAN