Twitter + Lambda Architecture (Spark, Kafka, FLume, Cassandra) + Machine Learning

Twitter content-based Recommendation System
Barcelona Tourist City Monitor & Insights
01.07.2016
#MACHINELEARNING #SPARK #KAFKA #CASSANDRA
Juan Pablo López
Rodica Fazakas
Yulia Zvyagelskaya
Beatriz Martín
BIG DATA MANAGEMENT AND ANALYTICS
POSTGRADUATE COURSE - FINAL PROJECT

The Challenge
Build content-based recommendation system to provide real-time personalized
recommendations to Social Media users and insights visualization for touristic and
smart city sector
The product is addressed to:
● Middle and small companies connected to touristic sector, both of B2B&B2C
model (Leisure/Travel, Tour operators, tourist online portals, Retail, HoReCa,
etc.)
● City and neighborhood public departments and administrations
● Event agencies and managers
● Advertising and marketing agencies

The Challenge
Businesses continue investing budgets to Social Media targeted advertising:

The Challenge
Aims of the project:
● Twitter data collection and management
● Tourists vs. residents classification
● Topic (user interest) modeling
● Recommendation system implementation
● Real-time streaming statistic calculation
● Predictive model application for streaming

The Challenge
Main tasks of the project:
● Design and implement the architecture that is able to scale and measure high
volume data traffic
● Real-time requests response
● Use advanced ML supervised and unsupervised techniques
● Extract valuable relevant information (insights) of managed data to deliver
tangible business results to the customers
● Provide user-friendly visualization and presentation of the extracted
information

Data Source
ENGLISH, FRENCH, RUSSIAN
[41.34,2.03,41.45,2.25]
Tweets geolocated in Barcelona Tweets with Barcelona KW
Barcelona
Sagradafa
MWC

Data Source (amount of data)
[41.34,2.03,41.45,2.25]
All languages: 20.000 tweets/day
Only EN, FR, RU: 7.000 tweets/day
All languages: 250.000 tweets/day
Only EN, FR, RU: 80.000 tweets/day
Barcelona
Sagradafa
MWC

Data Collect Layer
Collect
Process

Data Collect Layer: Apache Kafka
Distributed publish-subscribe messaging service
Fault-tolerant
Decoupling, Simplicity, Efficiency
Fast
topics: twittergeobcn, twitterkwbcn, rtstats, rtpredictions

Data Collect Layer
Collect
Process
topics: twittergeobcn, twitterkwbcn

Batch Processing: Pre Process
● Collect
●
Pre
Process
● Read Geolocated Tweets stored in HDFS
● Clean Tweet Text (lowercase, numbers, spaces,tabs,etc..)
● Categorize users (tourist, resident), comparing geolocation of last 200
tweets
● Save in Cassandra for ML processes

Batch Processing: Topic Modelling Process
● Collect
●
TP
Process

Batch Processing: SVM Process
● Collect
●
SVM
Process
Model

Streaming Process
Collect
Stats
Process
topic: twittergeobcn
topic: rtstats
Predict
Process
topic: rtpredictions
Model

Data Analytics
Tasks:
● Geotagged data tourists vs. residents detection algorithm implementation
● Non-geotagged data tourists vs. residents classification with supervised
machine learning
● Topic (user interest) modeling with unsupervised machine learning
● Recommendation system building
● Statistics calculation
● Visualization

Text Preprocessing
● remove url’s;
● remove @ sign tags from the data;
● remove any number characters, e.g. 1 or 3.14 (removeNumbers);
● remove any punctuation characters (removePunctuation);
● convert all text to lower case (tolower);
● include only words that have a minimum character length of 3;
● remove certain stop words from the data;
● reduce words to their ‘stems’, e.g. ‘walk’ is the stem of ‘walking’ and ‘walked’
(stemming);

SVM: data tourists vs. residents classification
Challenge: meanwhile only less than 1% is geotagged, the twitter users have to be
classified for tourists and residents to extract further insights and topics of
interests
Aim: build a predictive model to classify non-geotagged twitter texts to distinguish
tourists from residents.

Dataset: labeled data collection of tweet texts (only from Barcelona) as
independent variable and labels (TRUE for tourist/FALSE for resident) as predictor
variable
Validation protocol:
● Training set (60% of the original dataset) to build up prediction algorithm
● Cross-Validation set (20%) to compare the performances and choose the
algorithm with the best one
● Test set (20%) to apply best prediction algorithm and get an idea about its
performance on unseen data

Prototyping
● Naive Bayes
● Logistic Regression (Maxent)
● k-NN
● SVM

Reasons why SVMs perform well for text categorization
SVMs:
● Acknowledge the particular properties of text: high dimensional feature
spaces, few irrelevant features (dense concept vector), and sparse instance
vectors
● Outperform other techniques substantially and significantly
● Eliminate the need for feature selection, making text categorization
considerably easier
● Are robust and do not require much parameter tuning

Topic Modeling
We use topic modelling to automatically detect topics of interest to Twitter users
previously detected as tourists.
● Uncover the hidden topical structure in tweets.
● Assign topics to users.
● Use these assignments to make targeted recommendation

Topic Modeling
Dataset
● Geolocalized tweets from Barcelona, aggregated by identified tourist
Algorithm: baseline Latent Dirichlet Allocation (LDA)
● Unsupervised learning technique
● Extracts key topics. Each topic is an ordered list of representative words.
● Describes each doc in the corpus based on allocation to the extracted topics.

Topic Modelling : LDA Topics
Topic 0 Topic 1 Topic 2 Topic 3 Topic 4
direct love primavera humid photo
work peopl sound wind love
lip happi festiv cloud beauti
june life drink temperatur hotel
book birthdai night finish camp
market hope plai summer centr
design girl live sant view
chang game stage block beach

Recommendation System
user_id topic word recommendation
6448 sports game Bowling Pedralbes, Camp Nou, Museu del FC Barcelona
7296 festivals festiv Festival el Grec, Sonar
1239 sports plai Bowling Pedralbes, Camp Nou, Museu del FC Barcelona
2980 shopping market Boqueria, La Roca Village, Portal del Angel
3501 nature beach Font Magica, Park Guell, Playa de la Barceloneta

Twitter + Lambda Architecture (Spark, Kafka, FLume, Cassandra) + Machine Learning

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (8)

Similar to Twitter + Lambda Architecture (Spark, Kafka, FLume, Cassandra) + Machine Learning

Similar to Twitter + Lambda Architecture (Spark, Kafka, FLume, Cassandra) + Machine Learning (20)

More from Beatriz Martín @zigiella

More from Beatriz Martín @zigiella (9)

Recently uploaded

Recently uploaded (20)

Twitter + Lambda Architecture (Spark, Kafka, FLume, Cassandra) + Machine Learning