Twitter content-based Recommendation System
Barcelona Tourist City Monitor & Insights
01.07.2016
#MACHINELEARNING #SPARK #KAFKA #CASSANDRA
Juan Pablo López
Rodica Fazakas
Yulia Zvyagelskaya
Beatriz Martín
BIG DATA MANAGEMENT AND ANALYTICS
POSTGRADUATE COURSE - FINAL PROJECT
The Challenge
Build a content-based recommendation system that provides real-time personalized
recommendations to social media users, together with insight visualization for the
tourism and smart-city sectors
The product is aimed at:
● Small and medium-sized companies in the tourism sector, both B2B and B2C
(leisure/travel, tour operators, online tourist portals, retail, HoReCa,
etc.)
● City and neighborhood public departments and administrations
● Event agencies and managers
● Advertising and marketing agencies
The Challenge
Businesses continue to invest their budgets in targeted social media advertising.
The Challenge
Aims of the project:
● Twitter data collection and management
● Tourists vs. residents classification
● Topic (user interest) modeling
● Recommendation system implementation
● Real-time streaming statistic calculation
● Predictive model application for streaming
The Challenge
Main tasks of the project:
● Design and implement an architecture that can scale to, and measure, high-volume
data traffic
● Respond to requests in real time
● Use advanced supervised and unsupervised ML techniques
● Extract valuable, relevant information (insights) from the managed data to deliver
tangible business results to customers
● Provide user-friendly visualization and presentation of the extracted
information
Data
Data Source
● Languages: English, French, Russian
● Tweets geolocated in Barcelona, bounding box [41.34,2.03,41.45,2.25]
● Tweets with Barcelona keywords (KW): Barcelona, Sagradafa, MWC
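The two collection criteria can be sketched as simple filters. This is only an illustration: the (south, west, north, east) reading of the bounding box, the `sources` helper, and the shape of the tweet dictionary are assumptions, not details given in the slides.

```python
# Bounding box as listed on the slide; the (south, west, north, east)
# ordering is an assumption.
BCN_BBOX = (41.34, 2.03, 41.45, 2.25)
# Keywords exactly as they appear on the slide, lowercased for matching.
KEYWORDS = ("barcelona", "sagradafa", "mwc")


def in_bbox(lat, lon, bbox=BCN_BBOX):
    """True if the point lies inside the bounding box."""
    south, west, north, east = bbox
    return south <= lat <= north and west <= lon <= east


def matches_keyword(text, keywords=KEYWORDS):
    """True if any keyword occurs in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(kw in lowered for kw in keywords)


def sources(tweet):
    """Return which collection criteria a tweet satisfies.

    `tweet` is a hypothetical dict with "text" and optional "coords" (lat, lon).
    """
    matched = []
    coords = tweet.get("coords")
    if coords is not None and in_bbox(*coords):
        matched.append("geo")
    if matches_keyword(tweet.get("text", "")):
        matched.append("keyword")
    return matched
```

A tweet can satisfy both criteria, so the helper returns a list rather than a single label.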
Data Source (amount of data)
● Tweets geolocated in Barcelona, bounding box [41.34,2.03,41.45,2.25]:
  all languages 20,000 tweets/day; only EN, FR, RU 7,000 tweets/day
● Tweets with Barcelona keywords (Barcelona, Sagradafa, MWC):
  all languages 250,000 tweets/day; only EN, FR, RU 80,000 tweets/day
Data Management
Cluster topology
Architecture
Data Collect Layer
[Diagram: Collect and Process stages]
Data Collect Layer: Apache Kafka
● Distributed publish-subscribe messaging service
● Fault-tolerant
● Decoupling, simplicity, efficiency
● Fast
● Topics: twittergeobcn, twitterkwbcn, rtstats, rtpredictions
Data Collect Layer
[Diagram: Collect and Process stages; topics: twittergeobcn, twitterkwbcn]
Data Collection: Apache Flume
Processing Analytics Layer
Batch Processing: Pre Process
[Diagram: Collect and Pre Process stages]
● Read geolocated tweets stored in HDFS
● Clean tweet text (lowercase; remove numbers, spaces, tabs, etc.)
● Categorize users (tourist, resident) by comparing the geolocation of their last 200
tweets
● Save in Cassandra for the ML processes
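The user categorization step might look roughly like the sketch below. The slides only say that the geolocation of the last 200 tweets is compared; the decision rule used here (share of recent tweets inside the Barcelona bounding box), the `resident_share` threshold, and the function names are assumptions made for illustration.

```python
# Bounding box from the slides; (south, west, north, east) order is assumed.
BCN_BBOX = (41.34, 2.03, 41.45, 2.25)


def in_barcelona(lat, lon, bbox=BCN_BBOX):
    """True if the point lies inside the Barcelona bounding box."""
    south, west, north, east = bbox
    return south <= lat <= north and west <= lon <= east


def categorize_user(coords, last_n=200, resident_share=0.5):
    """Label a user from the coordinates of their most recent geolocated tweets.

    coords: list of (lat, lon) tuples, oldest first.
    resident_share: hypothetical threshold, not taken from the slides.
    """
    recent = coords[-last_n:]
    if not recent:
        return "unknown"
    inside = sum(1 for lat, lon in recent if in_barcelona(lat, lon))
    return "resident" if inside / len(recent) >= resident_share else "tourist"
```

A user who tweets mostly from inside the box is treated as a resident; one whose recent history is mostly elsewhere is treated as a tourist.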
Batch Processing: Topic Modelling Process
[Diagram: Collect and TP Process stages]
Batch Processing: SVM Process
[Diagram: Collect and SVM Process stages; Model]
Streaming Process
[Diagram: Collect; Stats Process (topic: twittergeobcn, topic: rtstats); Predict Process (topic: rtpredictions, Model)]
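The slides do not say which statistics the Stats Process computes. As one plausible illustration, the sketch below counts tweets per language in fixed one-minute windows; the choice of statistic, the window length, and the `(timestamp, language)` event shape are all assumptions.

```python
from collections import Counter, defaultdict


def window_counts(events, window_seconds=60):
    """Count tweets per language in fixed (tumbling) time windows.

    events: iterable of (unix_timestamp, language_code) pairs (hypothetical shape).
    Returns {window_start_timestamp: Counter({language_code: count})}.
    """
    windows = defaultdict(Counter)
    for ts, lang in events:
        # Align each event to the start of the window it falls into.
        window_start = int(ts) - int(ts) % window_seconds
        windows[window_start][lang] += 1
    return dict(windows)
```

In a streaming job the same aggregation would run incrementally per micro-batch; here it is applied to a finite list to keep the example self-contained.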
API Layer
[Diagram: REST API, Dashboard HTML]
Data Analytics
Tasks:
● Implement a tourist vs. resident detection algorithm for geotagged data
● Classify tourists vs. residents in non-geotagged data with supervised
machine learning
● Model topics (user interests) with unsupervised machine learning
● Build the recommendation system
● Calculate statistics
● Visualize the results
Text Preprocessing
● remove URLs;
● remove @ tags (mentions) from the data;
● remove any numeric characters, e.g. 1 or 3.14 (removeNumbers);
● remove any punctuation characters (removePunctuation);
● convert all text to lower case (tolower);
● keep only words with a minimum length of 3 characters;
● remove certain stop words from the data;
● reduce words to their ‘stems’, e.g. ‘walk’ is the stem of ‘walking’ and ‘walked’
(stemming).
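The steps above can be sketched in a few lines of standard-library Python. The stop-word list is a tiny illustrative sample and `naive_stem` is a crude suffix stripper standing in for a real stemmer; neither is taken from the slides, which do not name the tools used beyond the step labels.

```python
import re

# Tiny illustrative stop-word list; the deck does not say which list was used.
STOP_WORDS = {"the", "and", "for", "with", "this", "that", "was", "are"}


def naive_stem(word):
    """Very rough suffix stripper, a stand-in for a real stemmer such as Porter."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Apply the listed cleaning steps in order and return a list of stems."""
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove @ tags
    text = re.sub(r"\d+(?:\.\d+)?", " ", text)  # remove numbers such as 1 or 3.14
    text = re.sub(r"[^\w\s]|_", " ", text)      # remove punctuation
    text = text.lower()                         # convert to lower case
    tokens = [t for t in text.split()
              if len(t) >= 3 and t not in STOP_WORDS]  # length and stop-word filters
    return [naive_stem(t) for t in tokens]      # stemming
```

For example, `preprocess("Walking and walked 3 times to the MARKET with @anna! See https://t.co/xyz")` yields `["walk", "walk", "time", "market", "see"]`.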
SVM: data tourists vs. residents classification
Challenge: fewer than 1% of tweets are geotagged, yet Twitter users still have to be
classified as tourists or residents before further insights and topics of
interest can be extracted
Aim: build a predictive model that classifies non-geotagged tweet texts as written by
tourists or by residents.
SVM: data tourists vs. residents classification
Dataset: a labeled collection of tweet texts (only from Barcelona) as the
independent (predictor) variable, with labels (TRUE for tourist/FALSE for resident)
as the dependent (target) variable
Validation protocol:
● Training set (60% of the original dataset) to fit the prediction algorithms
● Cross-validation set (20%) to compare the algorithms' performance and choose the
best one
● Test set (20%) to apply the best prediction algorithm and estimate its
performance on unseen data
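A minimal sketch of the 60/20/20 protocol, assuming the dataset is a list of `(text, label)` pairs and that a seeded random shuffle is acceptable; the function name and the seed are illustrative choices, not taken from the slides.

```python
import random


def split_60_20_20(examples, seed=42):
    """Shuffle and split examples into training, cross-validation and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n = len(shuffled)
    n_train = n * 60 // 100  # integer arithmetic avoids float rounding surprises
    n_val = n * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # remainder goes to the test set
    return train, val, test
```

Each candidate algorithm is fitted on `train`, compared on `val`, and only the chosen one is evaluated once on `test`.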
SVM: data tourists vs. residents classification
Prototyping
● Naive Bayes
● Logistic Regression (Maxent)
● k-NN
● SVM
SVM: data tourists vs. residents classification
Reasons why SVMs perform well for text categorization
SVMs:
● Suit the particular properties of text: high-dimensional feature
spaces, few irrelevant features (dense concept vector), and sparse instance
vectors
● Have been reported to outperform other techniques substantially and significantly
on text categorization tasks
● Largely remove the need for feature selection, making text categorization
considerably easier
● Are robust and do not require much parameter tuning
Topic Modeling
We use topic modelling to automatically detect the topics of interest of Twitter users
previously identified as tourists.
● Uncover the hidden topical structure in tweets.
● Assign topics to users.
● Use these assignments to make targeted recommendations.
Topic Modeling
Dataset
● Geolocated tweets from Barcelona, aggregated per identified tourist
Algorithm: baseline Latent Dirichlet Allocation (LDA)
● Unsupervised learning technique
● Extracts key topics. Each topic is an ordered list of representative words.
● Describes each document in the corpus by its allocation to the extracted topics.
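For reference, the standard LDA generative model can be written as follows. The notation (K topics, D documents, Dirichlet hyperparameters \alpha and \beta, per-document topic mixture \theta_d, per-topic word distribution \phi_k) is the usual textbook one and is not taken from the slides.

```latex
\begin{align*}
\phi_k   &\sim \mathrm{Dirichlet}(\beta),            & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha),           & d &= 1, \dots, D \\
z_{d,n}  &\sim \mathrm{Categorical}(\theta_d),       & n &= 1, \dots, N_d \\
w_{d,n}  &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Here a document d is the aggregated tweets of one tourist, the ordered word list of topic k corresponds to the largest entries of \phi_k, and \theta_d is the allocation of that tourist's text to the extracted topics.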
Topic Modelling: LDA Topics
Topic 0   Topic 1    Topic 2     Topic 3      Topic 4
direct    love       primavera   humid        photo
work      peopl      sound       wind         love
lip       happi      festiv      cloud        beauti
june      life       drink       temperatur   hotel
book      birthdai   night       finish       camp
market    hope       plai        summer       centr
design    girl       live        sant         view
chang     game       stage       block        beach
Recommendation System
user_id | topic     | word   | recommendation
6448    | sports    | game   | Bowling Pedralbes, Camp Nou, Museu del FC Barcelona
7296    | festivals | festiv | Festival el Grec, Sonar
1239    | sports    | plai   | Bowling Pedralbes, Camp Nou, Museu del FC Barcelona
2980    | shopping  | market | Boqueria, La Roca Village, Portal del Angel
3501    | nature    | beach  | Font Magica, Park Guell, Playa de la Barceloneta
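The table suggests a lookup from a representative word to a topic label and from the topic to a list of places. The sketch below reproduces only the pairs shown in the table; the matching rule (most frequent topic among a user's stemmed words) and the `recommend` helper are assumptions, since the slides do not describe how the mapping is applied.

```python
from collections import Counter

# Word -> topic label and topic -> places, copied from the table in the slides.
WORD_TO_TOPIC = {
    "game": "sports",
    "plai": "sports",
    "festiv": "festivals",
    "market": "shopping",
    "beach": "nature",
}
TOPIC_TO_PLACES = {
    "sports": ["Bowling Pedralbes", "Camp Nou", "Museu del FC Barcelona"],
    "festivals": ["Festival el Grec", "Sonar"],
    "shopping": ["Boqueria", "La Roca Village", "Portal del Angel"],
    "nature": ["Font Magica", "Park Guell", "Playa de la Barceloneta"],
}


def recommend(user_words):
    """Return (topic, places) for the most frequent known topic among the words.

    The "most frequent topic" rule is a hypothetical choice for illustration.
    """
    counts = Counter(WORD_TO_TOPIC[w] for w in user_words if w in WORD_TO_TOPIC)
    if not counts:
        return None, []
    topic, _ = counts.most_common(1)[0]
    return topic, TOPIC_TO_PLACES[topic]
```

For instance, `recommend(["festiv"])` returns `("festivals", ["Festival el Grec", "Sonar"])`, matching the row for user 7296.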
DEMO
Thank you!

Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 

Twitter + Lambda Architecture (Spark, Kafka, Flume, Cassandra) + Machine Learning

  • 1. Twitter content-based Recommendation System: Barcelona Tourist City Monitor & Insights. 01.07.2016. #MACHINELEARNING #SPARK #KAFKA #CASSANDRA. Juan Pablo López, Rodica Fazakas, Yulia Zvyagelskaya, Beatriz Martín. BIG DATA MANAGEMENT AND ANALYTICS POSTGRADUATE COURSE - FINAL PROJECT
  • 2. The Challenge: Build a content-based recommendation system that provides real-time personalized recommendations to social media users, plus insights visualization for the tourism and smart city sectors. The product is addressed to: ● Small and medium-sized companies connected to the tourism sector, both B2B and B2C (Leisure/Travel, Tour operators, tourist online portals, Retail, HoReCa, etc.) ● City and neighborhood public departments and administrations ● Event agencies and managers ● Advertising and marketing agencies
  • 3. The Challenge: Businesses continue to invest in targeted social media advertising:
  • 4. The Challenge. Aims of the project: ● Twitter data collection and management ● Tourists vs. residents classification ● Topic (user interest) modeling ● Recommendation system implementation ● Real-time streaming statistics calculation ● Predictive model application for streaming
  • 5. The Challenge. Main tasks of the project: ● Design and implement an architecture that is able to scale and measure high-volume data traffic ● Respond to requests in real time ● Use advanced supervised and unsupervised ML techniques ● Extract valuable, relevant information (insights) from the managed data to deliver tangible business results to the customers ● Provide user-friendly visualization and presentation of the extracted information
  • 8. Data Source. Languages: English, French, Russian. Two streams: tweets geolocated in Barcelona (bounding box [41.34, 2.03, 41.45, 2.25]) and tweets with Barcelona keywords (Barcelona, Sagradafa, MWC)
  • 9. Data Source (amount of data). Tweets geolocated in Barcelona (bounding box [41.34, 2.03, 41.45, 2.25]): 20,000 tweets/day in all languages, 7,000 tweets/day in EN, FR, RU only. Tweets with Barcelona keywords (Barcelona, Sagradafa, MWC): 250,000 tweets/day in all languages, 80,000 tweets/day in EN, FR, RU only
  • 16. Data Collect Layer: Apache Kafka ● Distributed publish-subscribe messaging service ● Fault-tolerant ● Decoupling, simplicity, efficiency ● Fast. Topics: twittergeobcn, twitterkwbcn, rtstats, rtpredictions
  • 17. Data Collect Layer: Collect Process (topics: twittergeobcn, twitterkwbcn)
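The two collection streams and their Kafka topics can be illustrated with a small routing function. This is only a sketch under stated assumptions: the tweet dictionary shape (`lang`, `coords`, `text`), the reading of the bounding box as [south, west, north, east], and the idea that the language filter is applied at collection time are not confirmed by the slides. The topic names and keywords are copied from the slides; no Kafka client is used here.

```python
# Bounding box from the slides, read as (south, west, north, east): an assumption.
BCN_BBOX = (41.34, 2.03, 41.45, 2.25)
# Keywords exactly as they appear on the slide, lowercased for matching.
KEYWORDS = ("barcelona", "sagradafa", "mwc")
# English, French, Russian as ISO codes (hypothetical field values).
LANGS = {"en", "fr", "ru"}


def in_bbox(lat, lon, bbox=BCN_BBOX):
    """Return True when the point lies inside the bounding box."""
    south, west, north, east = bbox
    return south <= lat <= north and west <= lon <= east


def route(tweet):
    """Return the list of topic names a tweet would be published to.

    `tweet` is a hypothetical dict with keys "lang", "coords" (a
    (lat, lon) tuple or None) and "text".
    """
    if tweet.get("lang") not in LANGS:
        return []
    topics = []
    coords = tweet.get("coords")
    if coords is not None and in_bbox(*coords):
        topics.append("twittergeobcn")
    text = tweet.get("text", "").lower()
    if any(keyword in text for keyword in KEYWORDS):
        topics.append("twitterkwbcn")
    return topics
```

A tweet can land in both topics when it is geolocated in the city and also mentions a keyword.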
  • 22. Batch Processing: Pre Process ● Collect ● Pre Process ● Read geolocated tweets stored in HDFS ● Clean tweet text (lowercase, numbers, spaces, tabs, etc.) ● Categorize users (tourist, resident) by comparing the geolocation of their last 200 tweets ● Save in Cassandra for ML processes
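The user categorization step compares the geolocation of a user's last 200 tweets, but the slides do not state the decision rule. The sketch below assumes one plausible rule: a user is a resident when at least half of those tweets fall inside the Barcelona bounding box. The `resident_share` threshold, the function names, and the (south, west, north, east) reading of the box are all hypothetical.

```python
# Bounding box from the slides, read as (south, west, north, east): an assumption.
BCN_BBOX = (41.34, 2.03, 41.45, 2.25)


def in_bbox(lat, lon, bbox=BCN_BBOX):
    """Return True when the point lies inside the bounding box."""
    south, west, north, east = bbox
    return south <= lat <= north and west <= lon <= east


def categorize_user(tweet_coords, last_n=200, resident_share=0.5):
    """Label a user from the (lat, lon) pairs of their tweets, oldest first.

    Only the most recent `last_n` tweets are considered. Returns
    "resident", "tourist", or None when there are no geolocated tweets.
    The 0.5 threshold is an illustrative assumption, not the authors' rule.
    """
    recent = tweet_coords[-last_n:]
    if not recent:
        return None
    inside = sum(1 for lat, lon in recent if in_bbox(lat, lon))
    share = inside / len(recent)
    return "resident" if share >= resident_share else "tourist"
```

A user with a handful of tweets in Barcelona and most of their history elsewhere is labeled a tourist under this rule.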
  • 23. Batch Processing: Topic Modelling Process ● Collect ● TP Process
  • 24. Batch Processing: SVM Process ● Collect ● SVM Process Model
  • 25. Streaming Process ● Collect (topic: twittergeobcn) ● Stats Process (topic: rtstats) ● Predict Process with Model (topic: rtpredictions)
  • 30. Data Analytics Tasks: ● Implement a tourists vs. residents detection algorithm for geotagged data ● Classify tourists vs. residents in non-geotagged data with supervised machine learning ● Model topics (user interests) with unsupervised machine learning ● Build the recommendation system ● Calculate statistics ● Visualization
  • 31. Text Preprocessing ● remove URLs; ● remove @ sign tags from the data; ● remove any number characters, e.g. 1 or 3.14 (removeNumbers); ● remove any punctuation characters (removePunctuation); ● convert all text to lower case (tolower); ● include only words that have a minimum character length of 3; ● remove certain stop words from the data; ● reduce words to their ‘stems’, e.g. ‘walk’ is the stem of ‘walking’ and ‘walked’ (stemming);
  • 32. SVM: data tourists vs. residents classification. Challenge: less than 1% of the data is geotagged, yet Twitter users have to be classified as tourists or residents in order to extract further insights and topics of interest. Aim: build a predictive model that classifies non-geotagged tweet texts to distinguish tourists from residents.
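The preprocessing steps above can be sketched as a single function that applies them in the listed order. Several details are assumptions: the stop word list is a tiny illustrative one, @ tags are removed together with the user name, and `toy_stem` is a deliberately crude suffix stripper standing in for a real stemmer (it will not reproduce stems such as "peopl" or "happi" seen later in the topic table).

```python
import re

# Tiny illustrative stop word list; the real list is not given in the slides.
STOP_WORDS = {"the", "and", "for", "with", "this", "that", "are", "was"}


def toy_stem(word):
    """Strip a few common suffixes. A stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Turn a raw tweet into a list of cleaned, stemmed tokens."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)                   # remove @ tags
    text = re.sub(r"\d+", " ", text)                    # remove digits
    text = re.sub(r"[^\w\s]|_", " ", text)              # remove punctuation
    text = text.lower()                                 # lower case
    tokens = [t for t in text.split() if len(t) >= 3]   # minimum length 3
    tokens = [t for t in tokens if t not in STOP_WORDS] # stop words
    return [toy_stem(t) for t in tokens]                # stemming
```

URLs are removed before punctuation so that their fragments do not survive as tokens.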
  • 33. SVM: data tourists vs. residents classification. Dataset: labeled collection of tweet texts (only from Barcelona), with the tweet text as the independent (predictor) variable and the label (TRUE for tourist/FALSE for resident) as the dependent (target) variable. Validation protocol: ● Training set (60% of the original dataset) to build the prediction algorithms ● Cross-validation set (20%) to compare their performance and choose the best algorithm ● Test set (20%) to apply the best prediction algorithm and estimate its performance on unseen data
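The 60/20/20 validation protocol can be sketched as a shuffled three-way split. The function name, the fixed seed, and the use of plain Python lists are assumptions for illustration; the slides do not say how the split was implemented.

```python
import random


def split_dataset(rows, seed=42, train=0.6, validation=0.2):
    """Shuffle rows and split them into train / validation / test parts.

    The test share is whatever remains after the train and validation
    shares (20% with the defaults).
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is repeatable
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * validation)
    train_set = rows[:n_train]
    val_set = rows[n_train:n_train + n_val]
    test_set = rows[n_train + n_val:]
    return train_set, val_set, test_set
```

Each candidate model is fitted on the training part, compared on the validation part, and only the chosen model is scored once on the test part.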
  • 34. SVM: data tourists vs. residents classification Prototyping ● Naive Bayes ● Logistic Regression (Maxent) ● k-NN ● SVM
  • 35. SVM: data tourists vs. residents classification. Reasons why SVMs perform well for text categorization. SVMs: ● Acknowledge the particular properties of text: high dimensional feature spaces, few irrelevant features (dense concept vector), and sparse instance vectors ● Outperform other techniques substantially and significantly ● Eliminate the need for feature selection, making text categorization considerably easier ● Are robust and do not require much parameter tuning
  • 36. Topic Modeling. We use topic modelling to automatically detect topics of interest to Twitter users previously detected as tourists. ● Uncover the hidden topical structure in tweets. ● Assign topics to users. ● Use these assignments to make targeted recommendations.
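To make the SVM step concrete, here is a minimal linear SVM on binary bag-of-words features, trained with a Pegasos-style sub-gradient update on the hinge loss. It is a stand-in written for illustration, not the project's implementation: the slides do not name the SVM library or kernel, and the six training texts below are invented toy data. Labels follow the slide's convention (tourist is the positive class).

```python
def train_linear_svm(samples, labels, lam=0.01, epochs=20):
    """Train weights for a linear SVM without a bias term.

    samples: list of token lists; labels: +1 (tourist) or -1 (resident).
    Uses the Pegasos step size eta = 1 / (lam * t).
    """
    w = {}
    t = 0
    for _ in range(epochs):
        for tokens, y in zip(samples, labels):
            t += 1
            eta = 1.0 / (lam * t)
            features = set(tokens)  # binary features: each word counts once
            margin = y * sum(w.get(tok, 0.0) for tok in features)
            scale = 1.0 - 1.0 / t   # equals 1 - eta * lam (regularization)
            for tok in w:
                w[tok] *= scale
            if margin < 1.0:        # hinge loss is active: move toward y * x
                for tok in features:
                    w[tok] = w.get(tok, 0.0) + eta * y
    return w


def predict_tourist(w, tokens):
    """Return True for tourist, False for resident."""
    return sum(w.get(tok, 0.0) for tok in set(tokens)) > 0


# Invented toy data, already preprocessed into tokens.
TOY_SAMPLES = [
    ["hotel", "beach", "tour"], ["hotel", "museum", "ticket"],
    ["hotel", "sagrada", "visit"], ["work", "office", "metro"],
    ["work", "home", "rent"], ["work", "school", "kids"],
]
TOY_LABELS = [1, 1, 1, -1, -1, -1]
```

Usage: `w = train_linear_svm(TOY_SAMPLES, TOY_LABELS)` followed by `predict_tourist(w, ["hotel", "beach"])`. Words never seen in training contribute nothing to the score.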
  • 37. Topic Modeling. Dataset: ● Geolocated tweets from Barcelona, aggregated by identified tourist. Algorithm: baseline Latent Dirichlet Allocation (LDA) ● Unsupervised learning technique ● Extracts key topics. Each topic is an ordered list of representative words. ● Describes each document in the corpus by its allocation to the extracted topics.
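As an illustration of what LDA computes, the following is a small collapsed Gibbs sampler in plain Python. It is a teaching sketch, not the project's pipeline: the slides do not say which LDA implementation or hyperparameters were used, so `alpha`, `beta`, the iteration count, and all function names are assumptions. Each document is meant to be the token list of all tweets by one identified tourist.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []
    for d, doc in enumerate(docs):  # random initial topic for every token
        zs = [rng.randrange(num_topics) for _ in doc]
        assign.append(zs)
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]    # take the token out of the counts
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z    # put it back under the sampled topic
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=8):
    """Ordered list of the n most frequent words for each topic."""
    return [sorted(tw, key=tw.get, reverse=True)[:n] for tw in topic_word]


def dominant_topic(doc_topic_row):
    """Index of the topic with the most tokens in one document."""
    return max(range(len(doc_topic_row)), key=doc_topic_row.__getitem__)
```

`top_words` corresponds to the "ordered list of representative words" per topic, and `dominant_topic` to assigning a topic to a user.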
  • 38. Topic Modelling: LDA Topics
      Topic 0   Topic 1    Topic 2     Topic 3      Topic 4
      direct    love       primavera   humid        photo
      work      peopl      sound       wind         love
      lip       happi      festiv      cloud        beauti
      june      life       drink       temperatur   hotel
      book      birthdai   night       finish       camp
      market    hope       plai        summer       centr
      design    girl       live        sant         view
      chang     game       stage       block        beach
  • 39. Recommendation System
      user_id   topic       word     recommendation
      6448      sports      game     Bowling Pedralbes, Camp Nou, Museu del FC Barcelona
      7296      festivals   festiv   Festival el Grec, Sonar
      1239      sports      plai     Bowling Pedralbes, Camp Nou, Museu del FC Barcelona
      2980      shopping    market   Boqueria, La Roca Village, Portal del Angel
      3501      nature      beach    Font Magica, Park Guell, Playa de la Barceloneta
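The recommendation table can be read as two lookups: a stemmed word maps to a topic label, and the topic label maps to a list of places. The sketch below encodes only the rows shown on the slide. How the system picks the word for a user is not stated, so taking the first known word from the user's word list is an assumption, as are the names `WORD_TO_TOPIC`, `RECOMMENDATIONS`, and `recommend`.

```python
# Topic label -> places, copied from the slide's table.
RECOMMENDATIONS = {
    "sports": ["Bowling Pedralbes", "Camp Nou", "Museu del FC Barcelona"],
    "festivals": ["Festival el Grec", "Sonar"],
    "shopping": ["Boqueria", "La Roca Village", "Portal del Angel"],
    "nature": ["Font Magica", "Park Guell", "Playa de la Barceloneta"],
}

# Stemmed word -> topic label, copied from the slide's table.
WORD_TO_TOPIC = {
    "game": "sports",
    "plai": "sports",
    "festiv": "festivals",
    "market": "shopping",
    "beach": "nature",
}


def recommend(user_words):
    """Return (topic, places) for the first word with a known topic.

    Returns (None, []) when none of the user's words is in the table.
    """
    for word in user_words:
        topic = WORD_TO_TOPIC.get(word)
        if topic is not None:
            return topic, RECOMMENDATIONS[topic]
    return None, []
```

For example, a user whose representative words start with "festiv" gets the festival recommendations.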
  • 40. DEMO