2. French Election 2017 – Twitter Data Analysis
MS Engineering
in Computer Science
Data Mining
2016/2017
THIS IS THE STORY!
• The 2017 French presidential election was held on 23 April and
7 May 2017.
• As no candidate won a majority in the first round, a run-off was
held between the top two candidates, Emmanuel Macron of
“En Marche!” and Marine Le Pen of the “National Front (FN)”,
which Macron won by a decisive margin on 7 May.
OUTLINE
1. Presentation of Data
2. Tableau
3. Clustering Analysis
4. GDELT PROJECT
TWITTER DATASET
• Obtained from the Kaggle website in .sqlite format
• Standard fields of Twitter datasets extracted with the Streaming API
• Candidate mentions
• Load, preprocess and enrich the dataset
• Convert timestamps from milliseconds to %D %H:%M format
• Map the Location field to City-Country using GeoText
• Sentiment analysis using the vaderSentiment Python package
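
The timestamp conversion step can be sketched as follows. This is a minimal illustration, assuming the dataset stores Unix timestamps in milliseconds and that UTC is acceptable; the function name `format_timestamp_ms` is hypothetical and not taken from the project code.

```python
from datetime import datetime, timezone


def format_timestamp_ms(timestamp_ms: int) -> str:
    """Convert a Unix timestamp in milliseconds to 'MM/DD/YY HH:MM' (UTC)."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    # "%m/%d/%y" is the portable spelling of "%D", which some platforms lack.
    return moment.strftime("%m/%d/%y %H:%M")


# 23 April 2017 at 18:00 UTC, expressed in milliseconds (illustrative value).
example_ms = 1492970400000
print(format_timestamp_ms(example_ms))  # 04/23/17 18:00
```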
LET’S START WITH TABLEAU!
• Produces interactive data visualization products focused
on business intelligence
• Version: Tableau Desktop
• Data storage: local, using an Extract
• Connected to Excel files
WHO, WHEN AND WHERE
DID PEOPLE TWEET?
CLUSTER ANALYSIS
• Main objective: group similar objects
• Document clustering
• scikit-learn (sklearn) and NLTK packages
• Filter out retweets (235021 tweets remain)
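
One simple way to filter retweets is sketched below. It assumes retweets can be recognized by the conventional "RT @" prefix in the tweet text; the actual dataset may expose a dedicated field instead, and the sample tweets are invented for illustration.

```python
def is_retweet(text: str) -> bool:
    """Heuristic: classic retweets start with 'RT @user'."""
    return text.lstrip().startswith("RT @")


# Invented sample tweets, not taken from the dataset.
tweets = [
    "RT @someone: Macron ahead in the polls",
    "Watching the debate tonight #Presidentielle2017",
    "RT @other: Le Pen rally in Lyon",
    "Who are you voting for on Sunday?",
]

originals = [tweet for tweet in tweets if not is_retweet(tweet)]
print(len(originals), "original tweets kept")
```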
K-Means
• Create a TF-IDF matrix
• Term frequency
• Inverse document frequency
• Run K-Means multiple times with different K ∈ [2,8]
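
The procedure above can be sketched with scikit-learn as follows. This is only an illustration on a tiny invented corpus: the vectorizer settings, `n_init`, and `random_state` are assumptions, not the parameters used in the project.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus standing in for the filtered tweets.
documents = [
    "macron wins the debate",
    "macron leads the polls",
    "le pen rally in paris",
    "le pen speech on europe",
    "fillon scandal in the news",
    "fillon campaign continues",
    "france votes on sunday",
    "france election first round",
    "melenchon rally in marseille",
    "hamon campaign in lille",
]

# Rows are documents, columns are terms, values are TF-IDF weights.
tfidf = TfidfVectorizer().fit_transform(documents)

# Run K-Means once per K in [2, 8] and record how many documents fall in each cluster.
cluster_sizes = {}
for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = model.fit_predict(tfidf)
    cluster_sizes[k] = Counter(labels.tolist())
    print(k, sorted(cluster_sizes[k].items()))
```

Comparing the cluster sizes for each K is what produces distributions like the ones reported on the following slides.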
K-Means
Num_cluster 3 Data Distribution
Cluster0 -> 168032
Cluster1 -> 30611
Cluster2 -> 36378
K-Means
Num_cluster 6 Data Distribution
Cluster0 -> 89636
Cluster1 -> 34058
Cluster2 -> 25070
Cluster3 -> 29517
Cluster4 -> 45524
Cluster5 -> 11216
K-Means
• Execution time
• 1:30 to 2 min per run
• There is one dominant cluster
• Repeated terms: macron, lepen, france
• Sparse data (according to cosine similarity)
• Little information about the clusters
TOPIC MODELLING
• Discover the abstract topics in a collection of documents
• Two different approaches
• Latent Semantic Analysis
• Non-Negative Matrix Factorization
Latent Semantic Analysis
• Analyzes relationships between documents and the terms they contain
• Uses singular value decomposition (SVD) to reduce the dimensionality of the space
• TruncatedSVD (scikit-learn)
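
A minimal LSA sketch with scikit-learn's `TruncatedSVD` is shown below. It assumes the reduced vectors are then clustered with K-Means, which is how the cluster distributions on the next slides appear to have been obtained; the corpus, `n_components=3`, and the normalization step are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Invented toy corpus standing in for the filtered tweets.
documents = [
    "macron wins the debate",
    "macron leads the polls",
    "le pen rally in paris",
    "le pen speech on europe",
    "fillon scandal in the news",
    "fillon campaign continues",
    "france votes on sunday",
    "france election first round",
]

tfidf = TfidfVectorizer().fit_transform(documents)

# Project the sparse TF-IDF matrix onto 3 latent dimensions, then L2-normalize
# so that Euclidean K-Means behaves like clustering by cosine similarity.
lsa = make_pipeline(TruncatedSVD(n_components=3, random_state=0), Normalizer(copy=False))
reduced = lsa.fit_transform(tfidf)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)
print(reduced.shape, labels.tolist())
```

Because K-Means now runs on a small dense matrix instead of the full sparse vocabulary, each run is much cheaper.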
LSA
Num_cluster 2 Data Distribution
Cluster0 -> 135344
Cluster1 -> 99677
LSA
Num_cluster 4 Data Distribution
Cluster0 -> 39063
Cluster1 -> 36906
Cluster2 -> 124231
Cluster3 -> 34821
LSA
Num_cluster 5 Data Distribution
Cluster0 -> 37545
Cluster1 -> 36851
Cluster2 -> 75845
Cluster3 -> 34046
Cluster4 -> 50734
LSA
• Reduced execution time
• Approx. 20 s per run
• One large cluster remains for n_cluster < 5
• Data is more concentrated and more evenly distributed across clusters
• Still little information about the clusters
Non-Negative Matrix Factorization
• Decompose V into two non-negative matrices, W and H, with k topics (V ≈ W·H)
• V -> term-document matrix
• W -> term-topic matrix
• H -> topic-document matrix
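
The decomposition can be sketched with scikit-learn's `NMF`. Note that scikit-learn works on the transposed layout: its input is a document-term matrix, so `fit_transform` returns a document-topic matrix (the transpose of H above) and `components_` holds a topic-term matrix (the transpose of W). The corpus and parameters below are illustrative assumptions.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus standing in for the filtered tweets.
documents = [
    "macron wins the debate",
    "macron leads the polls",
    "le pen rally in paris",
    "le pen speech on europe",
    "fillon scandal in the news",
    "fillon campaign continues",
]

# Document-term matrix, i.e. the transpose of V in the slide's notation.
doc_term = TfidfVectorizer().fit_transform(documents)

nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
doc_topic = nmf.fit_transform(doc_term)  # documents x topics (H transposed)
topic_term = nmf.components_             # topics x terms (W transposed)

# Assign each document to its strongest topic to get a cluster-like distribution.
labels = doc_topic.argmax(axis=1)
print(doc_topic.shape, topic_term.shape, labels.tolist())
```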
Non-Negative Matrix Factorization
Num_cluster 3 Data Distribution
Cluster0 -> 87637
Cluster1 -> 74663
Cluster2 -> 72721
Non-Negative Matrix Factorization
Num_cluster 4 Data Distribution
Cluster0 -> 67791
Cluster1 -> 66108
Cluster2 -> 59462
Cluster3 -> 41660
Non-Negative Matrix Factorization
Num_cluster 6 Data Distribution
Cluster0 -> 59772
Cluster1 -> 57190
Cluster2 -> 50852
Cluster3 -> 37733
Cluster4 -> 16251
Cluster5 -> 13223
NMF TOPICS
• Information about the topics
• Top words per topic
• Not all the terms are repeated across topics
• Terms are more evenly distributed
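
Extracting the top words of each topic amounts to sorting each row of a topic-term weight matrix. The sketch below uses a small hand-written matrix with made-up weights purely for illustration; `top_terms` is a hypothetical helper, not part of the project code.

```python
import numpy as np


def top_terms(topic_term, terms, n_top=3):
    """Return the n_top highest-weighted terms for each topic (row)."""
    ranked = np.argsort(topic_term, axis=1)[:, ::-1][:, :n_top]
    return [[terms[i] for i in row] for row in ranked]


# Hypothetical vocabulary and weights (rows: topics, columns: terms).
terms = ["macron", "lepen", "france", "debate", "vote", "fillon"]
topic_term = np.array([
    [0.9, 0.1, 0.3, 0.7, 0.0, 0.0],
    [0.1, 0.8, 0.2, 0.0, 0.1, 0.6],
])

print(top_terms(topic_term, terms))
# [['macron', 'debate', 'france'], ['lepen', 'fillon', 'france']]
```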
Non-Negative Matrix Factorization - KMeans
• Reduced execution time
• Less than 10 seconds
• Data is distributed among the clusters
• More information about the topics and clustering
• Some data remains sparse
GDELT PROJECT
• GDELT monitors the world's news media from nearly every corner of every country in print, broadcast, and web formats, in over 100 languages, every moment of every day
• Uses natural language processing and data mining algorithms
GDELT DATASET
• Data monitored from 1979 to the present
• New information added every 15 minutes
• Full dataset split by date
• 57 fields (date, actors, actions, events, location, etc.)
APACHE FLINK
• Apache Flink® is an open-source stream processing
framework for distributed, high-performing, always-
available, and accurate data streaming applications
• Java
APPROACH
• Analyze the events in France during the elections
• Look for patterns linking events and tweets
• Important fields: <date, actor, action, event, location>
ANALYSIS
• Daily evolution of events
• Trending leaders
• Trending events per day
MapReduce
• Java API
• Multidimensional tuples
• MapReduce transformations
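
The map and reduce steps for counting events per day can be illustrated as follows. The slides use Flink's Java API; this sketch is plain Python and does not call Flink, and the record layout `(date, actor, event)` and sample values are hypothetical simplifications of the GDELT fields.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical simplified records: (date, actor, event).
events = [
    ("20170423", "ACTOR_A", "EVENT_X"),
    ("20170423", "ACTOR_B", "EVENT_Y"),
    ("20170424", "ACTOR_A", "EVENT_X"),
    ("20170507", "ACTOR_C", "EVENT_Z"),
    ("20170507", "ACTOR_A", "EVENT_X"),
    ("20170507", "ACTOR_B", "EVENT_Y"),
]


def map_phase(records):
    """Map: emit a (key, 1) tuple for every record, keyed by date."""
    return [(date, 1) for date, _actor, _event in records]


def reduce_phase(pairs):
    """Reduce: sum the values of all tuples that share the same key."""
    def add(totals, pair):
        key, value = pair
        totals[key] += value
        return totals

    return dict(reduce(add, pairs, defaultdict(int)))


events_per_day = reduce_phase(map_phase(events))
print(sorted(events_per_day.items()))
# [('20170423', 2), ('20170424', 1), ('20170507', 3)]
```

Changing the key emitted in the map step (for example to the actor) gives the other aggregations listed in the analysis.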
EVOLUTION OF EVENTS
• Number of events per day in France during the elections
EVOLUTION OF EVENTS
• Number of tweets per day
TRENDING LEADERS
• Leaders with the highest number of mentions
TRENDING LEADERS
• Top 1000 entries -> the notable results are in the first 30 records
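
Ranking leaders by mentions reduces to counting and keeping the top N. A minimal sketch, using placeholder names rather than real results:

```python
from collections import Counter

# Placeholder mentions; the real input would be the leader field of each event.
mentions = [
    "LEADER_A", "LEADER_B", "LEADER_A", "LEADER_C",
    "LEADER_A", "LEADER_B", "LEADER_D",
]

# Keep the N most mentioned leaders (N = 1000 in the slides, 2 here).
top_leaders = Counter(mentions).most_common(2)
print(top_leaders)  # [('LEADER_A', 3), ('LEADER_B', 2)]
```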
GDELT Conclusion
• Political themes peak on the same days as the tweet peaks
• They are among the 30 most mentioned
• A strong presence of political themes in the media could increase the use of Twitter
• François Fillon is the only candidate among the 1000 most mentioned leaders
THANK YOU!
LINKS
• CODE:
• https://github.com/dieguer22/DataMiningProject