Kurniawan2016

Real-time Traffic Classification with Twitter Data Mining
Dwi Aji Kurniawan (1), Sunu Wibirama (2), Noor Akhmad Setiawan (3)
Department of Electrical Engineering and Information Technology
Universitas Gadjah Mada
Indonesia
(1) dwi.aji.k@mail.ugm.ac.id, (2) sunu@ugm.ac.id, (3) noorwewe@ugm.ac.id
Abstract— The growth of vehicles in Yogyakarta Province,
Indonesia is not proportional to the growth of roads. This problem
causes severe traffic jams on many main roads. Common traffic
anomaly detection using surveillance cameras requires manpower
and is costly, while crowdsourcing mobile applications for traffic
anomaly detection are mostly owned by private companies. This
research aims to develop real-time traffic classification by
harnessing the power of social network data, namely Twitter. In
this study, Twitter data are processed through the stages of
preprocessing, feature extraction, and tweet classification. This
study compares the classification performance of three machine
learning algorithms, namely Naive Bayes (NB), Support Vector
Machine (SVM), and Decision Tree (DT). Experimental results show
that the SVM algorithm produced the best performance among the
three algorithms, with 99.77% and 99.87% classification accuracy
on balanced and imbalanced data, respectively. This research
implies that social network services may be used as an alternative
source for traffic anomaly detection by providing information on
traffic flow conditions in real time.
Keywords— traffic, data mining in Twitter, social network, tweet
classification, machine learning.
I. INTRODUCTION
The growth of vehicles in big cities is not proportional to the
growth of roads. Sooner or later, roads in big cities will become
increasingly jammed. Installing surveillance cameras on some
streets and intersections has been a common approach to real-time
traffic anomaly detection. Nevertheless, this approach requires
manpower to observe the cameras and to locate the spatial position
of the traffic information. On the other hand, location-based
crowdsourcing technology such as Waze (https://www.waze.com) is
currently used as a driver’s companion for route finding. However,
Waze is a proprietary service, so the authorities may find it
difficult to get access to the data.
Social network services have been used to detect traffic
anomalies and events. An approach developed by Sakaki et al.
[1] shows that Twitter detects an event faster than traditional
media. D'Andrea et al. [2] compared the performance of seven
classification algorithms for classifying Italian tweets. Sakaki
et al. [3] proposed four stages to detect locations in Japanese
tweets. Gu et al. [4] classified tweets about traffic in the cities
of Pittsburgh and Philadelphia (USA) using Semi Naive Bayes (SNB)
and Supervised Latent Dirichlet Allocation (sLDA). Gutiérrez et al.
[5] classified English tweets using a Support Vector Machine
(SVM) classifier in the RapidMiner software. These studies used
the local language as their source of information. Moreover, these
studies focused more on traffic events such as accidents, road
work, snow, and road closures, and did not focus on the state of
traffic flow, such as traffic jams, crowded, crowded but smooth,
and smooth.
There have also been some studies in Indonesia on the use
of social networks for monitoring traffic conditions. Research
by Wibisono et al. [6] in Jakarta used a Learning
Vector Quantization (LVQ) neural network to classify tweets
into three classes: low traffic flow, medium traffic flow, and
high traffic flow. The system developed by Wibisono et al. [6]
used tweets from the official account of traffic officers as its
data source. Another study in Bandung by Rodiyansyah and Winarko
[7] classified traffic tweets into four classes (Loss, Current,
Unknown, and Model) using the Naive Bayes and Support Vector
Machine (SVM) algorithms in the RapidMiner software.
However, those previous works [6, 7] did not classify Twitter
data from all users (regular and official) in real time.
In this research, we propose a novel real-time traffic
classification that classifies Twitter data into the traffic or
non_traffic category. Classification was validated using ten-fold
cross-validation to measure the accuracy, precision, recall, and
F-score of the classifiers on the dataset.
II. DATA ACQUISITION
A dataset of tweets about Yogyakarta Province, Indonesia was
used in this research to build the classification model. We
categorized tweets into traffic and non_traffic. The data were
collected over seven consecutive days.
In the first stage, traffic tweets were collected from seven
official traffic monitoring Twitter accounts, namely
@ATCS_DIY, @atcs_kotasmrg, @atcs_kotatgr,
@atcs_pekalongan, @ntmclantaspolri, @tmcpoldametro, and
@tmcpolressemara. The data were collected using the Twitter
REST API. Non_traffic tweets were removed from the tweets of those
accounts, so only tweets related to traffic conditions
were considered. In the second stage, tweets were collected
using the Twitter Streaming API from the selected users in Table I
and the keywords in Table II. We labeled the tweets as traffic or
non_traffic manually.
2016 8th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia
978-1-5090-4139-8/16/$31.00 ©2016 IEEE
III. PROPOSED METHOD
The flowchart of the proposed method is shown in Fig. 1.
Fig. 1. Flowchart of the proposed method.
A. Tweet Collection from Twitter Streaming API
New tweets were collected from the Twitter Streaming API in
real time. We used several parameters of the Twitter
Streaming API, such as the follow and track parameters. The follow
parameter was used to get new tweets in real time from several
accounts, as shown in Table I. The track parameter was used to get
new tweets in real time based on the keywords defined in Table II.
TABLE I
TWITTER USERNAMES AND IDS USED IN FOLLOW PARAMETER
Twitter Username Twitter User ID
@lalinjogja 250022672
@RTMC_Jogja 187397386
@ATCS_DIY 1118238337
@twit_macet 4675666764
@JogjaUpdate 128175561
@Jogja24Jam 537556372
@infojogja 106780531
@YogyakartaCity 62327666
@JogjaMedia 454564576
@tribunjogja 223476605
@unisifmyk 201720189
TABLE II
KEYWORDS USED IN TRACK PARAMETER
Yogyakarta        Jogjakarta     Jogja
Yogya             Adisutjipto    Adi Sutjipto
lalinjogja        RTMC_Jogja     ATCS_DIY
jogjaupdate       jogja24jam     infojogja
yogyakartacity    jogjamedia     tribunjogja
unisifmyk         UGM            UII
UNY               UMY            lalinyk
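As a rough illustration of how the follow and track parameters could be assembled from Table I and Table II, consider the sketch below. The function name `build_filter_params` is hypothetical, only a subset of the table entries is shown, and no network request is made. The comma-separated encoding reflects how the filtered streaming endpoint accepted these parameters at the time; the authors' actual client code is not given in the paper.

```python
# A subset of the accounts in Table I (username -> user ID).
FOLLOW_IDS = {
    "lalinjogja": "250022672",
    "RTMC_Jogja": "187397386",
    "ATCS_DIY": "1118238337",
}

# A subset of the keywords in Table II.
TRACK_KEYWORDS = ["Yogyakarta", "Jogjakarta", "Jogja", "Adi Sutjipto", "lalinjogja"]


def build_filter_params(follow_ids, track_keywords):
    """Return request parameters for a filtered stream (hypothetical helper).

    follow: comma-separated user IDs whose tweets should be delivered.
    track:  comma-separated keywords or phrases to match in tweet text.
    """
    return {
        "follow": ",".join(follow_ids),
        "track": ",".join(track_keywords),
    }


params = build_filter_params(FOLLOW_IDS.values(), TRACK_KEYWORDS)
print(params["follow"])
print(params["track"])
```

The follow parameter takes numeric user IDs rather than usernames, which is why Table I lists both.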
B. Preprocessing
Preprocessing was applied to tweets to remove the
parts that were not needed in the next stages [8]. The
preprocessing steps in this study were as follows:
a) Removing "RT". At this step, we used the regular
expression "RT\s" to find the occurrences of "RT".
b) Converting all letters in a tweet to lowercase.
c) Removing website addresses in the tweet. At this step, we
used the regular expression "\shttp.+\s".
d) Removing Twitter username. At this step, we used
regular expression "@[a-zA-Z0-9_]+".
e) Removing non-alphanumeric characters (anything other than
letters and numbers). At this step, we used the regular
expression "[^a-zA-Z0-9]".
f) Replacing abbreviations with their full phrases. We
replaced abbreviations that frequently appeared in tweets.
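The preprocessing steps a) to f) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the URL pattern `http\S+` is a simplified stand-in for the expression given in step c), a word boundary is added in front of "RT", removed characters are replaced by a space (which matches "09.55" becoming "09 55" in Table IV), and the abbreviation map contains only the two expansions visible in Table IV, since the full list is not given.

```python
import re

# Only these two expansions can be read off Table IV; the authors' full
# abbreviation list is not published, so this map is an assumption.
ABBREVIATIONS = {"lalin": "lalu lintas", "sp": "simpang"}


def preprocess(tweet: str) -> str:
    """Apply preprocessing steps a) to f) to a single tweet text."""
    text = re.sub(r"\bRT\s", " ", tweet)          # a) drop the retweet marker
    text = text.lower()                           # b) lowercase
    text = re.sub(r"http\S+", " ", text)          # c) drop URLs (assumed pattern)
    text = re.sub(r"@[a-zA-Z0-9_]+", " ", text)   # d) drop Twitter usernames
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)     # e) drop non-alphanumerics
    words = [ABBREVIATIONS.get(w, w) for w in text.split()]  # f) expand
    return " ".join(words)


print(preprocess(
    "09.55 wib lalin seputaran sp condongcatur ramai lancar "
    "https://t.co/HRwTeIzlyt"
))
```

URLs and usernames are removed before the non-alphanumeric filter; otherwise the filter would break them into fragments that the earlier patterns could no longer recognize.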
C. Feature Extraction
This research used two types of feature extraction. The first
used all words in the dataset as features. The second
used only a few selected words as features. Features
were selected by their frequency in the dataset: we selected the
words that appeared most frequently in the traffic tweets dataset
[9]. The steps were as follows:
a) Applying the preprocessing steps to the traffic tweets
dataset.
b) Counting word occurrences in the dataset.
c) Sorting the words by frequency, most frequent first.
d) Taking the 50 most frequent words.
e) Removing unneeded words, such as person names, place
names, etc.
f) Removing words of two letters or fewer.
A dictionary containing the words and their occurrence counts
in a tweet was used as input to the classifier.
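The selection steps and the dictionary features can be illustrated with the short sketch below. The function names `select_features` and `extract_features` and the `excluded` set are hypothetical; in the paper, the removal of person and place names in step e) appears to have been a manual decision, so here it is passed in as an explicit set.

```python
from collections import Counter


def select_features(traffic_tweets, top_n=50, excluded=frozenset()):
    """Steps b) to f): pick frequent words from preprocessed traffic tweets."""
    counts = Counter(word for tweet in traffic_tweets for word in tweet.split())
    top_words = [word for word, _ in counts.most_common(top_n)]   # b), c), d)
    return [
        word for word in top_words
        if word not in excluded                                   # e)
        and len(word) > 2                                         # f)
    ]


def extract_features(tweet, vocabulary=None):
    """Return a word -> occurrence-count dictionary for one tweet.

    With vocabulary=None all words are used as features; otherwise only
    the selected words are kept.
    """
    counts = Counter(tweet.split())
    if vocabulary is None:
        return dict(counts)
    return {word: n for word, n in counts.items() if word in vocabulary}


tweets = ["lalu lintas jogja ramai lancar", "lalu lintas di jogja padat"]
vocab = select_features(tweets, top_n=50, excluded={"jogja"})
print(vocab)
print(extract_features("lalu lintas ramai ramai", set(vocab)))
```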
D. Classifier Model Building with Machine Learning
Algorithms
Tweets were classified into two categories, namely tweets
that were related to traffic (traffic) and tweets that were not
related to traffic (non_traffic). This classification was intended
to separate tweets about traffic from other tweets. Three
machine learning algorithms, namely Naïve Bayes (NB),
Support Vector Machine (SVM), and Decision Tree (DT), were
used in this research.
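To make the classification step concrete, the sketch below implements one of the three algorithms, Naive Bayes, over word-count dictionaries. It is an illustrative assumption rather than the authors' model: the paper does not state which Naive Bayes variant or library was used, so a multinomial model with add-one (Laplace) smoothing is shown, and the class name `NaiveBayes` and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, feature_dicts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word totals per class
        self.vocab = set()
        for feats, label in zip(feature_dicts, labels):
            self.word_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for word, count in feats.items():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                p = (self.word_counts[label][word] + 1) / (
                    total_words + len(self.vocab)
                )
                score += count * math.log(p)       # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train = [
    ({"lalu": 1, "lintas": 1, "lancar": 1}, "traffic"),
    ({"arus": 1, "lalu": 1, "lintas": 1, "padat": 1}, "traffic"),
    ({"tryout": 1, "sma": 1, "jogja": 1}, "non_traffic"),
    ({"konser": 1, "jogja": 1}, "non_traffic"),
]
model = NaiveBayes().fit([f for f, _ in train], [y for _, y in train])
print(model.predict({"lalu": 1, "lintas": 1, "ramai": 1}))
```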
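Four measures were used to evaluate the performance [9] of
each machine learning algorithm. Let TP, FP, FN, and TN denote
the numbers of true positives, false positives, false negatives,
and true negatives, respectively.
1) Accuracy: Accuracy is the fraction of classification
results that are correct. The formula is
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$
2) Precision: Precision is the fraction of the predicted
documents in a class that are correct. The formula is
$$\text{Precision} = \frac{TP}{TP + FP}$$
3) Recall: Recall is the fraction of documents in a class that
are correctly predicted by the system. The formula is
$$\text{Recall} = \frac{TP}{TP + FN}$$
4) F-score: F-score is a weighted harmonic mean of
precision and recall. We use the balanced F-score with formula
$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
We measured precision, recall, and F-score only for the traffic
class. We calculated all evaluation measures through ten-fold
cross-validation.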
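The four measures can be computed from confusion counts as in the following sketch, where the traffic class is treated as the positive class. The function names and the example counts are hypothetical and serve only to illustrate the standard definitions in [9].

```python
def balanced_f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def evaluate(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F-score) for the positive class.

    tp, fp, fn, tn are the counts of true positives, false positives,
    false negatives, and true negatives.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall, balanced_f_score(precision, recall)


# Made-up counts, purely for illustration.
acc, prec, rec, f1 = evaluate(tp=90, fp=10, fn=5, tn=95)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} F={f1:.4f}")
```

As a consistency check, `balanced_f_score(0.9965, 0.9989)` applied to the precision and recall reported for SVM on the balanced dataset gives about 0.9977, which agrees with the F-score reported in Table VI.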
IV. EXPERIMENTAL RESULTS
A. Tweets Data Acquisition
In the tweet data acquisition stage, we collected 110,449 tweets
in total. These data were used to build the classification
model. The 110,449 tweets consisted of 17,592 tweets in the
traffic class and 92,857 tweets in the non_traffic class.
B. Tweet Collection from Twitter Streaming API
TABLE III
TWEETS FROM TWITTER STREAMING API
Date and Time          Tweet Text
2016-03-15 11:31:36    UN 2016 : Tryout di SMA Muhammadiyah 3 Jogja Diikuti Ribuan Peserta https://t.co/4bMhW4xoow https://t.co/zwpfMm8A57
2016-04-30 09:55:44    09.55 wib lalin seputaran sp condongcatur ramai lancar https://t.co/HRwTeIzlyt
The Twitter Streaming API was the real-time data source for our
system. With the Twitter Streaming API, Twitter sent tweet objects
in JavaScript Object Notation (JSON) form as soon as a
tweet matched our follow and track parameters. A tweet JSON object
contains many variables, but we only used the
created_at and text variables. Table III shows examples of tweets
received by our system.
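A minimal sketch of reading the two variables from a tweet JSON object is given below. The sample payload is constructed by hand for illustration, the function name `parse_tweet` is hypothetical, and the conversion to Western Indonesian Time (UTC+7) is an assumption made so that the output resembles the timestamps in Table III; the paper does not describe how timestamps were converted.

```python
import json
from datetime import datetime, timedelta, timezone

WIB = timezone(timedelta(hours=7))  # Western Indonesian Time (assumed target zone)

# created_at layout used in tweet objects, e.g. "Sat Apr 30 02:55:44 +0000 2016"
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_tweet(raw_json):
    """Return (local timestamp string, tweet text) from one tweet JSON string."""
    tweet = json.loads(raw_json)
    created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
    local = created.astimezone(WIB)
    return local.strftime("%Y-%m-%d %H:%M:%S"), tweet["text"]


# Hand-built sample payload; a real tweet object has many more fields.
sample = json.dumps({
    "created_at": "Sat Apr 30 02:55:44 +0000 2016",
    "text": "09.55 wib lalin seputaran sp condongcatur ramai lancar",
})
print(parse_tweet(sample))
```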
C. Preprocessing
The preprocessing stage was used to prepare the tweet text before
it was processed in the next stages. The preprocessing
steps were explained in the previous section. An example of
the preprocessing result is shown in Table IV.
TABLE IV
PREPROCESSING RESULT
Original Tweet:      UN 2016 : Tryout di SMA Muhammadiyah 3 Jogja Diikuti Ribuan Peserta https://t.co/4bMhW4xoow https://t.co/zwpfMm8A57
After Preprocessing: un 2016 tryout di sma muhammadiyah 3 jogja diikuti ribuan peserta

Original Tweet:      09.55 wib lalin seputaran sp condongcatur ramai lancar https://t.co/HRwTeIzlyt
After Preprocessing: 09 55 wib lalu lintas seputaran simpang condongcatur ramai lancar
D. Feature Extraction
The feature extraction process counted the occurrences of all the
words in a tweet as features. The dictionary that contained the
words and their occurrences was used to train the classifier. A
dictionary is a set of data in key-value form. The keys and the
values were the words and their occurrence counts, respectively.
Feature extraction using only a few selected words was
preceded by a feature selection process. In the feature selection
process, we obtained 40 words as features, as shown in Table V.
TABLE V
LIST OF FEATURES
Bahasa Indonesia   English                               Bahasa Indonesia   English
antrian            queue                                 maupun             although
arah               direction                             mengarah           directing
arus               flow                                  menuju             heading
atau               or                                    pada               on
barat              west                                  padat              congested
cerah              sunny                                 patuhi             obey
cuaca              weather                               pukul              o'clock
dalam              in                                    ramai              crowded
dan                and                                   rambu              sign
dari               from                                  sebaliknya         opposite
informasikan       inform                                selatan            south
jalan              road/street                           semua              all
kaki               foot                                  seputaran          around
kami               we                                    simpang            intersection
kendaraan          vehicle                               situasi            situation
kondisi            condition                             terpantau          observed
kota               city                                  tetap              still
lalu               part of phrase “traffic” in Bahasa    timur              east
lancar             smooth                                utara              north
lintas             part of phrase “traffic” in Bahasa    wib                Western Indonesian Time
E. Development of Classifier Model using Machine Learning
Algorithms
The dataset used in this research comprised 110,449
tweets, consisting of 17,592 traffic tweets and 92,857 non_traffic
tweets. The dataset was imbalanced between classes. Thus the
classification model was evaluated
on both the imbalanced dataset and a balanced dataset. Evaluation
on the imbalanced dataset used all 110,449 tweets.
Evaluation on the balanced dataset used 35,184
tweets, consisting of 17,592 traffic tweets and 17,592 non_traffic
tweets. The 17,592 non_traffic tweets were selected randomly
from the 92,857 non_traffic tweets.
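The construction of the balanced dataset corresponds to random undersampling of the majority class, which can be sketched as follows. The function name `undersample`, the fixed seed, and the small stand-in lists are assumptions for illustration; the paper states only that the non_traffic tweets were selected randomly.

```python
import random


def undersample(minority, majority, seed=0):
    """Randomly pick as many majority items as there are minority items."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sampled_majority = rng.sample(majority, k=len(minority))
    return minority + sampled_majority


# Small stand-in lists; in the paper the sizes are 17,592 and 92,857.
traffic = [f"traffic tweet {i}" for i in range(5)]
non_traffic = [f"other tweet {i}" for i in range(20)]

balanced = undersample(traffic, non_traffic)
print(len(balanced))
```

With the paper's class sizes, the same procedure gives 17,592 + 17,592 = 35,184 tweets.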
TABLE VI
EVALUATION MEASUREMENT OF BALANCED DATASET (35,184 TWEETS)
Feature          Model   Accuracy   Precision   Recall    F-score
All words        NB      99.37%     99.10%      99.62%    99.36%
All words        SVM     99.77%     99.65%      99.89%    99.77%
All words        DT      99.48%     99.44%      99.52%    99.48%
Selected words   NB      98.02%     96.32%      99.71%    97.99%
Selected words   SVM     98.31%     97.23%      99.37%    98.29%
Selected words   DT      98.41%     97.52%      99.28%    98.39%
Table VI shows that for the balanced dataset with 35,184 tweets
and all words as features, SVM yielded the best performance on
all four measures, with 99.77% accuracy.
However, using only selected words as features, DT yielded
the best accuracy, precision, and F-score.
TABLE VII
EVALUATION MEASUREMENT OF IMBALANCED DATASET (110,449 TWEETS)
Feature          Model   Accuracy   Precision   Recall    F-score
All words        NB      99.76%     98.94%      99.52%    99.23%
All words        SVM     99.87%     99.41%      99.80%    99.60%
All words        DT      99.70%     98.76%      99.34%    99.05%
Selected words   NB      99.23%     95.74%      99.39%    97.53%
Selected words   SVM     99.23%     96.43%      98.68%    97.54%
Selected words   DT      99.42%     96.75%      99.57%    98.14%
For the imbalanced dataset with 110,449 tweets and all words
as features, SVM yielded the best performance on all
measures, as shown in Table VII, while DT produced the best
accuracy with only selected words as features. From Table VI
and Table VII, we can see that the amount of data affected
classification performance. The larger dataset produced higher
accuracy for every model, although precision and F-score for the
traffic class were slightly lower. Furthermore, the improvement
depends on the implemented algorithm. For instance, SVM was
found to be quite sensitive to imbalanced datasets [10]. The
strong results of SVM with all words as features are consistent
with SVM not needing feature selection to improve accuracy [11].
In contrast, feature selection affected DT performance, since too
many and too specific features produce unneeded tree branches that
cause overfitting [12].
TABLE VIII
TRAINING TIME OF MODELS
Feature          Model   Training time, 35,184 tweets (seconds)   Training time, 110,449 tweets (seconds)
All words        NB      1.068                                    2.129
All words        SVM     1.510                                    4.011
All words        DT      3.660                                    18.560
Selected words   NB      1.335                                    4.642
Selected words   SVM     2.332                                    7.483
Selected words   DT      1.372                                    4.793
We evaluated training time as another aspect of the classification
models. The training time displayed in Table VIII is the average
training time over the ten folds of cross-validation. As shown in
Table VIII, NB had the fastest training time because of its simple
model building. Moreover, DT training time was greatly influenced by
the amount of data and the number of features.
After evaluating the three machine learning algorithms, the
imbalanced dataset with 110,449 tweets and feature extraction
using all words as features were used in the application. This
dataset was selected because it produced higher accuracy than the
balanced dataset with 35,184 tweets. Moreover, the real data fetched
from the Twitter Streaming API are highly imbalanced between the
traffic and non_traffic categories. Extraction of all words as
features was used because it produced better accuracy than using
only selected words as features.
V. CONCLUSION
This research aimed to develop real-time traffic tweet
classification for Yogyakarta Province (Indonesia). We evaluated
three machine learning algorithms to find the best algorithm for
classifying tweet data in real time. On both the imbalanced and
balanced datasets, the Support Vector Machine (SVM) algorithm
produced the best performance using all words as features, while
the Decision Tree (DT) algorithm yielded the best performance using
only selected words as features. Experimental results show that the
SVM algorithm produced the best performance among the three
algorithms, with 99.77% and 99.87% classification accuracy
on balanced and imbalanced data, respectively. We also found
that the feature selection algorithm used in this research did not
improve accuracy. Furthermore, feature selection and the
amount of data affected the performance of the classification model.
Further research is needed to investigate an appropriate approach
for better classification regardless of the amount of mined Twitter
data.
REFERENCES
[1] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes Twitter users:
real-time event detection by social sensors,” in Proceedings of the 19th
international conference on World wide web, 2010, pp. 851–860.
[2] E. D’Andrea, P. Ducange, B. Lazzerini, and F. Marcelloni, “Real-Time
Detection of Traffic From Twitter Stream Analysis,” IEEE Trans. Intell.
Transp. Syst., vol. 16, no. 4, pp. 2269–2283, Aug. 2015.
[3] T. Sakaki, Y. Matsuo, T. Yanagihara, N. P. Chandrasiri, and K. Nawa,
“Real-time event extraction for driving information from social sensors,”
in 2012 IEEE International Conference on Cyber Technology in
Automation, Control, and Intelligent Systems (CYBER), 2012, pp. 221–
226.
[4] Y. Gu, Z. (Sean) Qian, and F. Chen, “From Twitter to detector: Real-time
traffic incident detection using social media data,” Transp. Res. Part C
Emerg. Technol., vol. 67, pp. 321–342, Jun. 2016.
[5] C. Gutiérrez, P. Figuerias, P. Oliveira, R. Costa, and R. Jardim-
Goncalves, “Twitter mining for traffic events detection,” in Science and
Information Conference (SAI), 2015, 2015, pp. 371–378.
[6] A. Wibisono, I. Sina, M. A. Ihsannuddin, A. Hafizh, B. Hardjono, A.
Nurhadiyatna, W. Jatmiko, and P. Mursanto, “Traffic intelligent system
architecture based on social media information,” in 2012 International
Conference on Advanced Computer Science and Information Systems
(ICACSIS), 2012, pp. 25–30.
[7] S. F. Rodiyansyah and E. Winarko, “Klasifikasi Posting Twitter
Kemacetan Lalu Lintas Kota Bandung Menggunakan Naive Bayesian
Classification,” IJCCS-Indones. J. Comput. Cybern. Syst., vol. 7, no. 1,
pp. 13–22, 2013.
[8] N. Monarizqa, L. E. Nugroho, and B. S. Hantono, “Penerapan Analisis
Sentimen pada Twitter Berbahasa Indonesia sebagai Pemberi Rating,”
Universitas Gadjah Mada, Perpustakaan Pusat UGM, 2014.

[9] C. D. Manning, P. Raghavan, H. Schütze, and others, Introduction to
information retrieval, vol. 1. Cambridge university press Cambridge,
2008.
[10] R. Batuwita and V. Palade, “Class Imbalance Learning Methods for
Support Vector Machines,” in “Imbalanced Learning: Foundations,
Algorithms, and Applications”, Haibo He and Yunqian Ma (Eds.), Wiley,
(book chapter), 2013.
[11] T. Joachims, “Text Categorization with Support Vector Machines:
Learning with Many Relevant Features,” in European Conference on
Machine Learning (ECML), Berlin, 1998, pp. 137–142.
[12] R. Garreta and G. Moncecchi, Learning scikit-learn: machine learning in
Python : experience the benefits of machine learning techniques by
applying them to real-world problems using Python and the open source
scikit-learn library. 2013.
