Real-time Traffic Classification with Twitter Data Mining
Dwi Aji Kurniawan (1), Sunu Wibirama (2), Noor Akhmad Setiawan (3)
Department of Electrical Engineering and Information Technology
Universitas Gadjah Mada
Indonesia
(1) dwi.aji.k@mail.ugm.ac.id, (2) sunu@ugm.ac.id, (3) noorwewe@ugm.ac.id
Abstract— The growth in the number of vehicles in Yogyakarta Province,
Indonesia, is not proportional to the growth of roads. This problem
causes severe traffic jams on many main roads. Common traffic
anomaly detection using surveillance cameras requires manpower
and is costly, while crowdsourcing mobile applications for traffic
anomaly detection are mostly privately owned. This research
aims to develop real-time traffic classification by harnessing
social network data from Twitter. In this study, Twitter data
pass through the stages of preprocessing, feature extraction,
and tweet classification. The study compares the classification
performance of three machine learning algorithms, namely Naive
Bayes (NB), Support Vector Machine (SVM), and Decision Tree
(DT). Experimental results show that the SVM algorithm produced
the best performance among the three algorithms, with 99.77%
and 99.87% classification accuracy on balanced and imbalanced
data, respectively. This research implies that social network
services may be used as an alternative source for traffic anomaly
detection by providing information on traffic flow conditions in
real time.
Keywords— traffic, data mining in Twitter, social network, tweet
classification, machine learning.
I. INTRODUCTION
The growth in the number of vehicles in big cities is not proportional
to the growth of roads. Sooner or later, roads in big cities will be
increasingly jammed. Installing surveillance cameras on
some streets and intersections has been a common approach to
real-time traffic anomaly detection. Nevertheless, this
approach requires manpower to observe the cameras and to
determine the spatial position of the traffic information. On the other
hand, location-based crowdsourcing technology such as Waze
(https://www.waze.com) is currently used as a driver's companion
for route finding. However, Waze is a proprietary service, so
the authorities may find it difficult to get access to the data.
Social network services have been used to detect traffic
anomalies and events. An approach developed by Sakaki et al.
[1] shows that Twitter detects an event faster than traditional
media. D'Andrea et al. [2] compared the performance of seven
classification algorithms to classify Italian tweets. Sakaki et al.
[3] proposed four stages to detect locations in Japanese tweets.
Gu et al. [4] classified tweets about traffic in the cities of Pittsburgh
and Philadelphia (USA) using Semi-Naive Bayes (SNB) and
Supervised Latent Dirichlet Allocation (sLDA). Gutiérrez et al.
[5] classified English tweets using a Support Vector Machine
(SVM) classifier in RapidMiner software. These studies used
the local language as their source of information. Moreover, these
studies focused on traffic events such as accidents, road
work, snow, and road closures, but not on the state of
traffic flow, such as jammed, crowded, crowded but
smooth, and smooth.
There have also been some studies in Indonesia on the use
of social networks for monitoring traffic conditions. Research
by Wibisono et al. [6] in Jakarta used a Learning
Vector Quantization (LVQ) neural network to classify tweets
into three classes: low traffic flow, medium traffic flow, and
high traffic flow. The system developed by Wibisono et al. [6]
used tweets from the official account of traffic officers as its data
source. Another study in Bandung by Rodiyansyah and Winarko
[7] classified traffic tweets into four classes (Loss, Current, Unknown,
and Model) using the Naive Bayes and Support Vector
Machine (SVM) algorithms in RapidMiner software.
However, these previous works [6, 7] did not classify Twitter
data from all users (regular and official) in real time.
In this research, we propose a novel real-time traffic
classification method that classifies Twitter data into the traffic or
non_traffic category. Classification was validated using ten-fold
cross-validation to measure the accuracy, precision, recall, and
F-score of the classifiers on the dataset.
II. DATA ACQUISITION
A dataset of tweets about Yogyakarta Province, Indonesia, was
used in this research to build the classification model. We
categorized tweets into traffic and non_traffic. The data were
collected over seven consecutive days.
In the first stage, traffic tweets were collected from seven
official traffic monitoring Twitter accounts, namely
@ATCS_DIY, @atcs_kotasmrg, @atcs_kotatgr,
@atcs_pekalongan, @ntmclantaspolri, @tmcpoldametro, and
@tmcpolressemara. The data were collected using the Twitter
REST API. Non_traffic tweets were removed from the tweets of
those accounts, so that only tweets related to traffic conditions
were considered. In the second stage, tweets were collected
using the Twitter Streaming API from the users and keywords listed in Table I and
2016 8th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia
978-1-5090-4139-8/16/$31.00 ©2016 IEEE
Table II. We labeled the tweets as traffic or non_traffic
manually.
III. PROPOSED METHOD
The flowchart of the proposed method is shown in Fig. 1.
Fig. 1. Flowchart of the proposed method.
A. Tweet Collection from Twitter Streaming API
New tweets were collected from the Twitter Streaming API in
real time. We used two parameters of the Twitter
Streaming API: follow and track. The follow
parameter was used to get new tweets in real time from the
accounts shown in Table I. The track parameter was used to get
new tweets in real time based on the keywords defined in Table II.
TABLE I
TWITTER USERNAMES AND IDS USED IN FOLLOW PARAMETER
Twitter Username Twitter User ID
@lalinjogja 250022672
@RTMC_Jogja 187397386
@ATCS_DIY 1118238337
@twit_macet 4675666764
@JogjaUpdate 128175561
@Jogja24Jam 537556372
@infojogja 106780531
@YogyakartaCity 62327666
@JogjaMedia 454564576
@tribunjogja 223476605
@unisifmyk 201720189
TABLE II
KEYWORDS USED IN TRACK PARAMETER
Yogyakarta        Jogjakarta     Jogja
Yogya             Adisutjipto    Adi Sutjipto
lalinjogja        RTMC_Jogja     ATCS_DIY
jogjaupdate       jogja24jam     infojogja
yogyakartacity    jogjamedia     tribunjogja
unisifmyk         UGM            UII
UNY               UMY            lalinyk
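As an illustration, the values in Table I and Table II could be assembled into request parameters as sketched below. The sketch assumes the comma-separated encoding that the Streaming API filter endpoint accepts for both follow and track; the helper name build_filter_params is hypothetical, and no network request is made.

```python
# User IDs from Table I (follow parameter).
FOLLOW_IDS = {
    "lalinjogja": 250022672,
    "RTMC_Jogja": 187397386,
    "ATCS_DIY": 1118238337,
    "twit_macet": 4675666764,
    "JogjaUpdate": 128175561,
    "Jogja24Jam": 537556372,
    "infojogja": 106780531,
    "YogyakartaCity": 62327666,
    "JogjaMedia": 454564576,
    "tribunjogja": 223476605,
    "unisifmyk": 201720189,
}

# Keywords from Table II (track parameter).
TRACK_KEYWORDS = [
    "Yogyakarta", "Jogjakarta", "Jogja",
    "Yogya", "Adisutjipto", "Adi Sutjipto",
    "lalinjogja", "RTMC_Jogja", "ATCS_DIY",
    "jogjaupdate", "jogja24jam", "infojogja",
    "yogyakartacity", "jogjamedia", "tribunjogja",
    "unisifmyk", "UGM", "UII",
    "UNY", "UMY", "lalinyk",
]


def build_filter_params(follow_ids, track_keywords):
    """Hypothetical helper: encode both parameters as comma-separated strings."""
    return {
        "follow": ",".join(str(user_id) for user_id in follow_ids),
        "track": ",".join(track_keywords),
    }


params = build_filter_params(FOLLOW_IDS.values(), TRACK_KEYWORDS)
print(params["follow"])
print(params["track"])
```

A tweet is delivered when it matches either parameter, which is why the stream contains many non_traffic tweets that merely mention Yogyakarta.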
B. Preprocessing
The preprocessing stage was applied to tweets to remove
parts that were not needed in the next stages [8]. The
preprocessing steps in this study were as follows:
a) Removing "RT". At this step, we used the regular
expression "RT\s" to find occurrences of "RT".
b) Converting all letters in a tweet to lowercase.
c) Removing website addresses in the tweet. At this step, we
used the regular expression "\shttp.+\s".
d) Removing Twitter usernames. At this step, we used the
regular expression "@[a-zA-Z0-9_]+".
e) Removing non-alphanumeric characters (anything other than
letters and digits). At this step, we used the regular
expression "[^a-zA-Z0-9]".
f) Expanding abbreviations into their full phrases. We
expanded abbreviations that frequently appeared in tweets.
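The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the URL pattern https?://\S+ is a stand-in chosen so that URLs at the end of a tweet are also removed, non-alphanumeric characters are replaced by a space (inferred from Table IV, where "09.55" becomes "09 55"), and the abbreviation map holds only the two entries visible in Table IV because the full list is not given.

```python
import re

# Assumption: only the two abbreviations visible in Table IV are listed here.
ABBREVIATIONS = {"lalin": "lalu lintas", "sp": "simpang"}


def preprocess(tweet: str) -> str:
    """Apply steps a) to f) to one tweet and return the cleaned text."""
    text = re.sub(r"\bRT\s", "", tweet)            # a) remove the retweet marker
    text = text.lower()                            # b) lowercase
    text = re.sub(r"https?://\S+", " ", text)      # c) remove URLs (assumed pattern)
    text = re.sub(r"@[a-zA-Z0-9_]+", " ", text)    # d) remove usernames
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)      # e) drop non-alphanumerics
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]  # f) expand
    return " ".join(words)


print(preprocess(
    "09.55 wib lalin seputaran sp condongcatur ramai lancar "
    "https://t.co/HRwTeIzlyt"
))
```

Removing the retweet marker before lowercasing keeps the pattern from matching ordinary lowercase words that end in "rt".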
C. Feature Extraction
This research used two types of feature extraction. The first
used all words in the dataset as features. The second
used only a few selected words as features. Features
were selected by their frequency in the dataset: we selected the
words that appeared most frequently in the traffic tweets dataset
[9]. The steps were as follows:
a) Applying the preprocessing steps to the traffic tweets
dataset.
b) Counting word occurrences in the dataset.
c) Sorting the words by frequency, most frequent first.
d) Taking the 50 most frequent words.
e) Removing unneeded words, such as person names, place
names, etc.
f) Removing words that had two letters or fewer.
A dictionary containing the words and their occurrence counts
in a tweet was used as input to the classifier.
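The selection steps and the per-tweet dictionary can be sketched as below. The function names, the excluded_words argument, and the toy tweets are hypothetical; the sketch assumes the tweets have already been preprocessed and that words are separated by whitespace.

```python
from collections import Counter


def select_features(traffic_tweets, top_n=50, excluded_words=frozenset()):
    """Steps b) to f): keep the most frequent words, minus names and short words."""
    counts = Counter(word for tweet in traffic_tweets for word in tweet.split())
    most_frequent = [word for word, _ in counts.most_common(top_n)]
    return [
        word for word in most_frequent
        if word not in excluded_words and len(word) > 2
    ]


def extract_features(tweet, vocabulary=None):
    """Return {word: occurrence count}; restrict to vocabulary when one is given."""
    counts = Counter(tweet.split())
    if vocabulary is None:          # first variant: all words as features
        return dict(counts)
    return {word: n for word, n in counts.items() if word in vocabulary}


# Toy data for illustration only.
tweets = ["lalu lintas ramai lancar di jogja", "lalu lintas padat di simpang"]
vocabulary = set(select_features(tweets, excluded_words={"jogja"}))
print(sorted(vocabulary))
print(extract_features("lalu lintas ramai ramai", vocabulary))
```

Passing no vocabulary gives the all-words variant, so both feature sets share one code path.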
D. Classifier Model Building with Machine Learning
Algorithms
Tweets were classified into two categories, namely tweets
that were related to traffic (traffic) and tweets that were not
related to traffic (non_traffic). This classification was intended
to separate tweets about traffic from other tweets. Three
machine learning algorithms, namely Naïve Bayes (NB),
Support Vector Machine (SVM), and Decision Tree (DT), were
used in this research.
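To make the classification step concrete, the sketch below trains a small multinomial Naive Bayes model with add-one smoothing on word-count dictionaries. It is a pure-Python stand-in written for illustration; the paper does not state which implementation was used, and the class name, toy training data, and smoothing choice are assumptions.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Illustrative multinomial Naive Bayes over {word: count} dictionaries."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for features, label in zip(documents, labels):
            self.word_counts[label].update(features)
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, features):
        total_documents = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_documents in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_documents / total_documents)  # log prior
            for word, count in features.items():
                if word not in self.vocabulary:
                    continue  # words never seen in training carry no evidence
                # Add-one (Laplace) smoothing avoids zero probabilities.
                likelihood = (self.word_counts[label][word] + 1) / (
                    total_words + len(self.vocabulary)
                )
                score += count * math.log(likelihood)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data for illustration only.
train_documents = [
    {"lalu": 1, "lintas": 1, "padat": 1},
    {"arus": 1, "lalu": 1, "lintas": 1, "lancar": 1},
    {"tryout": 1, "sma": 1, "jogja": 1},
    {"konser": 1, "musik": 1, "jogja": 1},
]
train_labels = ["traffic", "traffic", "non_traffic", "non_traffic"]
model = TinyMultinomialNB().fit(train_documents, train_labels)
print(model.predict({"lalu": 1, "lintas": 1, "ramai": 1}))
```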
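Four measures were used to evaluate the performance [9] of
each machine learning algorithm. In the formulas, TP, TN, FP, and
FN denote the numbers of true positives, true negatives, false
positives, and false negatives, respectively.
1) Accuracy: Accuracy is the fraction of classification
results that are correct. The formula is
$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
2) Precision: Precision is the fraction of the documents
predicted to be in a class that are correct. The formula is
$\mathrm{Precision} = \frac{TP}{TP + FP}$
3) Recall: Recall is the fraction of documents in a class that
are correctly predicted by the system. The formula is
$\mathrm{Recall} = \frac{TP}{TP + FN}$
4) F-score: F-score is a weighted harmonic mean of
precision and recall. We use the balanced F-score with formula
$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
We measured precision, recall, and F-score only for the traffic
class. We calculated all evaluation measures through the
ten-fold cross-validation technique.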
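The four measures and the ten-fold split can be sketched as follows, with traffic treated as the positive class. The function names and the interleaved fold assignment are assumptions made for illustration; the paper does not describe how folds were drawn.

```python
def evaluate(y_true, y_pred, positive="traffic"):
    """Return (accuracy, precision, recall, F-score) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return accuracy, precision, recall, f_score


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices); each sample is tested exactly once."""
    for fold in range(k):
        test = list(range(fold, n_samples, k))   # every k-th sample
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test


y_true = ["traffic"] * 3 + ["non_traffic"] * 5
y_pred = ["traffic", "traffic", "non_traffic", "traffic"] + ["non_traffic"] * 4
print(evaluate(y_true, y_pred))
```

In practice the samples would be shuffled before splitting so that each fold contains both classes.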
IV. EXPERIMENTAL RESULTS
A. Tweets Data Acquisition
In the tweet data acquisition stage, we collected 110,449 tweets
in total. These data were used to build the classification
model. The 110,449 tweets consisted of 17,592 tweets in the
traffic class and 92,857 tweets in the non_traffic class.
B. Tweet Collection from Twitter Streaming API
TABLE III
TWEETS FROM TWITTER STREAMING API
Date and Time          Tweet Text
2016-03-15 11:31:36    UN 2016 : Tryout di SMA Muhammadiyah 3 Jogja Diikuti Ribuan Peserta https://t.co/4bMhW4xoow https://t.co/zwpfMm8A57
2016-04-30 09:55:44    09.55 wib lalin seputaran sp condongcatur ramai lancar https://t.co/HRwTeIzlyt
The Twitter Streaming API was the real-time data source for our
system. Through the Streaming API, Twitter sent a tweet object
in JavaScript Object Notation (JSON) form as soon as a
tweet matched our follow and track parameters. A tweet JSON
object contains many fields, but we only used the
created_at and text fields. Table III shows examples of tweets
received by our system.
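Reading the two fields from a received object might look like the sketch below. The sample payload is hypothetical: its text comes from Table III, while its created_at value was back-computed on the assumption that Table III shows Western Indonesian Time (UTC+7) and that the API reports UTC in its usual "Sat Apr 30 02:55:44 +0000 2016" style.

```python
import json
from datetime import datetime, timedelta, timezone

# Assumption: timestamps in Table III are in Western Indonesian Time (UTC+7).
WIB = timezone(timedelta(hours=7))


def parse_tweet(raw_json):
    """Return (local timestamp string, text) from one tweet JSON object."""
    tweet = json.loads(raw_json)
    created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    local = created.astimezone(WIB).strftime("%Y-%m-%d %H:%M:%S")
    return local, tweet["text"]


# Hypothetical payload; real tweet objects carry many more fields.
sample = json.dumps({
    "created_at": "Sat Apr 30 02:55:44 +0000 2016",
    "text": "09.55 wib lalin seputaran sp condongcatur ramai lancar "
            "https://t.co/HRwTeIzlyt",
})
timestamp, text = parse_tweet(sample)
print(timestamp, text)
```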
C. Preprocessing
The preprocessing stage prepared the tweet text before it was
processed in the next stages, following the preprocessing
steps explained in the previous section. An example of the
preprocessing result is shown in Table IV.
TABLE IV
PREPROCESSING RESULT
Original Tweet:        UN 2016 : Tryout di SMA Muhammadiyah 3 Jogja Diikuti Ribuan Peserta https://t.co/4bMhW4xoow https://t.co/zwpfMm8A57
After Preprocessing:   un 2016 tryout di sma muhammadiyah 3 jogja diikuti ribuan peserta

Original Tweet:        09.55 wib lalin seputaran sp condongcatur ramai lancar https://t.co/HRwTeIzlyt
After Preprocessing:   09 55 wib lalu lintas seputaran simpang condongcatur ramai lancar
D. Feature Extraction
The feature extraction process counted the occurrences of all the
words in a tweet as features. The dictionary that contained the
words and their occurrences was used to train the classifier. A
dictionary is a set of data in key-value form. The keys and the
values were the words and their occurrence counts, respectively.
Feature extraction using only a few selected words was
preceded by a feature selection process. This feature selection
process yielded 40 words as features, as shown in Table V.
TABLE V
LIST OF FEATURES
Bahasa Indonesia   English                                 Bahasa Indonesia   English
antrian            queue                                   maupun             although
arah               direction                               mengarah           directing
arus               flow                                    menuju             heading
atau               or                                      pada               on
barat              west                                    padat              congested
cerah              sunny                                   patuhi             obey
cuaca              weather                                 pukul              o'clock
dalam              in                                      ramai              crowded
dan                and                                     rambu              sign
dari               from                                    sebaliknya         opposite
informasikan       inform                                  selatan            south
jalan              road/street                             semua              all
kaki               foot                                    seputaran          around
kami               we                                      simpang            intersection
kendaraan          vehicle                                 situasi            situation
kondisi            condition                               terpantau          observed
kota               city                                    tetap              still
lalu               part of phrase “traffic” in Bahasa      timur              east
lancar             smooth                                  utara              north
lintas             part of phrase “traffic” in Bahasa      wib                Western Indonesian Time
E. Development of Classifier Model using Machine Learning
Algorithms
The data used in this research comprised 110,449
tweets, consisting of 17,592 traffic tweets and 92,857 non_traffic
tweets. The dataset was therefore imbalanced between classes, so the
evaluation of the classification model was conducted
on both the imbalanced dataset and a balanced dataset. The
evaluation on the imbalanced dataset used all 110,449 tweets.
The evaluation on the balanced dataset used 35,184
tweets, consisting of 17,592 traffic tweets and 17,592 non_traffic
tweets. The 17,592 non_traffic tweets were selected randomly
from the 92,857 non_traffic tweets.
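The balanced dataset described above corresponds to random undersampling of the majority class, which can be sketched as follows. The function name, the fixed seed, and the toy data are assumptions; the paper only states that the non_traffic tweets were selected randomly.

```python
import random


def undersample(tweets, labels, minority="traffic", seed=0):
    """Keep all minority tweets plus an equal-sized random sample of the rest."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    minority_idx = [i for i, label in enumerate(labels) if label == minority]
    majority_idx = [i for i, label in enumerate(labels) if label != minority]
    kept = rng.sample(majority_idx, len(minority_idx))
    selected = sorted(minority_idx + kept)
    return [tweets[i] for i in selected], [labels[i] for i in selected]


# Toy data for illustration only: 3 traffic and 10 non_traffic tweets.
toy_labels = ["traffic"] * 3 + ["non_traffic"] * 10
toy_tweets = [f"tweet {i}" for i in range(len(toy_labels))]
balanced_tweets, balanced_labels = undersample(toy_tweets, toy_labels)
print(len(balanced_tweets), balanced_labels.count("traffic"))
```

Applied to the paper's data, this keeps 17,592 tweets of each class, giving the 35,184 tweets of the balanced dataset.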
TABLE VI
EVALUATION MEASUREMENT OF BALANCED DATASET (35,184 TWEETS)
Feature          Model   Accuracy   Precision   Recall    F-score
All words        NB      99.37%     99.10%      99.62%    99.36%
All words        SVM     99.77%     99.65%      99.89%    99.77%
All words        DT      99.48%     99.44%      99.52%    99.48%
Selected words   NB      98.02%     96.32%      99.71%    97.99%
Selected words   SVM     98.31%     97.23%      99.37%    98.29%
Selected words   DT      98.41%     97.52%      99.28%    98.39%
Table VI shows that for the balanced dataset with 35,184 tweets
and all words as features, SVM yielded the best performance on
all four measures, with an accuracy of 99.77%.
However, with only selected words as features, DT yielded
the best accuracy, precision, and F-score.
TABLE VII
EVALUATION MEASUREMENT OF IMBALANCED DATASET (110,449 TWEETS)
Feature          Model   Accuracy   Precision   Recall    F-score
All words        NB      99.76%     98.94%      99.52%    99.23%
All words        SVM     99.87%     99.41%      99.80%    99.60%
All words        DT      99.70%     98.76%      99.34%    99.05%
Selected words   NB      99.23%     95.74%      99.39%    97.53%
Selected words   SVM     99.23%     96.43%      98.68%    97.54%
Selected words   DT      99.42%     96.75%      99.57%    98.14%
For the imbalanced dataset with 110,449 tweets and all words
as features, SVM yielded the best performance on all
measures, as shown in Table VII, while DT produced the best
accuracy with only selected words as features. Comparing Table VI
and Table VII shows that the amount of data affected
classification performance: with more data, accuracy increased
for every classifier and feature set, although precision and
F-score on the traffic class decreased slightly. Furthermore, the
size of the change depends on the algorithm. For instance, SVM has
been found to be quite sensitive to imbalanced datasets [10]. The
advantage of SVM with all words as features is consistent with SVM
not needing feature selection to improve accuracy [11]. In contrast,
feature selection affected DT performance, since too many and too
specific features produce unneeded tree branches that cause
overfitting [12].
TABLE VIII
TRAINING TIME OF MODELS
Feature          Model   Training time, 35,184 tweets (seconds)   Training time, 110,449 tweets (seconds)
All words        NB      1.068                                    2.129
All words        SVM     1.510                                    4.011
All words        DT      3.660                                    18.560
Selected words   NB      1.335                                    4.642
Selected words   SVM     2.332                                    7.483
Selected words   DT      1.372                                    4.793
We evaluated training time as another aspect of the classification
models. The training time displayed in Table VIII is the average
training time over the ten folds of cross-validation. As shown in Table
VIII, NB had the fastest training time because of its simple model
building. Moreover, DT training time was greatly influenced by
the amount of data and the number of features.
After evaluating the three machine learning algorithms, the
imbalanced dataset with 110,449 tweets and feature extraction
using all words as features were used in the application. This
dataset was selected because it produced higher accuracy than the
balanced dataset with 35,184 tweets. Moreover, the real data fetched from
the Twitter Streaming API are highly imbalanced between the traffic and
non_traffic categories. Extraction of all words as features was
used because it produced better accuracy than using only
selected words as features.
V. CONCLUSION
This research aimed to develop real-time traffic tweet classification
for Yogyakarta Province (Indonesia). We evaluated
three machine learning algorithms to find the best algorithm for
classifying tweet data in real time. On both the imbalanced and balanced
datasets, the Support Vector Machine (SVM) algorithm produced the
best performance using all words as features, while the Decision
Tree (DT) algorithm yielded the best performance using only
selected words as features. Experimental results show that the SVM
algorithm produced the best performance among the three
algorithms, with 99.77% and 99.87% classification accuracy
on balanced and imbalanced data, respectively. We also found
that the feature selection algorithm used in this research did not
improve accuracy. Furthermore, feature selection and the
amount of data affected the performance of the classification model.
Further research is needed to investigate an appropriate approach
for better classification regardless of the amount of mined Twitter
data.
REFERENCES
[1] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes Twitter users:
real-time event detection by social sensors,” in Proceedings of the 19th
international conference on World wide web, 2010, pp. 851–860.
[2] E. D’Andrea, P. Ducange, B. Lazzerini, and F. Marcelloni, “Real-Time
Detection of Traffic From Twitter Stream Analysis,” IEEE Trans. Intell.
Transp. Syst., vol. 16, no. 4, pp. 2269–2283, Aug. 2015.
[3] T. Sakaki, Y. Matsuo, T. Yanagihara, N. P. Chandrasiri, and K. Nawa,
“Real-time event extraction for driving information from social sensors,”
in 2012 IEEE International Conference on Cyber Technology in
Automation, Control, and Intelligent Systems (CYBER), 2012, pp. 221–
226.
[4] Y. Gu, Z. (Sean) Qian, and F. Chen, “From Twitter to detector: Real-time
traffic incident detection using social media data,” Transp. Res. Part C
Emerg. Technol., vol. 67, pp. 321–342, Jun. 2016.
[5] C. Gutiérrez, P. Figueiras, P. Oliveira, R. Costa, and R.
Jardim-Goncalves, “Twitter mining for traffic events detection,” in
Science and Information Conference (SAI), 2015, pp. 371–378.
[6] A. Wibisono, I. Sina, M. A. Ihsannuddin, A. Hafizh, B. Hardjono, A.
Nurhadiyatna, W. Jatmiko, and P. Mursanto, “Traffic intelligent system
architecture based on social media information,” in 2012 International
Conference on Advanced Computer Science and Information Systems
(ICACSIS), 2012, pp. 25–30.
[7] S. F. Rodiyansyah and E. Winarko, “Klasifikasi Posting Twitter
Kemacetan Lalu Lintas Kota Bandung Menggunakan Naive Bayesian
Classification,” IJCCS-Indones. J. Comput. Cybern. Syst., vol. 7, no. 1,
pp. 13–22, 2013.
[8] N. Monarizqa, L. E. Nugroho, and B. S. Hantono, “Penerapan Analisis
Sentimen pada Twitter Berbahasa Indonesia sebagai Pemberi Rating,”
Universitas Gadjah Mada, Perpustakaan Pusat UGM, 2014.
[9] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to
Information Retrieval. Cambridge: Cambridge University Press,
2008.
[10] R. Batuwita and V. Palade, “Class Imbalance Learning Methods for
Support Vector Machines,” in “Imbalanced Learning: Foundations,
Algorithms, and Applications”, Haibo He and Yunqian Ma (Eds.), Wiley,
(book chapter), 2013.
[11] T. Joachims, “Text Categorization with Support Vector Machines:
Learning with Many Relevant Features,” in European Conference on
Machine Learning (ECML), Berlin, 1998, pp. 137–142.
[12] R. Garreta and G. Moncecchi, Learning scikit-learn: machine learning in
Python : experience the benefits of machine learning techniques by
applying them to real-world problems using Python and the open source
scikit-learn library. 2013.
2016 8th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia

Kurniawan2016

  • 1.
    Real-time Traffic Classificationwith Twitter Data Mining Dwi Aji Kurniawan1 , Sunu Wibirama2 , Noor Akhmad Setiawan3 Department of Electrical Engineering and Information Technology Universitas Gadjah Mada Indonesia 1 dwi.aji.k@mail.ugm.ac.id, 2 sunu@ugm.ac.id, 3 noorwewe@ugm.ac.id Abstract— The growth of vehicles in Yogyakarta Province, Indonesia is not proportional to the growth of roads. This problem causes severe traffic jam in many main roads. Common traffic anomalies detection using surveillance camera requires manpower and costly, while traffic anomalies detection with crowdsourcing mobile applications are mostly owned by private. This research aims to develop a real-time traffic classification by harnessing the power of social network data, Twitter. In this study, Twitter data are processed to the stages of preprocessing, feature extraction, and tweet classification. This study compares classification performance of three machine learning algorithms, namely Naive Bayes (NB), Support Vector Machine (SVM), and Decision Tree (DT). Experimental results show that SVM algorithm produced the best performance among the other algorithms with 99.77% and 99.87% of classification accuracy in balanced and imbalanced data, respectively. This research implies that social network service may be used as an alternative source for traffic anomalies detection by providing information of traffic flow condition in real-time. Keywords— traffic, data mining in Twitter, social network, tweet classification, machine learning. I. INTRODUCTION The growth of vehicles in big cities is not proportional to the growth of roads. Sooner or later, roads in big cities will be increasingly jammed. Installation of surveillance cameras in some streets and intersections has been common approach of real-time traffic anomalies detection. Nevertheless, this approach requires manpower to observe the cameras and to locate spatial position of the traffic information. 
On the other side, location-based crowdsourcing technology such as Waze (https://www.waze.com) is currently used as driver’s companion for route finding. However, Waze is a proprietary service, thus the authorities may find it difficult to get access to the data. Social network service has been used to detect traffic anomalies and events. An approach developed by Sakaki et al. [1] shows that Twitter detect an event faster than traditional media. D'Andrea et al. [2] compared the performance of seven classification algorithms to classify Italian tweets. Sakaki et al. [3] proposed four stages to detect locations in Japanese tweets. Gu et al. [4] classify tweets about traffic in the city of Pittsburgh and Philadelphia (USA) using the Semi Naive Bayes (SNB) and Supervised Latent Dirichlet Allocation (sLDA). Gutiérrez et al. [5] classify English tweets using Support Vector Machine (SVM) classifier in RapidMiner software. These studies used local language as their source of information. Moreover, these studies focused more on traffic events such as accidents, road work, snow, road closures, but did not focus on the state of traffic flow, such as traffic jams, crowded, crowded, crowded smooth, and smooth. There were also some studies in Indonesia related to the use of social networking as traffic conditions monitoring. Research by Wibisono et al. [6] in Jakarta used the concept of Learning Vector Quantization (LVQ) neural network to classify tweets into three classes: low traffic flow, medium traffic flow, and high traffic flow. The system developed by Wibisono et al. [6] used tweet from the official account of traffic officers as a data source. Another study in Bandung by Rodiyansyah and Winarko [7] classified four classes (Loss, Current, Unknown, and Model) on traffic tweets using Naive Bayes and Support Vector Machine (SVM) algorithm using RapidMiner software. 
However, those previous work [6, 7] did not classify Twitter data from all user (regular and official) in real-time. In this research, we propose a novel real-time traffic classification by classifying Twitter data into traffic or non_traffic category. Classification was validated using ten folds cross validation to measure accuracy, precision, recall, and F-score of the classifiers and dataset. II. DATA ACQUISITION Tweets dataset about Yogyakarta Province, Indonesia were used in this research to build classification model. We categorized tweets into traffic and non_traffic. The data were collected consecutively in seven days. In the first stage, traffic tweets were collected from seven official traffic monitoring Twitter accounts, namely @ATCS_DIY, @atcs_kotasmrg, @atcs_kotatgr, @atcs_pekalongan, @ntmclantaspolri, @tmcpoldametro, and @tmcpolressemara. The data were collected using Twitter REST API. Tweets from those accounts were cleaned from non_traffic tweets, thus only tweets related with traffic condition were considered. In the second stage, tweets were collected using Twitter Streaming API from selected user in Table I and 2016 8th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia 978-1-5090-4139-8/16/$31.00 ©2016 IEEE
  • 2.
    Table II. Welabel the tweets with traffic and non_traffic manually. III. PROPOSED METHOD The flowchart of the proposed method is shown in Fig. 1. Fig. 1. Flowchart of the proposed method. A. Tweet Collection from Twitter Streaming API New tweets were collected from Twitter Streaming API in real time. There were some parameters that we used in Twitter Streaming API, such as follow and track parameters. Follow parameter was used to get new tweets in real time from several accounts, as shown in Table I. Track parameter was used to get new tweets in real time based on keywords defined in Table II. TABLE I TWITTER USERNAMES AND IDS USED IN FOLLOW PARAMETER Twitter Username Twitter User ID @lalinjogja 250022672 @RTMC_Jogja 187397386 @ATCS_DIY 1118238337 @twit_macet 4675666764 @JogjaUpdate 128175561 @Jogja24Jam 537556372 @infojogja 106780531 @YogyakartaCity 62327666 @JogjaMedia 454564576 @tribunjogja 223476605 @unisifmyk 201720189 TABLE II KEYWORDS USED IN TRACK PARAMETER Yogyakarta Jogjakarta Jogja Yogya Adisutjipto Adi Sutjipto lalinjogja RTMC_Jogja ATCS_DIY jogjaupdate jogja24jam infojogja yogyakartacity jogjamedia tribunjogja unisifmyk UGM UII UNY UMY lalinyk B. Preprocessing Preprocessing stage was applied to tweets to clean some parts of tweets that were not needed in the next stages [8]. The preprocessing steps in this study were as follows: a) Removing the "RT". At this step, we used regular expression "RT s" to find the appearances of "RT". b) Converting all letters in a tweet to lowercase. c) Removing website address in the tweet. At this step, we use regular expression "shttp.+s". d) Removing Twitter username. At this step, we used regular expression "@[a-zA-Z0-9_]+". e) Removing characters non-alphanumeric (alphabets and numbers) characters. At this stage, we used the regular expression "[^a-zA-Z0-9]". f) Changing abbreviations to their actual phrases. We changed abbreviations that frequently appeared in tweets. C. 
Feature Extraction This research used two types of feature extraction. The first one was by using all words in the dataset as features. The second one was by using only few selected words as features. Features were selected by their appearance in the dataset. We selected words that appear most frequently in the traffic tweets dataset [9]. The steps were explained as follows: a) Processing traffic tweets dataset with preprocessing steps. b) Analyzing words appearance from the dataset. c) Sorting by words that appeared most frequently. d) Taking 50 words that appeared most frequently. e) Removing unneeded words, such as person name, place name, etc. f) Removing words that had two letters or less. A dictionary contained the words and their appearance count in a tweet was used as classifier. D. Classifier Model Building with Machine Learning Algorithms Tweets were classified into two categories, namely tweets that were related to traffic (traffic) and tweets that were not related to traffic (non_traffic). This classification was intended to separated tweets about traffic from another tweets. Three machine learning algorithms, there are Naïve Bayes (NB), 2016 8th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia
  • 3.
    Support Vector Machine(SVM), and Decision Tree (DT) were used in this research. Four parameters were used to evaluate performance [9] of each machine learning algorithm: 1) Accuracy: Accuracy is the fraction of the classifications result that are correct. The formula is 2) Precision: Precision is the fraction of the predicted documents in a class that are correct. The formula is 3) Recall: Recall is the fraction of documents in a class that correctly predicted by the system. The formula is 4) F-score: F-score is a weighted harmonic mean of precision and recall. We use balanced F-score with formula We only measured precision, recall, and F-score in traffic class. We calculated all parameters of evaluation through ten folds cross validation technique. IV. EXPERIMENTAL RESULTS A. Tweets Data Acquisition In tweet data acquisition stage, we collected 110,449 tweets data in total. This data were used in building classification model. The 110,449 tweets data consisted of 17,592 tweets in traffic class and 92,857 tweets in non_traffic class. B. Tweet Collection from Twitter Streaming API TABLE III TWEETS FROM TWITTER STREAMING API Date and Time Tweet Text 2016-03-15 11:31:36 UN 2016 : Tryout di SMA Muhammadiyah 3 Jogja Diikuti Ribuan Peserta https://t.co/4bMhW4xoow https://t.co/zwpfMm8A57 2016-04-30 09:55:44 09.55 wib lalin seputaran sp condongcatur ramai lancar https://t.co/HRwTeIzlyt Twitter Streaming API was a real time data source for our system. With Twitter Streaming API, Twitter sent tweet objects in form of JavaScript Object Notation (JSON) once there was a tweet match with our follow and track parameters. There were many variables in a tweet JSON object, but we only used created_at and text variable. Table III shows example of tweets received by our system. C. Preprocessing Preprocessing stage was used to prepare tweet text before processed in the next stages. There were some preprocessing steps as explained in the previous section. 
The example of preprocessing result is shown in Table IV. TABLE IV PREPROCESSING RESULT Original Tweet After Preprocessing UN 2016 : Tryout di SMA Muhammadiyah 3 Jogja Diikuti Ribuan Peserta https://t.co/4bMhW4xoow https://t.co/zwpfMm8A57 un 2016 tryout di sma muhammadiyah 3 jogja diikuti ribuan peserta 09.55 wib lalin seputaran sp condongcatur ramai lancar https://t.co/HRwTeIzlyt 09 55 wib lalu lintas seputaran simpang condongcatur ramai lancar D. Feature Extraction Feature extraction process counted the occurrences all the words in a tweet as features. The dictionary that contained the words and their occurrences was used to train classifier. A dictionary was a set of data in key-value form. The key and the value were the words and their occurrences, respectively. Feature extraction by using only few selected words was then preceded with feature selection process. In feature selection process, we got 40 words as features as shown in Table V. TABLE V LIST OF FEATURES Bahasa Indonesia English Bahasa Indonesia English antrian queue maupun although arah direction mengarah directing arus flow menuju heading atau or pada on barat west padat congested cerah sunny patuhi obey cuaca weather pukul o'clock dalam in ramai crowded dan and rambu sign dari from sebaliknya opposite informasikan inform selatan south jalan road/street semua all kaki foot seputaran around kami we simpang intersection kendaraan vehicle situasi situation kondisi condition terpantau observed kota city tetap still lalu part of phrase “traffic” in Bahasa timur east lancar smooth utara north lintas part of phrase “traffic” in Bahasa wib Western Indonesian Time E. Development of Classifier Model using Machine Learning Algorithms The amount of data used in this research were 110,449 tweets consisted of 17,592 traffic tweets and 92,857 non_traffic tweets. The dataset was imbalanced between classes. 
Because of this imbalance, the classification model was evaluated on both the imbalanced dataset and a balanced dataset. Evaluation on the imbalanced dataset used all 110,449 tweets. Evaluation on the balanced dataset used 35,184 tweets, consisting of 17,592 traffic tweets and 17,592 non_traffic tweets. The 17,592 non_traffic tweets were selected randomly from the 92,857 non_traffic tweets.

TABLE VI
EVALUATION MEASUREMENT OF BALANCED DATASET (35,184 TWEETS)

| Feature | Model | Accuracy | Precision | Recall | F-score |
|---|---|---|---|---|---|
| All words | NB | 99.37% | 99.10% | 99.62% | 99.36% |
| All words | SVM | 99.77% | 99.65% | 99.89% | 99.77% |
| All words | DT | 99.48% | 99.44% | 99.52% | 99.48% |
| Selected words | NB | 98.02% | 96.32% | 99.71% | 97.99% |
| Selected words | SVM | 98.31% | 97.23% | 99.37% | 98.29% |
| Selected words | DT | 98.41% | 97.52% | 99.28% | 98.39% |

Table VI shows that for the balanced dataset with 35,184 tweets and all words as features, SVM yielded the best performance on all four measures, with an accuracy of 99.77%. However, with only selected words as features, DT yielded the best accuracy, precision, and F-score.

TABLE VII
EVALUATION MEASUREMENT OF IMBALANCED DATASET (110,449 TWEETS)

| Feature | Model | Accuracy | Precision | Recall | F-score |
|---|---|---|---|---|---|
| All words | NB | 99.76% | 98.94% | 99.52% | 99.23% |
| All words | SVM | 99.87% | 99.41% | 99.80% | 99.60% |
| All words | DT | 99.70% | 98.76% | 99.34% | 99.05% |
| Selected words | NB | 99.23% | 95.74% | 99.39% | 97.53% |
| Selected words | SVM | 99.23% | 96.43% | 98.68% | 97.54% |
| Selected words | DT | 99.42% | 96.75% | 99.57% | 98.14% |

For the imbalanced dataset with 110,449 tweets and all words as features, SVM again yielded the best performance on all measures, as shown in Table VII, while DT produced the best accuracy with only selected words as features.

From Table VI and Table VII, we can see that the amount of data affected classification performance. More data produced higher accuracy for every model and feature set, although precision and F-score were slightly lower on the imbalanced dataset. Furthermore, the improvement depended on the algorithm used. For instance, SVM is reported to be quite sensitive to imbalanced datasets [10]. SVM performed best with all words as features because it does not need feature selection to improve accuracy [11]. In contrast, feature selection affected DT performance, since too many and overly specific features produce unneeded tree branches that cause overfitting [12].
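For reference, the four measures reported in Tables VI and VII can be computed from confusion-matrix counts as in the sketch below. The standard definitions are assumed, with traffic taken as the positive class; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F-score) from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score


# Hypothetical counts, for illustration only (not from the paper).
acc, prec, rec, f1 = classification_metrics(tp=90, fp=10, fn=5, tn=895)

# Consistency check with reported values:
# SVM, all words, balanced dataset (Table VI).
p, r = 0.9965, 0.9989
f_from_table = 2 * p * r / (p + r)
print(f"{f_from_table:.2%}")  # prints 99.77%
```

Applying the F-score formula to the precision and recall reported for SVM in Table VI (99.65% and 99.89%) reproduces the reported F-score of 99.77%.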
We evaluated training time as another aspect of the classification model. The training time shown in Table VIII is the average training time over ten-fold cross-validation.

TABLE VIII
TRAINING TIME OF MODELS

| Feature | Model | Training time, 35,184 tweets (seconds) | Training time, 110,449 tweets (seconds) |
|---|---|---|---|
| All words | NB | 1.068 | 2.129 |
| All words | SVM | 1.510 | 4.011 |
| All words | DT | 3.660 | 18.560 |
| Selected words | NB | 1.335 | 4.642 |
| Selected words | SVM | 2.332 | 7.483 |
| Selected words | DT | 1.372 | 4.793 |

As shown in Table VIII, NB had the fastest training time because its model is simple to build. Moreover, DT training time was greatly influenced by the amount of data and the number of features: with all words as features, it rose from 3.660 s on 35,184 tweets to 18.560 s on 110,449 tweets.

After evaluating the three machine learning algorithms, the imbalanced dataset with 110,449 tweets and feature extraction using all words as features was used in the application. This dataset was selected because it produced better accuracy than the balanced dataset with 35,184 tweets. Moreover, the real data fetched from the Twitter Streaming API is highly imbalanced between the traffic and non_traffic categories. Extraction of all words as features was used because it produced better accuracy than using only selected words as features.

V. CONCLUSION

This research aimed to develop real-time classification of traffic tweets for Yogyakarta Province, Indonesia. We evaluated three machine learning algorithms to find the best algorithm for classifying tweet data in real time. For both the imbalanced and balanced datasets, the Support Vector Machine (SVM) algorithm produced the best performance using all words as features, while the Decision Tree (DT) algorithm yielded the best performance using only selected words as features. Experimental results show that SVM produced the best performance among the three algorithms, with 99.77% and 99.87% classification accuracy on balanced and imbalanced data, respectively. We also found that the feature selection algorithm used in this research did not improve accuracy. Furthermore, feature selection and the amount of data affected the performance of the classification model.
Further research is needed to investigate an appropriate approach for better classification regardless of the amount of mined Twitter data.

REFERENCES

[1] T. Sakaki, M. Okazaki, and Y. Matsuo, “Earthquake shakes Twitter users: real-time event detection by social sensors,” in Proceedings of the 19th international conference on World wide web, 2010, pp. 851–860.
[2] E. D’Andrea, P. Ducange, B. Lazzerini, and F. Marcelloni, “Real-Time Detection of Traffic From Twitter Stream Analysis,” IEEE Trans. Intell. Transp. Syst., vol. 16, no. 4, pp. 2269–2283, Aug. 2015.
[3] T. Sakaki, Y. Matsuo, T. Yanagihara, N. P. Chandrasiri, and K. Nawa, “Real-time event extraction for driving information from social sensors,” in 2012 IEEE International Conference on Cyber Technology in Automation, Control, and Intelligent Systems (CYBER), 2012, pp. 221–226.
[4] Y. Gu, Z. (Sean) Qian, and F. Chen, “From Twitter to detector: Real-time traffic incident detection using social media data,” Transp. Res. Part C Emerg. Technol., vol. 67, pp. 321–342, Jun. 2016.
[5] C. Gutiérrez, P. Figuerias, P. Oliveira, R. Costa, and R. Jardim-Goncalves, “Twitter mining for traffic events detection,” in Science and Information Conference (SAI), 2015, pp. 371–378.
[6] A. Wibisono, I. Sina, M. A. Ihsannuddin, A. Hafizh, B. Hardjono, A. Nurhadiyatna, W. Jatmiko, and P. Mursanto, “Traffic intelligent system architecture based on social media information,” in 2012 International Conference on Advanced Computer Science and Information Systems (ICACSIS), 2012, pp. 25–30.
[7] S. F. Rodiyansyah and E. Winarko, “Klasifikasi Posting Twitter Kemacetan Lalu Lintas Kota Bandung Menggunakan Naive Bayesian Classification,” IJCCS-Indones. J. Comput. Cybern. Syst., vol. 7, no. 1, pp. 13–22, 2013.
[8] N. Monarizqa, L. E. Nugroho, and B. S. Hantono, “Penerapan Analisis Sentimen pada Twitter Berbahasa Indonesia sebagai Pemberi Rating,” Universitas Gadjah Mada, Perpustakaan Pusat UGM, 2014.
[9] C. D. Manning, P. Raghavan, H. Schütze, and others, Introduction to information retrieval, vol. 1. Cambridge: Cambridge University Press, 2008.
[10] R. Batuwita and V. Palade, “Class Imbalance Learning Methods for Support Vector Machines,” in Imbalanced Learning: Foundations, Algorithms, and Applications, H. He and Y. Ma, Eds. Wiley, 2013.
[11] T. Joachims, “Text Categorization with Support Vector Machines: Learning with Many Relevant Features,” in European Conference on Machine Learning (ECML), Berlin, 1998, pp. 137–142.
[12] R. Garreta and G. Moncecchi, Learning scikit-learn: machine learning in Python: experience the benefits of machine learning techniques by applying them to real-world problems using Python and the open source scikit-learn library. 2013.

2016 8th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia