Indian cities ranking based on Twitter feeds using advanced analytics
INTRODUCTION
With the development of new smart technologies, the world is going digital. The
growing reach of the web and the large amount of electronic data piling up
across it have prompted the exploration of hidden information in its
text content. Finding precise and relevant information and extracting it
from the web has become a time-consuming task. Many techniques are
used for web information extraction, and text mining is one of them.
Twitter is one of the best-known social platforms, with 316 M users worldwide and 500
M tweets sent per day. In India, Twitter has 22.2 M users (source:
http://www.huffingtonpost.in). With such a vast amount of data available, Twitter
has been used as a source of unstructured data for a variety of data analytics
insights.
Every year, well-known magazines publish a "most livable cities in the world" list, and each city wants
to be the most livable in order to attract business and investment and to boost its local economy and real
estate market. Here I focus on text mining techniques, the k-means algorithm and
classification to identify livable cities in India by categorizing tweets and their sentiments using
different criteria, which include social and economic circumstances for residents, public
health, infrastructure, and the ease and availability of local transport.
WHAT ARE THE PRIMARY OBJECTIVE, SCOPE AND LIMITATIONS?
The research question: ‘How livable is a city, based on the comments and
views on Twitter, with the use of text mining?’
Objective: to provide a dynamic algorithm which can label the Twitter
feeds and reduce the complexity of ranking a city in different categories based
on Twitter views.
Scope and limitations:
The scope of this project is Twitter views on Indian cities.
The limitation of the dataset is that the feeds are from a single day,
‘25/08/2015’.
THE FLOW: THE SEMMA APPROACH
Stages: 1. SAMPLE, 2. EXPLORE, 3. MODIFY, 4. MODEL, 5. ASSESS
Steps shown in the flow diagram:
Extract of Twitter feeds
.json file format
Converted to .csv
Removal of duplicate texts
Language: English
Pareto of top cities
Loading the corpus
Tokenization
Stop-word removal
DTM creation
Term analysis
K-means clustering
Labelling based on the clusters
Classification of tweets based on labels
Result
Results exploration
City ranking
Results
Conclusions
What are the Data Attributes?
The dataset obtained after conversion from .json to .csv file format has 93762 records with 24 attributes related to each
Twitter feed.
From the list of 24 attributes, 11 were selected to proceed with the project.
S.no  Attribute Name      S.no  Attribute Name
1     links               13    user_name
2     text                14    sentiment_type
3     topics              15    reach
4     application_rating  16    user_city
5     application_store   17    user_language
6     created_time        18    device
7     city                19    application_version
8     user_id             20    keyword
9     sentiment           21    language
10    application         22    country
11    engagement          23    uri
12    source              24    user_country
S.no  Attributes Selected  S.no  Attributes Selected
1     text                 6     keyword
2     created_time         7     language
3     sentiment            8     application
4     sentiment_type       9     user_country
5     reach                10    device
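The conversion and clean-up steps can be illustrated with a minimal sketch. It is not the original pipeline: it assumes one JSON object per line, an English language code of "en", and duplicate detection on case- and whitespace-normalised text, none of which is stated in the slides. The column names are the selected attributes listed above; the sample records are made up.

```python
import csv
import io
import json

# Selected attributes as listed in the table above.
SELECTED = ["text", "created_time", "sentiment", "sentiment_type", "reach",
            "keyword", "language", "application", "user_country", "device"]


def json_to_rows(json_lines, english_code="en"):
    """Parse one JSON object per line, keep English feeds, drop duplicate texts."""
    seen = set()
    rows = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: the language attribute holds a code such as "en".
        if record.get("language") != english_code:
            continue
        # Normalise case and whitespace before checking for duplicates.
        key = " ".join(str(record.get("text", "")).lower().split())
        if not key or key in seen:
            continue
        seen.add(key)
        rows.append({name: record.get(name, "") for name in SELECTED})
    return rows


def rows_to_csv(rows):
    """Serialise the kept rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=SELECTED)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample feed: one duplicate text and one non-English tweet.
sample = [
    '{"text": "Traffic in Pune is bad", "language": "en", "reach": 10}',
    '{"text": "traffic in  Pune is bad", "language": "en", "reach": 3}',
    '{"text": "Namaste", "language": "hi", "reach": 5}',
]
rows = json_to_rows(sample)
csv_text = rows_to_csv(rows)
```

Only the first of the two near-identical English tweets is kept, so the duplicate and the non-English tweet never reach the CSV.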
What did the Explore Stage Yield?
After cleaning up the data, a Pareto analysis was applied.
The top 30 cities contributed
95% of the volume.
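A Pareto cut of this kind can be sketched as follows. This is an illustration only: the function name `pareto_top`, the 95% threshold parameter and the sample counts are assumptions, and the slides do not say how the cut was implemented.

```python
from collections import Counter


def pareto_top(city_per_tweet, threshold=0.95):
    """Return the most frequent cities that together reach the volume threshold."""
    counts = Counter(city_per_tweet)
    total = sum(counts.values())
    selected = []
    cumulative = 0
    for city, count in counts.most_common():
        # Stop once the cities already selected cover the requested share.
        if total and cumulative / total >= threshold:
            break
        selected.append((city, count))
        cumulative += count
    return selected


# Made-up city column: 100 tweets spread over five cities.
cities = (["New Delhi"] * 50 + ["Mumbai"] * 30 + ["Bangalore"] * 16
          + ["Salem"] * 2 + ["Ooty"] * 2)
top = pareto_top(cities)
top_names = [city for city, _count in top]
```

With these counts the first three cities cover 96 of 100 tweets, so the two smallest cities are left out.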
Further exploration
(after data processing)
found the top terms
with their frequencies.
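The tokenization, stop-word removal, DTM creation and term-frequency steps can be sketched in a few lines. This is a minimal illustration using only the Python standard library. The tiny stop-word list, the regular-expression tokenizer and the sample tweets are assumptions, not the tooling or data used in the project.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the project).
STOP_WORDS = {"the", "is", "in", "a", "an", "of", "and", "to", "are", "for"}


def tokenize(text):
    """Lower-case the text, split it into alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_dtm(documents):
    """Build a document-term matrix: one row per document, one column per term."""
    tokenized = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*tokenized)) if tokenized else []
    matrix = [[counts[term] for term in vocabulary] for counts in tokenized]
    return vocabulary, matrix


def term_frequencies(vocabulary, matrix):
    """Sum each column of the DTM and sort terms by descending frequency."""
    totals = [sum(column) for column in zip(*matrix)]
    return sorted(zip(vocabulary, totals), key=lambda pair: (-pair[1], pair[0]))


# Made-up tweets.
tweets = [
    "Traffic in Mumbai is terrible",
    "Mumbai metro eases traffic",
    "Jobs are growing in Bangalore",
]
vocabulary, dtm = build_dtm(tweets)
top_terms = term_frequencies(vocabulary, dtm)
```

Each column of the matrix corresponds to one vocabulary term, so the column sums are the corpus-wide term frequencies.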
Model Phase Results
There are various methods of clustering, and K-means is one of the most efficient.
Given a set of n data points, the K-means algorithm partitions them into k clusters, each characterized by a unique centroid (the mean of its members).
The elements belonging to one cluster are close to the centroid of that
cluster and dissimilar to the elements belonging to the other clusters.
Clustering was done to identify clusters of terms, which in turn enabled us to label the data.
With this K-means clustering exercise
using Euclidean distance, 25 clusters
(best fit) were obtained.
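The assignment and centroid-update loop described above can be sketched as follows. This is a toy illustration, not the project's implementation: it uses a deterministic initialisation from the first k points (an assumption made for reproducibility), k = 2 instead of 25, and made-up term vectors standing in for the rows of a term-document matrix.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iter=100):
    """Plain K-means: assign each point to its nearest centroid, then update centroids."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                # New centroid is the mean of the cluster members, per dimension.
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Made-up term vectors (term counts across four documents).
term_vectors = {
    "traffic": [2, 1, 0, 0],
    "job": [0, 0, 2, 1],
    "road": [1, 2, 0, 0],
    "salary": [0, 0, 1, 2],
}
assignment, centroids = kmeans(list(term_vectors.values()), k=2)
clusters = dict(zip(term_vectors, assignment))
```

Terms that occur in the same documents end up in the same cluster, which is what makes a manual label per cluster possible.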
The 25 clusters, each with its list of
terms, are extracted to a .csv file.
The terms in each cluster are reviewed
manually and a label is given to each
cluster.
With this step, 6 labels for the clusters
are identified.
Once the labels are assigned to each cluster, all the labels (with their terms) are separated into individual text
documents for the purpose of classification. A number is assigned to each document and a data frame is
created. A union of the 2 lists (the label list and the feed list) is done. After the union, each tweet is
bound to a label. The result is that every tweet is labelled with the category
identified.
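The binding of tweets to labels can be illustrated with a simple term-overlap rule. This is a sketch under assumptions: the term lists in `LABEL_TERMS` are hypothetical, and choosing the label with the most matching terms is one plausible rule, not necessarily the one used in the project. Tweets matching no label fall into 'Others'.

```python
import re

# Hypothetical term lists per label; the real lists came from the reviewed clusters.
LABEL_TERMS = {
    "Career": {"job", "jobs", "salary", "hiring"},
    "Infrastructure": {"road", "roads", "metro", "traffic"},
    "Lifestyle": {"food", "mall", "movie", "party"},
}


def label_tweet(text, label_terms=LABEL_TERMS):
    """Return the label whose term list overlaps most with the tweet, else 'Others'."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    best_label, best_hits = "Others", 0
    for label, terms in label_terms.items():
        hits = len(tokens & terms)
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label


# Made-up tweets.
tweets = [
    "Metro and roads in Delhi are improving",
    "Hiring for new jobs in Pune",
    "Good morning everyone",
]
labelled = [(tweet, label_tweet(tweet)) for tweet in tweets]
# Drop tweets that did not match any label.
kept = [(tweet, label) for tweet, label in labelled if label != "Others"]
```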
Conclusions: the final results
The resulting extract has each tweet labelled. There are 25180 records in the final extract.
A few tweets were labelled as ‘Others’ as they did not match any of the labels. The ‘Others’ label is excluded.
After this step, there are 25140 records left.
By city, the count of labelled tweets:
The top city is New Delhi, followed by Mumbai and Bangalore.
The maximum number of tweets
were on Lifestyle, Career and Infrastructure, and they are
mostly neutral in nature.
The tree map above depicts the label versus the
user reach. The highest reach is for the label ‘Lifestyle’
and the lowest is for ‘Law and Order’.
From the obtained set of cities, ranking was performed for the top 10 cities based on the Pareto rule.
An analysis of city versus sentiment score and category was performed. The outputs below explain the ranking of the cities:
New Delhi has the highest positive and negative scores.
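The slides do not state how the sentiment scores were aggregated. One plausible reading, sketched here purely as an assumption, is that each labelled tweet carries a numeric sentiment value and that values are summed per city and label, with positive and negative totals tracked separately. The records below are made up.

```python
from collections import defaultdict


def score_by_city_and_label(records):
    """Sum sentiment values for every (city, label) pair."""
    scores = defaultdict(float)
    for city, label, sentiment in records:
        scores[(city, label)] += sentiment
    return dict(scores)


def polarity_totals(records):
    """Return {city: (sum of positive values, sum of negative values)}."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for city, _label, sentiment in records:
        if sentiment > 0:
            totals[city][0] += sentiment
        elif sentiment < 0:
            totals[city][1] += sentiment
    return {city: tuple(pair) for city, pair in totals.items()}


# Made-up records: (city, label, sentiment value).
records = [
    ("New Delhi", "Lifestyle", 2.0),
    ("New Delhi", "Crime", -3.0),
    ("New Delhi", "Lifestyle", 0.5),
    ("Pune", "Career", 1.0),
    ("Pune", "Crime", -0.5),
]
scores = score_by_city_and_label(records)
totals = polarity_totals(records)
```

Under this reading, a city with many tweets can have both the largest positive total and the largest negative total at the same time.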
OVERALL – ‘GURGAON’ is the most livable city and ‘AHMEDABAD’ is the least livable
CITY        CAREER   CRIME  EDUCATION  ENVIRONMENT & HEALTH  INFRASTRUCTURE  LAW & ORDER  LIFESTYLE
Ahmedabad    -2.08  -15.19       2.33                 -2.13           -3.00       -26.81      12.19
Bangalore    71.85  -14.39       8.09                 -0.48           -1.37        -6.63      82.81
Chennai      43.00  -12.56       1.82                  8.00            8.24         0.33      56.50
Gurgaon     162.71   -1.83       4.14                  9.55           35.49        -1.13      60.90
Hyderabad    25.47  -18.52      11.31                  1.95           -6.61         0.85      20.05
Jaipur        2.09  -13.18      -1.60                  1.33            2.15         3.38      40.54
Mumbai       40.02  -60.49      11.44                 14.43           15.89       -10.01      54.42
New Delhi   114.23  -90.98       9.09                 17.83           47.63        -3.44     127.07
Pune         17.89   -4.00       6.93                  4.38            8.25        -0.65      46.63
Salem        14.25   -3.01       1.75                  2.76            2.73        -5.30      82.86
RANKING OF CITIES BY LABEL
The table above gives a snapshot of the scores of each city by label.
For CAREER – Gurgaon scores highest and Ahmedabad scores lowest
For CRIME – New Delhi is the most prone to crime whereas Gurgaon is the least prone
For EDUCATION – Mumbai and Hyderabad score highest whereas Jaipur scores lowest
For ENVIRONMENT & HEALTH – New Delhi scores highest and Ahmedabad scores lowest
For INFRASTRUCTURE – New Delhi scores highest, and Hyderabad and Ahmedabad score lowest
For LAW & ORDER – Jaipur scores highest whereas Ahmedabad scores lowest
For LIFESTYLE – New Delhi is the most talked about and Ahmedabad the least
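The per-label readings above can be reproduced from the table with a short sketch. The scores are copied from the table. Summing the seven label scores into one overall score is an assumption, since the slides do not say how the overall ranking was derived. On these values it places Gurgaon first and Ahmedabad last, in line with the overall result reported.

```python
LABELS = ["CAREER", "CRIME", "EDUCATION", "ENVIRONMENT & HEALTH",
          "INFRASTRUCTURE", "LAW & ORDER", "LIFESTYLE"]

# Scores per city, in the same order as LABELS (copied from the table above).
SCORES = {
    "Ahmedabad": [-2.08, -15.19, 2.33, -2.13, -3.00, -26.81, 12.19],
    "Bangalore": [71.85, -14.39, 8.09, -0.48, -1.37, -6.63, 82.81],
    "Chennai": [43.00, -12.56, 1.82, 8.00, 8.24, 0.33, 56.50],
    "Gurgaon": [162.71, -1.83, 4.14, 9.55, 35.49, -1.13, 60.90],
    "Hyderabad": [25.47, -18.52, 11.31, 1.95, -6.61, 0.85, 20.05],
    "Jaipur": [2.09, -13.18, -1.60, 1.33, 2.15, 3.38, 40.54],
    "Mumbai": [40.02, -60.49, 11.44, 14.43, 15.89, -10.01, 54.42],
    "New Delhi": [114.23, -90.98, 9.09, 17.83, 47.63, -3.44, 127.07],
    "Pune": [17.89, -4.00, 6.93, 4.38, 8.25, -0.65, 46.63],
    "Salem": [14.25, -3.01, 1.75, 2.76, 2.73, -5.30, 82.86],
}


def overall_ranking(scores):
    """Rank cities by the sum of their label scores (assumed aggregation)."""
    totals = {city: sum(values) for city, values in scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def best_and_worst(scores, labels):
    """For each label, return (highest-scoring city, lowest-scoring city)."""
    result = {}
    for index, label in enumerate(labels):
        ordered = sorted(scores, key=lambda city: scores[city][index])
        result[label] = (ordered[-1], ordered[0])
    return result


ranking = overall_ranking(SCORES)
extremes = best_and_worst(SCORES, LABELS)
```

For CRIME the scores are negative, so the highest-scoring city is the one least associated with crime and the lowest-scoring city is the one most associated with it.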