2
INTRODUCTION
With the development of new smart technologies, the world is going digital. The
growing reach of the web and the large volume of electronic data accumulating on
it have prompted efforts to uncover the information hidden in its text content.
Finding precise, relevant information and extracting it from the web has become a
time-consuming task. Many techniques exist for web information extraction, and
text mining is one of them.
Twitter is one of the best-known social platforms, with 316 M users worldwide and
500 M tweets sent per day. In India, Twitter has 22.2 M users (source:
http://www.huffingtonpost.in). With such a vast amount of data available, Twitter
has been used as a source of unstructured data for a variety of data analytics
work.
Every year, well-known magazines publish "most livable cities in the world" lists, and each
city wants to be the most livable in order to attract business and investment and to boost its
local economy and real estate market. Here I focus on text mining techniques, the k-means
algorithm and classification to identify livable cities in India by categorizing tweets and
their sentiments using criteria that include social and economic circumstances for residents,
public health, infrastructure, and the ease and availability of local transport.
3
NOW, WHAT ARE THE PRIMARY OBJECTIVE, SCOPE AND LIMITATIONS?
The research question: ‘How livable is a city, based on the comments and
views on Twitter, with the use of text mining?’
Objective: to provide a dynamic algorithm that can label the Twitter
feeds and reduce the complexity of ranking a city in different categories based
on Twitter views.
Scope and limitations:
The scope of this project is the Twitter views on Indian cities.
The limitation of the dataset is that the feeds are from a single day,
25/08/2015.
4
THE FLOW
THE SEMMA APPROACH
Stages: 1. SAMPLE, 2. EXPLORE, 3. MODIFY, 4. MODEL, 5. ASSESS
Boxes in the flow diagram:
Extract of Twitter feeds (.json file format)
Converted to .csv
Removal of duplicate texts
Language: English
Pareto of top cities
Loading the corpus
DTM creation
Stop-word removal
Tokenization
Loading the corpus
Term analysis
K-means clustering
Labelling based on the clusters
Classification of tweets based on labels
Result
Results exploration
City ranking
Results
Conclusions
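The tokenization, stop-word removal and DTM creation boxes can be illustrated with a minimal Python sketch. The deck does not say which tools, tokenizer or stop-word list were used, so the regular expression, the `STOP_WORDS` set and the sample tweets below are assumptions made only for illustration.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; the deck does not name the list it used.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def build_dtm(documents):
    """Return (vocabulary, rows): one row of term counts per document."""
    counts = [Counter(remove_stop_words(tokenize(d))) for d in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

# Made-up sample tweets, not taken from the dataset.
tweets = [
    "Traffic in Mumbai is bad",
    "The metro in Delhi is good",
    "Mumbai metro is good",
]
vocabulary, dtm = build_dtm(tweets)
for row in dtm:
    print(dict(zip(vocabulary, row)))
```

Each row of `dtm` is one tweet and each column one term, which is the shape a term analysis or clustering step would consume.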
5
What are the Data Attributes?
The dataset obtained after conversion from the .JSON file format to .CSV has 93762 records with 24 attributes related to each
Twitter feed.
From the list of 24 attributes, 11 attributes were selected to proceed with the project.
S.no  Attribute Name        S.no  Attribute Name
1     links                 13    user_name
2     text                  14    sentiment_type
3     topics                15    reach
4     application_rating    16    user_city
5     application_store     17    user_language
6     created_time          18    device
7     city                  19    application_version
8     user_id               20    keyword
9     sentiment             21    language
10    application           22    country
11    engagement            23    uri
12    source                24    user_country

S.no  Attributes Selected   S.no  Attributes Selected
1     text                  6     keyword
2     created_time          7     language
3     sentiment             8     application
4     sentiment_type        9     user_country
5     reach                 10    device
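As a rough illustration of the sampling steps (JSON to CSV, attribute selection, removal of duplicate texts, English only), here is a minimal Python sketch. It assumes the JSON file holds a list of objects keyed by the attribute names above and that English feeds carry the language code "en"; neither detail is stated in the deck, and the sample records are made up.

```python
import csv
import io
import json

# Attribute names copied from the "Attributes Selected" table.
SELECTED = ["text", "created_time", "sentiment", "sentiment_type", "reach",
            "keyword", "language", "application", "user_country", "device"]

def clean_feeds(records, english_code="en"):
    """Keep English feeds only, drop repeated texts, keep the selected attributes."""
    seen = set()
    cleaned = []
    for record in records:
        if record.get("language") != english_code:  # assumed language code
            continue
        text = record.get("text", "")
        if text in seen:  # duplicate text: keep only the first copy
            continue
        seen.add(text)
        cleaned.append({name: record.get(name, "") for name in SELECTED})
    return cleaned

def to_csv(rows):
    """Serialise the cleaned rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=SELECTED)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up records standing in for the .json extract.
raw = json.loads(
    '[{"text": "Pune is green", "language": "en", "reach": 10},'
    ' {"text": "Pune is green", "language": "en", "reach": 12},'
    ' {"text": "Namaste", "language": "hi", "reach": 5}]'
)
rows = clean_feeds(raw)
csv_text = to_csv(rows)
print(csv_text)
```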
6
What did the Explore stage yield?
After the data was cleaned, a Pareto analysis
was applied.
The top 30 cities contributed
95% of the volume.
Further exploration
(after data processing)
found the top terms
with their frequencies.
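A Pareto cut of this kind can be sketched in a few lines of Python: sort the cities by tweet count and keep adding cities until their cumulative share reaches the threshold. The counts below are made up for illustration; only the 95% threshold comes from the slide.

```python
from collections import Counter

def pareto_cut(city_counts, threshold=0.95):
    """Return the top cities whose cumulative share of tweets first reaches the threshold."""
    total = sum(city_counts.values())
    if total == 0:
        return []
    selected, running = [], 0
    for city, count in Counter(city_counts).most_common():
        selected.append(city)
        running += count
        if running / total >= threshold:
            break
    return selected

# Made-up counts, not the dataset's real volumes.
counts = {"New Delhi": 50, "Mumbai": 30, "Bangalore": 15, "Salem": 3, "Other": 2}
top = pareto_cut(counts)
print(top)
```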
7
Model Phase Results
There are various methods of clustering, and K-means is one of the most efficient.
Given a set of n data points, the K-means algorithm partitions them into k clusters, each characterized by a unique
centroid (mean). The elements belonging to one cluster are close to the centroid of that
cluster and dissimilar to the elements belonging to the other clusters.
Clustering was done to identify clusters of terms, which in turn enabled us to label the data.
This K-means clustering exercise,
using Euclidean distance, produced 25 clusters
(best fit).
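The partitioning described above can be sketched as a small, pure-Python version of the standard K-means iteration (Lloyd's algorithm) with Euclidean distance. This is only an illustration: the deck does not show its implementation or how the 25-cluster fit was chosen, and seeding the centroids with the first k points is an assumption made to keep the example deterministic.

```python
import math

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with Euclidean distance; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # converged: nothing moved
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids

# Toy 2-D points in two obvious groups (the deck clusters terms and uses k = 25).
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```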
8
Model Phase Results (contd.)
The 25 clusters, each with its list of
terms, are extracted to a .csv file.
The terms in each cluster are reviewed
manually and a label is given to each
cluster.
With this step, 6 labels for the clusters
are identified.
Once a label is assigned to each cluster, the labels (with their terms) are separated into individual text
documents for the purpose of classification. A number is assigned to each document and a data frame is
created. A union of the 2 lists (the label lists and the feed lists) is taken, and each tweet is then
bound to a label. The result is that every tweet is labelled with the category
identified.
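One simple way to bind tweets to labels is term matching, sketched below in Python. The deck does not spell out its binding rule, so the "most shared terms wins" rule is an assumption, and the term lists in `LABEL_TERMS` are hypothetical (only the label names Career, Infrastructure and Lifestyle appear in the deck).

```python
# Hypothetical label-to-terms mapping; the deck's actual cluster terms are not shown.
LABEL_TERMS = {
    "Career": {"job", "hiring", "salary"},
    "Infrastructure": {"metro", "road", "traffic"},
    "Lifestyle": {"food", "movie", "shopping"},
}

def label_tweet(text, label_terms=LABEL_TERMS):
    """Return the label sharing the most terms with the tweet, or 'Others' if none match."""
    tokens = set(text.lower().split())
    best_label, best_hits = "Others", 0
    for label, terms in label_terms.items():
        hits = len(tokens & terms)
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label

# Made-up tweets for illustration.
tweets = [
    "Hiring for a new job in Pune",
    "Metro road work causes traffic",
    "Good morning India",
]
labelled = [(t, label_tweet(t)) for t in tweets]
for tweet, label in labelled:
    print(label, "|", tweet)
```

Tweets that match no term fall into ‘Others’, the category that is later excluded from the final extract.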
9
Conclusions: the final results
The resulting extract has each tweet labelled. There are 25180 records in the final extract.
A few tweets were labelled ‘Others’ because they did not bind with any of the labels. The ‘Others’ label is excluded.
After this step, 25140 records are left.
By city, the count of labelled tweets:
The top city is New Delhi, followed by Mumbai and Bangalore.
10
Conclusions: the final results
The largest numbers of tweets
were on Lifestyle, Career and Infrastructure, and they are
mostly neutral in nature.
The tree map depicts the label versus the
user reach. The highest reach is for the label ‘Lifestyle’
and the lowest is for ‘Law and Order’.
11
Conclusions: the final results
From the obtained set of cities, ranking was performed for the top 10 cities based on the Pareto rule.
An analysis of city versus sentiment score and category was performed. The outputs explain the ranking of the cities:
New Delhi has the highest positive and negative scores.
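A city-versus-category score table could be built along the following lines. The deck does not explain how its sentiment scores were computed, so this Python sketch assumes that each labelled tweet carries a numeric `sentiment` value and that the score of a (city, label) pair is the sum of those values; the rows are made up.

```python
from collections import defaultdict

def score_by_city_and_label(rows):
    """Sum the sentiment value of each labelled tweet per (city, label) pair."""
    scores = defaultdict(float)
    for row in rows:
        if row["label"] == "Others":
            continue  # the deck excludes the 'Others' label
        scores[(row["city"], row["label"])] += row["sentiment"]
    return dict(scores)

# Made-up rows; summing sentiment values is an assumed scoring rule.
rows = [
    {"city": "Pune", "label": "Lifestyle", "sentiment": 0.5},
    {"city": "Pune", "label": "Lifestyle", "sentiment": 0.25},
    {"city": "Pune", "label": "Crime", "sentiment": -0.75},
    {"city": "Pune", "label": "Others", "sentiment": 1.0},
]
scores = score_by_city_and_label(rows)
print(scores)
```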
12
Conclusions: the final results
OVERALL: ‘GURGAON’ is the most livable city and ‘AHMEDABAD’ is the least livable.
13
Conclusions: the final results
CITY       CAREER  CRIME   EDUCATION  ENVIRONMENT & HEALTH  INFRASTRUCTURE  LAW & ORDER  LIFESTYLE
Ahmedabad   -2.08  -15.19    2.33      -2.13                 -3.00          -26.81        12.19
Bangalore   71.85  -14.39    8.09      -0.48                 -1.37           -6.63        82.81
Chennai     43.00  -12.56    1.82       8.00                  8.24            0.33        56.50
Gurgaon    162.71   -1.83    4.14       9.55                 35.49           -1.13        60.90
Hyderabad   25.47  -18.52   11.31       1.95                 -6.61            0.85        20.05
Jaipur       2.09  -13.18   -1.60       1.33                  2.15            3.38        40.54
Mumbai      40.02  -60.49   11.44      14.43                 15.89          -10.01        54.42
New Delhi  114.23  -90.98    9.09      17.83                 47.63           -3.44       127.07
Pune        17.89   -4.00    6.93       4.38                  8.25           -0.65        46.63
Salem       14.25   -3.01    1.75       2.76                  2.73           -5.30        82.86
RANKING OF CITIES BY LABEL
The table above gives a snapshot of the scores of each city by label.
For CAREER: Gurgaon scores highest and Ahmedabad lowest.
For CRIME: New Delhi is the most prone to crime, whereas Gurgaon is the least prone.
For EDUCATION: Mumbai and Hyderabad score highest, whereas Jaipur scores lowest.
For ENVIRONMENT & HEALTH: New Delhi scores highest and Ahmedabad lowest.
For INFRASTRUCTURE: New Delhi scores highest, and Ahmedabad and Hyderabad score lowest.
For LAW & ORDER: Jaipur scores highest, whereas Ahmedabad scores lowest.
For LIFESTYLE: New Delhi is the most talked about and Ahmedabad the least.
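The deck does not state how the overall verdict was derived from the per-label scores. As a sketch only, the Python below assumes the overall score of a city is the plain sum of its seven label scores from the table; under that assumption the ranking puts Gurgaon first and Ahmedabad last, in line with the stated result.

```python
# Per-label scores copied from the table, in column order:
# Career, Crime, Education, Environment & Health, Infrastructure, Law & Order, Lifestyle.
SCORES = {
    "Ahmedabad": [-2.08, -15.19, 2.33, -2.13, -3.00, -26.81, 12.19],
    "Bangalore": [71.85, -14.39, 8.09, -0.48, -1.37, -6.63, 82.81],
    "Chennai": [43.00, -12.56, 1.82, 8.00, 8.24, 0.33, 56.50],
    "Gurgaon": [162.71, -1.83, 4.14, 9.55, 35.49, -1.13, 60.90],
    "Hyderabad": [25.47, -18.52, 11.31, 1.95, -6.61, 0.85, 20.05],
    "Jaipur": [2.09, -13.18, -1.60, 1.33, 2.15, 3.38, 40.54],
    "Mumbai": [40.02, -60.49, 11.44, 14.43, 15.89, -10.01, 54.42],
    "New Delhi": [114.23, -90.98, 9.09, 17.83, 47.63, -3.44, 127.07],
    "Pune": [17.89, -4.00, 6.93, 4.38, 8.25, -0.65, 46.63],
    "Salem": [14.25, -3.01, 1.75, 2.76, 2.73, -5.30, 82.86],
}

def overall_ranking(scores):
    """Rank cities by the sum of their label scores (an assumed aggregation rule)."""
    totals = {city: sum(values) for city, values in scores.items()}
    return sorted(totals, key=totals.get, reverse=True)

ranking = overall_ranking(SCORES)
print(ranking)
```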
14

Indian cities ranking based on twitter feeds using advanced analytics
