The document discusses using text mining techniques on Twitter data to identify the most livable cities in India. It aims to categorize tweets about Indian cities against criteria such as career, crime, education, environment, infrastructure, law and order, and lifestyle. The process involves extracting Twitter feeds, preprocessing the data, performing k-means clustering to identify labels, classifying tweets against those labels, and finally ranking the cities. Key results show Gurgaon ranked as the most livable city overall, while Ahmedabad was the least livable. New Delhi received the highest scores for career, environment, infrastructure, and lifestyle.
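The clustering step of the pipeline described above can be sketched roughly as follows. This is a minimal illustration assuming scikit-learn; the tweets and the two criteria (career vs. infrastructure) are invented for the example and are not from the paper's data.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example tweets: two about careers, two about infrastructure.
tweets = [
    "so many jobs and career opportunities in this city",
    "career opportunities are growing, great jobs market",
    "roads are full of potholes, terrible infrastructure",
    "infrastructure is crumbling, potholes on every road",
]

# Vectorize the tweets with TF-IDF and cluster them into two groups.
X = TfidfVectorizer(stop_words="english").fit_transform(tweets)
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Tweets about the same criterion should fall into the same cluster.
print(labels)
```

In the full pipeline each cluster would then be inspected and mapped to a criterion label (career, infrastructure, and so on) before classification and ranking.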
This document advertises a conference on apprenticeships in England to be held on March 13th, 2013. The conference is aimed at businesses and training providers and encourages them to work together in partnership to help apprenticeships in England grow. Attendees can book their place online at the provided website.
New document management and ECM platform from Adapting - Adapting
This document presents Adapting's new document management and ECM platform for integration through distributors. It offers a modular document management solution with features such as user, folder, document, and case-file management, as well as document capture and workflows. The platform is 100% web-based, customizable, and scalable, and can be deployed as a turnkey project through distribution partners.
The document provides information about tryouts for One World Soccer's select soccer teams. It outlines the tryout dates and locations for different age groups, what players and parents should expect at tryouts, and the commitment required for select soccer. It also details the pricing structure for select soccer, including annual fees that can be paid in full or installments, uniform costs, and sibling discounts. Players who make a select team will be expected to attend three practices per week and participate in tournaments. The presentation aims to inform parents and players about the select soccer program at One World Soccer.
The European Union is concerned about the rise of online disinformation and has proposed new rules to combat fake news. The new rules would require social media platforms to actively monitor content and quickly remove any content deemed false or misleading that could harm public health or safety. Some organizations have expressed concerns about how the new rules might affect freedom of expression.
The author came to college to obtain qualifications for a higher-paying job, as blue-collar jobs requiring only a high school education offer unattractive salaries. They are seeking an education in social work and law to help the less fortunate and disabled. They chose to attend Albany State University because of its reputation for small class sizes that foster discussion, critical thinking, and networking. While college will be challenging, the author is committed to persisting through difficulties to pursue their career goals in social work and law.
This document describes the Apache mod_autoindex module, which helps create automatic listings of server directory contents. It explains the various directives, such as AddAlt, AddDescription, AddIcon, IndexOptions, and IndexStyleSheet, that allow customizing the appearance and behavior of the automatically generated directory listings.
There are two main types of chemical bonds: ionic bonds and covalent bonds. Ionic bonds are the force of attraction between two oppositely charged ions in a molecule, such as in sodium fluoride (NaF). Covalent bonds are formed by the mutual sharing of electrons between two atoms in a molecule.
Diseño de sistemas de gestión documental orientados a procesos en el marco de... - Jordi Serra Serra
This presentation describes the evolution from records management to an integrated information governance framework, and proposes a process-based methodology for the design and implementation of records and information management systems.
Recommended citation: Diseño de sistemas de gestión documental orientados a procesos en el marco del gobierno de la información. XXII Jornadas de Archivos Universitarios, "La gestión de documentos en el entorno digital". Tarragona, 19 May 2016.
This Tennis Coaching Course aims to broaden teachers' knowledge of tennis rules and techniques and to help them become more skilled coaches. The program will be held on 10 January 2014 at IPGK Kampus Temenggong Ibrahim and will include courses, training workshops, and resources to improve tennis teaching skills.
The document discusses the concept of data-driven cities and provides examples from several major cities. It finds that while there is no single definition of a data-driven city, key elements include the generation and analysis of data to improve living standards through social, economic, and environmental initiatives. The study analyzed 28 global cities and identified Moscow, New York, London, Barcelona, and Sydney as technological leaders due to their extensive use of data-driven solutions across areas like transportation, utilities, security and citizen engagement. While each city demonstrates strengths in various technologies and policy areas, New York stands out as the overall leader in terms of development and implementation of data-driven practices in urban management.
IRJET - Twitter Spam Detection using Cobweb - IRJET Journal
This document discusses using Cobweb clustering and Gradient Boosting techniques to detect spam on Twitter. Cobweb clustering creates a classification tree to predict attributes of new objects by summarizing the attribute distributions of existing nodes. Gradient Boosting is an ensemble method that uses multiple weak learners (decision trees) to create a stronger predictive model. The paper aims to combine these techniques to create an enhanced spam detection system. It also reviews several existing approaches for Twitter spam detection using techniques like Hidden Markov Models, Random Forests, and asynchronous link-based algorithms.
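To make the Gradient Boosting side of this approach concrete, here is an illustrative sketch assuming scikit-learn. The per-tweet features (URL count, hashtag count, account age) and the labels are invented for the example; the paper's actual feature set is not given here.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Invented training rows: [url_count, hashtag_count, account_age_days].
X_train = [
    [3, 8, 2], [4, 9, 1], [5, 7, 3], [3, 6, 2],           # spam-like accounts
    [0, 1, 900], [1, 0, 1200], [0, 2, 700], [1, 1, 800],  # legitimate accounts
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# An ensemble of shallow decision trees fit in sequence, each correcting
# the errors of the previous ones.
clf = GradientBoostingClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Classify two unseen tweets: one spam-like, one legitimate-looking.
pred = clf.predict([[4, 8, 2], [0, 1, 1000]])
print(pred)  # expected: [1 0]
```

In the paper's proposal, the Cobweb classification tree would supply or refine the attributes fed into such a boosted model.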
Testing Vitality Ranking and Prediction in Social Networking Services With Dy... - reshma reshu
Social networking services are prevalent in many online communities, such as Twitter.com and Weibo.com, where millions of users interact with each other every day. One interesting and important problem in social networking services is ranking users by their vitality in a timely fashion. An accurate vitality ranking of users could benefit many parties in social networking services, such as ad providers and site operators. Although obtaining a vitality-based ranking of users is very promising, there are many technical challenges due to the large scale and dynamics of social networking data.
Social Friend Overlying Communities Based on Social Network Context - IRJET Journal
This document discusses algorithms for detecting overlapping communities in social networks. It begins with an introduction to social networks and community detection. It then reviews various algorithms that have been proposed for detecting overlapping communities, including clique percolation methods, fuzzy detection algorithms, agent-based and dynamic algorithms, and more. It also discusses using these algorithms to recommend friends and locations to users based on their behaviors and communities within social networks. The document presents results from applying these algorithms and concludes by discussing opportunities for future work improving recommendation performance.
This document summarizes the creation of an OLAP cube analysis for an eBay company called ColoradoDirtCheap.com. The analysis was created using SQL Server tools to extract, transform, and load data from CSV files into a data warehouse. Dimensional models and cubes were then created in Analysis Services. Reports on key metrics like profit percentage and sales growth were developed in Reporting Services and distributed to management for analysis. The analysis found that adding a new 10-tube product listing led to a significant increase in quarterly sales growth.
Service Level Comparison for Online Shopping using Data Mining - IIRindia
The term knowledge discovery in databases (KDD) refers to the analysis step of data mining. The goal of data mining is to extract knowledge and patterns from large data sets, not the data extraction itself. Big-data computing is a critical challenge for the ICT industry: engineers and researchers are dealing with petabyte data sets in the cloud computing paradigm, so the demand for a service stack to distribute, manage, and process massive data sets has risen drastically.

We investigate the problem of a single source node broadcasting a large chunk of data to a set of nodes so as to minimize the maximum completion time; these nodes may be located in the same datacenter or across geo-distributed data centers. The big-data broadcasting problem is modeled as a LockStep Broadcast Tree (LSBT) problem. The main idea of LSBT is to define a basic unit of upload bandwidth r: a node with capacity c broadcasts data to a set of c/r children at rate r, where r is a parameter to be optimized as part of the LSBT problem. The broadcast data are further divided into m chunks, which can then be broadcast down the LSBT in a pipelined manner. In a homogeneous network environment in which each node has the same upload capacity c, the optimal uplink rate r of LSBT is either c/2 or c/3, whichever gives the smaller maximum completion time. For heterogeneous environments, an O(n log^2 n) algorithm is presented to select an optimal uplink rate r and to construct an optimal LSBT. With lower computational complexity and low maximum completion time, the numerical results show better performance.

The methodology includes building and broadcasting various web applications, followed by the gateway application and batch processing over the TSV data, after which web crawling for resources and the MapReduce process take place, and finally picking products from recommendations and purchasing them.
The document presents a new greedy incremental approach for community detection in social networks. It begins by calculating node degrees and sorting the nodes in descending order. Initial communities are formed from the highest-degree nodes, and further nodes are then added incrementally to a community if doing so increases the community density. The approach is tested on standard datasets and detects communities reasonably well in less dense graphs; however, there is scope to improve performance on very dense graphs, for example by implementing the approach with parallel processing.
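The greedy scheme summarized above can be sketched in a few lines of plain Python. This is an interpretation, not the paper's exact algorithm: it seeds a community from each unassigned high-degree node and admits a neighbour only if the community's internal edge density does not decrease.

```python
def density(nodes, adj):
    """Internal edge density of a node set: edges present / edges possible."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    edges = sum(1 for u in nodes for v in adj[u] if v in nodes) / 2
    return edges / (n * (n - 1) / 2)

def greedy_communities(adj):
    # Sort nodes by degree, descending (step 1 of the approach).
    order = sorted(adj, key=lambda u: len(adj[u]), reverse=True)
    assigned, communities = set(), []
    for seed in order:
        if seed in assigned:
            continue
        comm = {seed}
        # Greedily admit the seed's neighbours if density does not drop.
        for v in sorted(adj[seed], key=lambda u: len(adj[u]), reverse=True):
            if v not in assigned and density(comm | {v}, adj) >= density(comm, adj):
                comm.add(v)
        assigned |= comm
        communities.append(comm)
    return communities

# Toy graph: two triangles joined by the bridge edge c-d.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
comms = greedy_communities(adj)
print(comms)
```

As the summary notes, a density rule of this kind is conservative: once a community is very dense, few additions qualify, which is one reason performance differs between sparse and dense graphs.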
The document discusses social media security challenges related to cognition, cross-platform issues, and push algorithms. It covers topics like abuse targeting internal or external victims, security issues on social media, and the life cycle and influence of social media posts. Detection of multiple accounts and geolocation identification on social media are also summarized.
1. The document discusses market analytics using predictive modeling and clustering techniques in R. It provides background on CDAC and predictive analytics, describing how segmentation can group customers into clusters like inactive, new active, low-value active, and high-value active.
2. Common clustering algorithms for segmentation are discussed, including k-means and hierarchical agglomerative clustering. K-means requires specifying the number of clusters k, while HAC creates a dendrogram and requires deciding on merge points.
3. Limitations of clustering for market analytics include needing to update models frequently as new customer data is added and market data changes over time. Stability of clusters can also be an issue.
Profile Analysis of Users in Data Analytics Domain - Drjabez
Data analytics and data science have been in fast-forward mode recently. We see many companies hiring people for data analysis and data science, especially in India, and many recruiting firms use Stack Overflow to fish for potential candidates. The industry has also started to recruit people based on shapes of expertise: a person's expertise is metaphorically outlined by the shapes of letters such as I, T, M, and hyphen, depending on their experience in an area (depth) and the number of areas of interest (width). This proposal builds upon work on mining shapes of user expertise in a typical online social question-and-answer (Q&A) community, where expert users often answer questions posed by other users. We deal with the temporal analysis of expertise among Q&A community users, in terms of how the users/experts have evolved over time.
Keywords: shapes of expertise, graph communities, expertise evolution, Q&A community
The document discusses visual data mining techniques and their application to spatial data analysis. It proposes the CubeView system which uses a data cube structure to support data mining and visualization of large spatial datasets. CubeView allows selective visualization through spatial outlier detection algorithms that identify suspiciously deviating observations in the data. The system was applied to traffic data from road sensor networks to enable analysis of traffic patterns and outliers.
MINING OPINIONS ABOUT TRAFFIC STATUS USING TWITTER MESSAGES - IAEME Publication
The document describes a system for mining opinions from tweets about traffic status. Around 5000 traffic-related tweets were collected and preprocessed by removing stop words and punctuation. The tweets were manually labeled as expressing positive (p) or negative (n) opinions. Various classifiers were trained on the labeled data and evaluated. The top 7 classifiers were combined into an ensemble model, which achieved an F-measure of 87.15% for classifying tweets in the test set, indicating the system is effective at mining opinions on traffic status from tweets.
The document describes a study that aimed to develop a system for mining opinions from tweets about traffic status. Over 5,000 traffic-related tweets were collected and preprocessed by removing stop words and punctuation. The tweets were manually labeled as expressing either positive (p) or negative (n) sentiment. Various classifiers were trained on 80% of the labeled data and validated. The top 7 performing classifiers were combined into an ensemble model, which achieved an F-measure of 87.15% when classifying the remaining 20% of tweets, indicating the system is effective at mining traffic sentiment from tweets.
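The evaluation step described above, combining the top classifiers by majority vote and scoring the result with the F-measure, can be sketched as follows. The gold labels and per-classifier predictions are invented for illustration; 'p' and 'n' mirror the paper's positive/negative labels.

```python
def majority_vote(predictions):
    """predictions: list of per-classifier label lists, all the same length."""
    combined = []
    for votes in zip(*predictions):
        # Pick the label with the most votes at this position.
        combined.append(max(set(votes), key=votes.count))
    return combined

def f_measure(gold, pred, positive="p"):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["p", "p", "n", "n", "p"]
clf_outputs = [
    ["p", "p", "n", "p", "p"],   # classifier 1
    ["p", "n", "n", "n", "p"],   # classifier 2
    ["p", "p", "n", "n", "n"],   # classifier 3
]
ensemble = majority_vote(clf_outputs)
print(ensemble, round(f_measure(gold, ensemble), 4))
```

Here the majority vote corrects the individual classifiers' mistakes, which is the motivation for combining the top performers into an ensemble.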
This document analyzes data from online forums used by two software companies, Salesforce and SAP, to crowdsource ideas for new software features from customers. The analysis finds that a small core group of users in each forum are responsible for generating a large proportion of implemented ideas. Betweenness centrality is identified as an effective measure for identifying influential users. Commenting on ideas is found to be more effective than voting at fostering community formation among participants.
Finding prominent features in communities in social networks using ontology - csandit
Community detection is one of the major tasks in social networks. The success of any community depends upon the features that were selected to form the community, so it is important to have knowledge of the main features that may affect the community. In this work we have proposed a method to find prominent features based on which a community can be formed. Ontology has been used for this purpose.
The document describes a project that uses k-means clustering to group local authorities in the UK that are statistically similar based on key metrics related to the UK government's 12 levelling up missions. The analysis found clusters of local authorities with higher/lower levels of health, well-being, connectivity, and educational performance. Future work may develop the analysis over multiple time periods and using additional datasets to understand outcomes based on demographic groupings. User feedback is sought on how similar groupings could best be utilized and presented.
The document summarizes the results of Roland Berger's global study on smart city strategies. It finds that while most cities have room for improvement in their smart city strategies, the top-scoring strategies come from Vienna, Chicago, and Singapore and excel in strategic planning, infrastructure, and action fields. It identifies best practices such as actively involving stakeholders, avoiding isolated solutions, and establishing coordination bodies. The document concludes by recommending 10 key points for cities to address when developing smart city strategies, such as reevaluating the city's role, encouraging private sector contributions, and ensuring data security.
The document describes a system that can automatically generate multiple choice questions from input text. It discusses summarizing the text, extracting keywords, mapping keywords to sentences, classifying sentence types, generating distractors, and algorithms for different question types including fill in the blank, true/false, and match the following. The system is implemented in Python and tested on various text sources to demonstrate its effectiveness.
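One of the question types mentioned above, fill in the blank, can be sketched as follows. This is an illustrative assumption, not the system's actual algorithm: it blanks the sentence word that is most frequent across the whole text and draws distractors from the remaining corpus keywords.

```python
import re
from collections import Counter

# A small illustrative stop word list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "and", "to", "by"}

def keywords(text):
    """Count non-stopword terms in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def fill_in_the_blank(sentence, corpus_keywords, n_distractors=2):
    sent_counts = keywords(sentence)
    if not sent_counts:
        return None
    # Blank the sentence word that is most frequent across the corpus.
    target = max(sent_counts, key=lambda w: corpus_keywords[w])
    question = re.sub(rf"\b{target}\b", "_____", sentence, flags=re.IGNORECASE)
    distractors = [w for w, _ in corpus_keywords.most_common()
                   if w != target][:n_distractors]
    return {"question": question, "answer": target, "distractors": distractors}

text = ("Photosynthesis converts light energy into chemical energy. "
        "Plants perform photosynthesis in their leaves.")
corpus = keywords(text)
for sentence in text.split(". "):
    print(fill_in_the_blank(sentence.rstrip("."), corpus))
```

The other generators described (true/false, match the following) would reuse the same keyword extraction and sentence classification stages.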
Similar to INDIAN CITIES_RANKING BASED ON TWITTER FEEDS USING ADVANCED ANALYTICS (20)
INDIAN CITIES_RANKING BASED ON TWITTER FEEDS USING ADVANCED ANALYTICS
1.
2. INTRODUCTION
With the development of new smart technologies, the world is going digital. The growing scope of the web and the large amount of electronic data piling up across it have prompted the exploration of the hidden information in its text content. Looking up precise, relevant information and extracting it from the web has become a time-consuming task. Many techniques are used for web information extraction, and text mining is one of them.
Twitter is one of the best-known social platforms, with 316 M users worldwide and 500 M tweets sent per day. In India, Twitter has 22.2 M users (source: http://www.huffingtonpost.in). With such a vast amount of data available, Twitter has been used as a source of unstructured data for a variety of data-analytics insights.
Every year well-known magazines publish "most livable cities in the world" lists, and each city wants to be the most livable to attract business and investment and to boost local economies and real-estate markets. Here I focused on text mining techniques, the k-means algorithm and classification to identify livable cities in India by categorizing tweets and their sentiments using different criteria, including social and economic circumstances for residents, public health, infrastructure, and the ease and availability of local transport.
3. NOW, WHAT ARE THE PRIMARY OBJECTIVE, SCOPE AND LIMITATIONS?
The research question: 'How livable is a city, based on the comments and views on Twitter, with the use of text mining?'
Objective: to provide a dynamic algorithm which can label the Twitter feeds and reduce the complexity of ranking a city in different categories based on Twitter views.
Scope and Limitations:
The scope of this project is the Twitter views on Indian cities.
The limitation of the dataset is that the feeds are from a single day, '25/08/2015'.
4. THE FLOW – THE SEMMA APPROACH
1. SAMPLE – Extract of Twitter feeds: .json file format converted to .csv; removal of duplicate texts; language filter (English); Pareto of top cities.
2. EXPLORE – Loading the corpus; term analysis; results exploration.
3. MODIFY – DTM creation; stop-word removal; tokenization.
4. MODEL – K-means clustering; labelling based on the clusters; classification of tweets based on labels; result.
5. ASSESS – City ranking results; conclusions.
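The cleanup steps in the SAMPLE stage (removal of duplicate texts and the English-language filter) can be sketched in Python. This is a minimal illustration, not the project's actual code; each record is assumed to be a dict carrying the `text` and `language` attributes from the dataset.

```python
def preprocess_feeds(records):
    """Drop duplicate tweet texts and keep only English-language feeds."""
    seen, cleaned = set(), []
    for rec in records:
        # language filter: keep English tweets only
        if rec.get("language") != "en":
            continue
        # duplicate-text removal: keep the first occurrence
        if rec["text"] in seen:
            continue
        seen.add(rec["text"])
        cleaned.append(rec)
    return cleaned
```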
5. WHAT ARE THE DATA ATTRIBUTES?
The dataset, obtained after conversion from the .JSON file format to .CSV, has 93,762 records with 24 attributes related to each Twitter feed.
From the list of 24 attributes, 11 were selected to proceed with the project.
S.no  Attribute Name        S.no  Attribute Name
 1    links                 13    user_name
 2    text                  14    sentiment_type
 3    topics                15    reach
 4    application_rating    16    user_city
 5    application_store     17    user_language
 6    created_time          18    device
 7    city                  19    application_version
 8    user_id               20    keyword
 9    sentiment             21    language
10    application           22    country
11    engagement            23    uri
12    source                24    user_country

S.no  Attributes Selected   S.no  Attributes Selected
 1    text                   6    keyword
 2    created_time           7    language
 3    sentiment              8    application
 4    sentiment_type         9    user_country
 5    reach                 10    device
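The .JSON-to-.CSV conversion, keeping only the selected attributes, can be sketched as follows. This is an assumed implementation (the slides do not show the actual conversion code), with one JSON record per input line and the `SELECTED` names taken from the attribute table.

```python
import csv
import json

# attributes selected for the project (per the attribute table)
SELECTED = ["text", "created_time", "sentiment", "sentiment_type", "reach",
            "keyword", "language", "application", "user_country", "device"]

def json_to_csv(json_lines, csv_path):
    """Parse one JSON record per line and write only the selected attributes."""
    rows = [{k: json.loads(line).get(k, "") for k in SELECTED}
            for line in json_lines]
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=SELECTED)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```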
6. WHAT DID THE EXPLORE STAGE RESULT IN?
After cleaning up the data, the Pareto principle was applied: the top 30 cities contributed 95% of the volume.
On further exploration (post data processing), the top terms with their frequencies were found.
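The Pareto step (keeping the smallest set of top cities that together cover 95% of tweet volume) can be sketched like this; `pareto_cities` is a hypothetical helper, not code from the original project.

```python
from collections import Counter

def pareto_cities(cities, coverage=0.95):
    """Return the most-frequent cities that together cover
    `coverage` of total tweet volume."""
    counts = Counter(cities).most_common()  # sorted by frequency, descending
    total = sum(n for _, n in counts)
    kept, covered = [], 0
    for city, n in counts:
        if covered / total >= coverage:
            break
        kept.append(city)
        covered += n
    return kept
```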
7. MODEL PHASE RESULTS?
There are various methods of clustering, and k-means is one of the most efficient. From a given set of n data points, the k-means algorithm partitions the data into k different clusters, each characterized by a unique centroid (mean). The elements belonging to one cluster are close to the centroid of that cluster and dissimilar to the elements belonging to the other clusters.
Clustering was done to identify clusters of terms, in turn enabling us to label the data.
With this k-means clustering exercise using Euclidean distance, 25 clusters (best fit) were obtained.
8. MODEL PHASE RESULTS? Contd…
The 25 clusters, each holding a list of terms, are extracted to a .csv file. The terms in each cluster are reviewed manually and a label is given to each cluster. With this step, 6 labels for the clusters are identified.
Once the labels are assigned to each cluster, all the labels (with their terms) are separated into individual text documents for the purpose of classification. A number is assigned to each document and a data frame is created. A union of the 2 lists (the label lists and the feed lists) is done; after the union, each tweet is bound to a label. The result: every tweet is labelled with the category identified.
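The binding of tweets to cluster labels can be illustrated with a simple term-overlap match. This is a sketch of the idea under assumed data structures, not the exact union-of-lists procedure used in the project.

```python
def label_tweet(text, label_terms):
    """Assign the label whose term list shares the most words with the
    tweet; tweets matching no label fall back to 'Others'."""
    words = set(text.lower().split())
    best_label, best_hits = "Others", 0
    for label, terms in label_terms.items():
        hits = len(words & set(terms))
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label
```

Tweets that match no label end up in 'Others', which mirrors how unmatched tweets were handled (and later excluded) in the conclusions.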
9. CONCLUSIONS – THE FINAL RESULTS!
The resulting extract has each tweet labelled; there are 25,180 records in the final extract. A few tweets were labelled 'Others' as they did not match any of the labels; the 'Others' label is excluded. After this step, 25,140 records are left.
By city, the count of labelled tweets: the top city is New Delhi, followed by Mumbai and Bangalore.
10. CONCLUSIONS – THE FINAL RESULTS!
It can be identified that the maximum number of tweets were on Lifestyle, Career and Infrastructure, and that they are mostly neutral in nature.
The tree map depicts the label versus the user reach: the highest reach is for the label 'Lifestyle' and the least is for 'Law and Order'.
11. CONCLUSIONS – THE FINAL RESULTS!
Out of the obtained set of cities, ranking was performed for the top 10 cities based on the Pareto rule.
An analysis of city versus sentiment score, by category, was performed. The following outputs explain the ranking of the cities:
New Delhi has the highest positive and negative scores.
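The city-versus-sentiment analysis can be sketched as an aggregation of sentiment scores per (city, label) pair; the record layout below is an assumption for illustration.

```python
from collections import defaultdict

def rank_cities(tweets):
    """Sum sentiment scores per (city, label) and rank cities by
    their overall total."""
    by_city_label = defaultdict(float)
    totals = defaultdict(float)
    for t in tweets:
        by_city_label[(t["city"], t["label"])] += t["sentiment"]
        totals[t["city"]] += t["sentiment"]
    # highest overall sentiment first
    ranking = sorted(totals, key=totals.get, reverse=True)
    return dict(by_city_label), ranking
```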
12. CONCLUSIONS – THE FINAL RESULTS!
OVERALL – 'GURGAON' is the most livable city and 'AHMEDABAD' is the least livable.
13. CONCLUSIONS – THE FINAL RESULTS!
RANKING OF CITIES BY LABEL

CITY        CAREER   CRIME   EDUCATION  ENVIRONMENT  INFRASTRUCTURE  LAW &    LIFESTYLE
                                        & HEALTH                     ORDER
Ahmedabad    -2.08  -15.19       2.33        -2.13           -3.00   -26.81       12.19
Bangalore    71.85  -14.39       8.09        -0.48           -1.37    -6.63       82.81
Chennai      43.00  -12.56       1.82         8.00            8.24     0.33       56.50
Gurgaon     162.71   -1.83       4.14         9.55           35.49    -1.13       60.90
Hyderabad    25.47  -18.52      11.31         1.95           -6.61     0.85       20.05
Jaipur        2.09  -13.18      -1.60         1.33            2.15     3.38       40.54
Mumbai       40.02  -60.49      11.44        14.43           15.89   -10.01       54.42
New Delhi   114.23  -90.98       9.09        17.83           47.63    -3.44      127.07
Pune         17.89   -4.00       6.93         4.38            8.25    -0.65       46.63
Salem        14.25   -3.01       1.75         2.76            2.73    -5.30       82.86

The above table gives a snapshot of the scores of each city by label.
For CAREER – Gurgaon scores highest and Ahmedabad lowest
For CRIME – New Delhi is most prone to crime, whereas Gurgaon is least prone
For EDUCATION – Mumbai and Hyderabad score highest, whereas Jaipur scores lowest
For ENVIRONMENT & HEALTH – New Delhi scores highest and Ahmedabad lowest
For INFRASTRUCTURE – New Delhi scores highest, while Ahmedabad and Hyderabad score lowest
For LAW & ORDER – Jaipur is high on law & order, whereas Ahmedabad is the lowest
For LIFESTYLE – New Delhi is most spoken of for lifestyle and Ahmedabad the least