Sentiment analysis over Twitter offers organisations and individuals a fast and effective way to monitor the public's feelings towards them and their competitors. To assess the performance of sentiment analysis methods over Twitter, a small set of evaluation datasets has been released in the last few years. In this paper we present an overview of eight publicly available and manually annotated evaluation datasets for Twitter sentiment analysis. Based on this review, we show that a common limitation of most of these datasets, when assessing sentiment analysis at the target (entity) level, is the lack of distinct sentiment annotations for the tweets and the entities contained in them. For example, the tweet "I love iPhone, but I hate iPad" can be annotated with a mixed sentiment label, but the entity iPhone within this tweet should be annotated with a positive sentiment label. Aiming to overcome this limitation, and to complement current evaluation datasets, we present STS-Gold, a new evaluation dataset in which tweets and targets (entities) are annotated individually and may therefore carry different sentiment labels. The paper also provides a comparative study of the various datasets along several dimensions, including total number of tweets, vocabulary size and sparsity. We also investigate the pairwise correlations among these dimensions, as well as their correlations with sentiment classification performance on the different datasets.
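The distinction between tweet-level and entity-level labels in the abstract can be made concrete with a toy record structure. This is a minimal illustrative sketch, not the actual STS-Gold file format; all field names are assumptions.

```python
# Toy representation of independent tweet-level vs entity-level labels,
# using the example from the abstract. Field names are illustrative only.
record = {
    "tweet": "I love iPhone, but I hate iPad",
    "tweet_label": "mixed",          # label for the tweet as a whole
    "entity_labels": {               # independently annotated targets
        "iPhone": "positive",
        "iPad": "negative",
    },
}

def entity_label(rec, entity):
    """Return the sentiment label annotated for a given target entity."""
    return rec["entity_labels"].get(entity, "unknown")

print(entity_label(record, "iPhone"))  # positive
print(entity_label(record, "iPad"))    # negative
```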
Evaluation Datasets for Twitter Sentiment Analysis: A survey and a new dataset, the STS-Gold
1. Evaluation Datasets for Twitter Sentiment Analysis
A survey and a new dataset, the STS-Gold
Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani
Knowledge Media Institute, The Open University,
Milton Keynes, United Kingdom
1st Workshop on Emotion and Sentiment in Social and Expressive Media: Approaches and Perspectives from AI
2. Outline
• Definition & Background
• Evaluation Datasets for Twitter Sentiment Analysis
• STS-Gold
• Comparative Study
• Conclusion
3. Sentiment Analysis – Definition
“Sentiment analysis is the task of identifying positive and negative opinions, emotions and evaluations in text”
• “The main dish was delicious” → Positive
• “It is a Syrian dish” → Neutral
• “The main dish was salty and horrible” → Negative
5. Evaluation Datasets for Twitter Sentiment Analysis
Dimensions surveyed for each dataset:
• SA Level
• SA Task
• No. of Tweets
• Construction & Annotation
• Vocabulary Size
• Class Distribution
• Sparsity
7. What is Missing?
• Details about the annotation methodology (STS, HCR, Sanders)
• Entity-level sentiment evaluation:
– Most works focus on assessing the performance of sentiment classifiers at the tweet level (STS, OMD, SS-Tweet, Sanders)
– Datasets which allow for sentiment evaluation at the entity level assign similar sentiment labels to the tweet and the entities within it (HCR, WAB, GASP)
8. STS-Gold
• Enables the evaluation at both the entity and tweet levels
• Tweets and entities are annotated independently
• Contains 58 entities & 3000 tweets
9. Data Collection
(Pipeline figure) Entity extraction with the Alchemy API over the 180K-tweet STS corpus yields 147 entities; after identifying frequent concepts, 28 top and mid frequent entities are selected, with 100 tweets per entity (2800 tweets) plus 200 further tweets, giving the 3000 tweets of STS-Gold.
13. Comparative Study.1: Vocabulary Size vs. No. of Tweets
• There exists a high correlation between the vocabulary size and the number of tweets (ρ = 0.95)
• However, increasing the number of tweets does not always lead to an increase in the vocabulary size (OMD)
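The ρ values reported in this comparative study are Pearson correlation coefficients. A minimal sketch of the computation, on made-up toy statistics rather than the surveyed datasets' actual numbers:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy numbers for illustration only, not the surveyed datasets' statistics:
# vocabulary size tends to grow with the number of tweets, so ρ is near 1.
no_of_tweets    = [2000, 3000, 6000, 13000, 20000]
vocabulary_size = [5000, 7000, 12000, 21000, 33000]
print(round(pearson(no_of_tweets, vocabulary_size), 2))
```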
14. Comparative Study.2: Data Sparsity
• Data sparsity is an important factor that affects the overall performance of machine learning classifiers [17]. According to Saif et al., tweets are sparser than other types of data (e.g., movie review data).
• In this section, we aim to compare the presented datasets with respect to their sparsity.
• Twitter datasets are generally very sparse.
• Increasing either the number of tweets or the vocabulary size increases the sparsity degree of the dataset: ρ_no_of_tweets = 0.71, ρ_vocabulary_size = 0.77
• To calculate the sparsity degree Sd of a given dataset we use [13]:
Sd = 1 − (Σᵢ Nᵢ) / (n × |V|)
where Nᵢ is the number of distinct words in tweet i, n the number of tweets in the dataset and |V| the vocabulary size.
(Footnote: the TweetNLP tokenizer can be downloaded from http:…TweetNLP/)
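The sparsity degree on this slide is straightforward to compute. A minimal sketch, using naive whitespace tokenisation in place of the TweetNLP tokenizer the authors use:

```python
def sparsity_degree(tweets):
    """Sparsity degree Sd = 1 - (sum_i Ni) / (n * |V|), where Ni is the
    number of distinct words in tweet i, n the number of tweets and
    |V| the vocabulary size of the dataset."""
    tokenised = [set(t.split()) for t in tweets]   # naive tokenisation
    n = len(tokenised)
    vocab = set().union(*tokenised)                # dataset vocabulary V
    distinct_total = sum(len(words) for words in tokenised)
    return 1 - distinct_total / (n * len(vocab))

tweets = ["great phone love it", "battery is terrible", "love the screen"]
print(round(sparsity_degree(tweets), 3))
```

Each tweet uses only a small fraction of the dataset vocabulary, so Sd approaches 1 as the vocabulary grows relative to tweet length.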
15. Comparative Study.3: Classification Performance vs. Dataset Sparsity (1)
• According to Makrehchi et al. (2008) and Saif et al. (2012): in a given dataset, the classification performance and the sparsity degree are negatively correlated, i.e., increasing the dataset sparsity hinders the classification performance.
(Figure from Makrehchi and Kamel: classifier performance as a function of sparsity on the Industry Sectors, 20 Newsgroups and Reuters datasets, for (a) Rocchio and (b) SVM)
16. Comparative Study.3: Classification Performance vs. Dataset Sparsity (2)
• No correlation between the classification performance and the sparsity degree across the datasets (ρ_acc = −0.06, ρ_F1 = 0.23)
• The sparsity-performance correlation is intrinsic, meaning that it might exist within the dataset itself, but not necessarily across the datasets.
17. Conclusion
• Current datasets to evaluate Twitter sentiment classifiers:
– Focus on the tweet level.
– Assign similar sentiment labels to the tweets and the entities within them.
• STS-Gold allows for sentiment evaluation at both the tweet and the entity levels.
• A correlation between the vocabulary size and the number of tweets does not always exist.
• The sparsity-performance correlation is intrinsic, i.e., it only exists within the dataset itself, but not across the different datasets.