Sentiment Analysis
Demonstration: Classification & Clustering
Yasas Senarath - Information Retrieval
Dataset and Tools Required
● Dataset
○ https://www.kaggle.com/c/si650winter11 (Training Dataset Only)
○ You will be able to submit predictions on the test set.
● Tools Required
○ Python 3.6 (or a similar version)
○ Scikit-Learn Toolkit
○ NLTK (you will have to download the 'stopwords' corpus using nltk.download())
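The stopword corpus is a one-time download; a minimal setup snippet:

import nltk
nltk.download('stopwords')  # fetches the stopword lists used in Step 2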
High Level Architecture
● Goals
○ to classify the sentiment of each sentence as "positive" or "negative"
○ to identify clusters of similar sentences
[Architecture diagram: Documents feed two branches, Classify (polarity) and Cluster; the two outputs are combined into a per-cluster polarity.]
Step 1: Loading Dataset
def read_dataset():
    # training.txt has one record per line: <label> TAB <sentence>
    with open('../resc/data/training.txt', 'r', encoding='utf-8') as f:
        records = list(zip(*[line.strip().split('\t') for line in f]))
    return records[1], records[0]  # (sentences, labels)

train_text, train_labels = read_dataset()
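A quick sanity check on the loaded data (a minimal sketch; per the error message in Step 4, the labels are the strings '0' and '1'):

from collections import Counter

print(train_text[0])          # first sentence
print(Counter(train_labels))  # label distribution, e.g. counts of '0' vs '1'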
Step 2: Extracting Features
● We will try out TF-IDF features
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

stops = set(stopwords.words('english'))
Step 2: Extracting Features (cont.)
kwargs = {
    'encoding': 'utf-8',
    'preprocessor': None,
    'stop_words': stops,
    'lowercase': True,
    'tokenizer': TweetTokenizer().tokenize,
}
tfidfVec = TfidfVectorizer(**kwargs)
X_train = tfidfVec.fit_transform(train_text)
# X_test = tfidfVec.transform(test_text)  # reuse the fitted vocabulary for the test set
X_train = X_train.toarray()  # densify; fine for a dataset of this size
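To see what the vectorizer learned, inspect the matrix shape and a few vocabulary terms (a sketch; get_feature_names_out needs scikit-learn >= 1.0, older versions use get_feature_names):

print(X_train.shape)                          # (n_sentences, n_terms)
print(tfidfVec.get_feature_names_out()[:10])  # a few learned terms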
Step 4: Training the Classifier
● Define the classifier
○ Let's create an SVC (Support Vector Classifier)
● Train the classifier

from sklearn.svm import LinearSVC

svc = LinearSVC()
svc.fit(X_train, train_labels)
Step 4: Training the Classifier (cont.)
● Oops!
ValueError: pos_label=1 is not a valid label: array(['0', '1'], dtype='<U1')
● Fix
○ The labels are the strings '0' and '1', but the f1 scorer expects integer labels, so encode them first:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(train_labels)  # '0'/'1' -> 0/1 (mapping in le.classes_)
svc = LinearSVC()
svc.fit(X_train, y)
Step 5: Evaluation
● 5-Fold Cross-Validation
● Train / Test Split

from sklearn.model_selection import cross_val_score, train_test_split

X = X_train  # the full TF-IDF matrix from Step 2; y: encoded labels from Step 4
scores = cross_val_score(svc, X, y, cv=5, scoring='f1')
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, shuffle=True)
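To turn those two options into numbers, report the mean cross-validation score and the F1 on the held-out split (a minimal sketch):

from sklearn.metrics import f1_score

print('Mean CV F1: {:.3f}'.format(scores.mean()))

svc.fit(X_train, y_train)  # refit on the training split only
y_pred = svc.predict(X_test)
print('Held-out F1: {:.3f}'.format(f1_score(y_test, y_pred)))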
Clustering
Step 1: Training the Clustering Algorithm

from sklearn.cluster import KMeans

NUM_CLUSTERS = 4
kmeans = KMeans(
    n_clusters=NUM_CLUSTERS,
    random_state=0
)
kmeans.fit(X)
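Because each TF-IDF column corresponds to a vocabulary term, the cluster centroids can be read back as term weights; a common way to inspect text clusters (a sketch reusing tfidfVec from Step 2; get_feature_names_out assumes scikit-learn >= 1.0):

import numpy as np

terms = tfidfVec.get_feature_names_out()
for i, center in enumerate(kmeans.cluster_centers_):
    top = np.argsort(center)[::-1][:5]  # indices of the 5 heaviest terms
    print('Cluster {}: {}'.format(i, ', '.join(terms[j] for j in top)))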
Step 2: Evaluating Clusters
labels = kmeans.labels_
score = silhouette_score(X, labels)
print('Silhouette Score: {}'.format(score))
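The same score can guide the choice of NUM_CLUSTERS; sweep a few values and keep the one with the highest silhouette (a sketch):

for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=0).fit(X)
    print('k={}: {:.3f}'.format(k, silhouette_score(X, km.labels_)))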
Clusters...
● Da Vinci Code
○ "I really enjoyed the Da Vinci Code but thought I would be disappointed in the other books &#8230;."
○ "this was the first clive cussler i've ever read, but even books like Relic, and Da Vinci code were more plausible than this."
○ "The Da Vinci Code book is just awesome."
○ "i liked the Da Vinci Code a lot."
● Brokeback Mountain
○ "Brokeback Mountain was amazing, and made me cry like a bitch."
○ "Brokeback Mountain is an excellent movie, I love it after watching it!"
● Mission Impossible
○ "friday i stayed in & watched Mission Impossible 3 which is amazing by the way."
○ "I LOVED Mission Impossible 3.."
Combining the two methods...
● A simple approach: for each cluster, compute the percentage of its sentences that are classified positive; that percentage is the cluster's polarity (sketched below).
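A minimal sketch of that combination, using kmeans.labels_ from the clustering step and the encoded labels y as the polarity signal (on unseen data you would use svc.predict instead):

for c in range(NUM_CLUSTERS):
    mask = kmeans.labels_ == c
    pct = 100.0 * y[mask].mean()  # share of positive sentences in cluster c
    print('Cluster {}: {:.1f}% positive'.format(c, pct))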