A demonstration of how to perform classification and clustering, using sentiment analysis as the application. First we build a sentiment classifier that uses TF-IDF features with a linear-kernel SVM. Then we cluster the documents based on their TF-IDF features.
I conducted this demo for the Information Retrieval lecture at Computer Science and Engineering, University of Moratuwa, Sri Lanka.
2. Dataset and Tools Required
● Dataset
  ○ https://www.kaggle.com/c/si650winter11 (training dataset only)
  ○ You will be able to submit a prediction using the testing set.
● Tools Required
  ○ Python 3.6 (or other)
  ○ Scikit-Learn Toolkit
  ○ NLTK (you will have to download the "stopwords" corpus using nltk.download())
3. High Level Architecture
● Goals
  ○ To classify the sentiment of each sentence as "positive" or "negative".
  ○ To identify clusters.
[Architecture diagram: Documents → Classify, Cluster → Combine → Cluster Polarity]
5. Step 1: Loading Dataset
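def read_dataset():
    # Each line of the training file holds a label and a sentence separated by a tab
    with open('../resc/data/training.txt', 'r', encoding='utf-8') as f:
        records = list(zip(*[line.rstrip('\n').split('\t', 1) for line in f]))
    # records[0] holds the labels, records[1] the sentences
    return records[1], records[0]

train_text, train_labels = read_dataset()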
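To try the parsing step without the Kaggle file, here is a minimal sketch that applies the same split-and-unzip idea to an in-memory sample. The two sample lines and the helper name parse_records are made up for illustration; the label, tab, sentence layout is assumed from the way read_dataset() returns records[1] as the text and records[0] as the labels.

```python
import io

def parse_records(lines):
    """Split 'label<TAB>sentence' lines into (texts, labels) tuples."""
    rows = [line.rstrip('\n').split('\t', 1) for line in lines if line.strip()]
    labels, texts = zip(*rows)
    return texts, labels

# Hypothetical sample in the assumed "<label>\t<sentence>" layout
sample = io.StringIO("1\tI loved this movie.\n0\tThis book was boring.\n")
texts, labels = parse_records(sample)
print(texts)
print(labels)
```

Note that the labels come back as strings ('0' and '1'), which matters later when the classifier is evaluated.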
6. Step 2: Extracting Features
● We will try out TF-IDF features
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Kept as a list: recent scikit-learn versions reject a set for stop_words
stops = stopwords.words('english')
7. Step 2: Extracting Features
● We will try out TF-IDF features
kwargs = {
    'encoding': 'utf-8',
    'preprocessor': None,
    'stop_words': stops,
    'lowercase': True,
    'tokenizer': TweetTokenizer().tokenize
}
tfidfVec = TfidfVectorizer(**kwargs)
# Learn the vocabulary and IDF weights from the training text
X_train = tfidfVec.fit_transform(train_text)
# X_test = tfidfVec.transform(test_text)
# Convert the sparse matrix to a dense array
X_train = X_train.toarray()
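To make the TF-IDF weights less of a black box, the following sketch computes them by hand for a toy corpus. It follows the formula scikit-learn documents for TfidfVectorizer's default settings: raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalisation of each row. The toy documents, the whitespace tokenisation, and the function name tfidf_rows are assumptions for illustration; the sketch does not reproduce the stop-word removal or the TweetTokenizer used above.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Hypothetical toy corpus
docs = ["i loved the movie", "i hated the movie", "the book was awesome"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

Terms that appear in every document (here "the") end up with a lower weight than terms that are specific to one document (here "loved").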
8. Step 4: Training the Classifier
● Define the classifier
  ○ Let's create an SVC (Support Vector Classifier)
● Training the classifier
from sklearn.svm import LinearSVC

svc = LinearSVC()
svc.fit(X_train, train_labels)
9. Step 4: Training the Classifier
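● Oops!
ValueError: pos_label=1 is not a valid label: array(['0', '1'], dtype='<U1')
● Fix
from sklearn.preprocessing import LabelEncoder

# Encode the string labels '0' and '1' as the integers 0 and 1
le = LabelEncoder()
y = le.fit_transform(train_labels)
svc = LinearSVC()
svc.fit(X_train, y)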
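Putting the steps so far together, a minimal end-to-end sketch on a toy dataset might look like this. The six sentences and their labels are invented for illustration, and the default TfidfVectorizer tokeniser is used instead of TweetTokenizer so that the example needs no NLTK downloads.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC

# Toy data (hypothetical); labels are strings, as in the training file
toy_text = [
    "i loved this movie",
    "this book is awesome",
    "an excellent and amazing film",
    "i hated this movie",
    "this book is terrible",
    "a boring and awful film",
]
toy_labels = ['1', '1', '1', '0', '0', '0']

# Encode string labels as integers so that 1 is a valid positive label
le = LabelEncoder()
y_toy = le.fit_transform(toy_labels)

vec = TfidfVectorizer(lowercase=True)
X_toy = vec.fit_transform(toy_text)  # sparse matrix; LinearSVC accepts it directly

clf = LinearSVC()
clf.fit(X_toy, y_toy)

# Predict on an unseen sentence and map the integer back to the original label
pred = clf.predict(vec.transform(["what an awesome film"]))
print(le.inverse_transform(pred))
```

With so few sentences the prediction is only a smoke test; the real training set is needed for meaningful results.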
10. Step 5: Evaluation
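● 5-Fold Cross-Validation
from sklearn.model_selection import cross_val_score

# X: TF-IDF feature matrix from Step 2, y: encoded labels from Step 4
scores = cross_val_score(svc, X, y, cv=5, scoring='f1')
● Train / Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, shuffle=True)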
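For reference, scoring='f1' reports the F1 score of the positive class, which scikit-learn takes to be the label 1 by default. This is consistent with the pos_label=1 error shown earlier for the string labels '0' and '1'. The sketch below computes the same quantity by hand on made-up predictions; the function name f1_positive and the sample vectors are illustrative assumptions.

```python
def f1_positive(y_true, y_pred, pos_label=1):
    """F1 score of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical true labels and predictions
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
print(round(f1_positive(y_true, y_pred), 3))  # prints 0.75
```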
14. Clusters...
● I really enjoyed the Da Vinci Code but thought I would be disappointed in the other books….
● this was the first clive cussler i've ever read, but even books like Relic, and Da Vinci code were more plausible than this.
● Brokeback Mountain was amazing, and made me cry like a bitch.
● Brokeback Mountain is an excellent movie, I love it after watching it!
● The Da Vinci Code book is just awesome.
● i liked the Da Vinci Code a lot.
● friday i stayed in & watched Mission Impossible 3 which is amazing by the way.
● I LOVED Mission Impossible 3..
Clusters: Da Vinci Code, Brokeback Mountain, Mission Impossible
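The slides do not say which clustering algorithm produced these groups. As one plausible sketch, k-means on TF-IDF vectors could be run as follows. KMeans, the choice of three clusters, and the toy sentences (simplified from the examples above) are assumptions, not necessarily what was used in the demo.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (simplified from the example sentences)
docs = [
    "i liked the da vinci code a lot",
    "the da vinci code book is just awesome",
    "brokeback mountain was amazing",
    "brokeback mountain is an excellent movie",
    "i loved mission impossible 3",
    "mission impossible 3 is amazing by the way",
]

vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(docs)

# Assumed: three clusters, fixed seed for repeatable assignments
km = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = km.fit_predict(X)

# Print documents grouped by their assigned cluster id
for cid, doc in sorted(zip(cluster_ids.tolist(), docs)):
    print(cid, doc)
```

The cluster ids are arbitrary integers; which id ends up on which topic depends on the initialisation.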
15. Combining the two methods...
A simple approach would be to… find the percentage of positives for each cluster.
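A minimal sketch of that idea: given a predicted sentiment and a cluster id for each document, compute the share of positive predictions per cluster. The function name cluster_polarity and the sample values are hypothetical.

```python
from collections import defaultdict

def cluster_polarity(cluster_ids, sentiments, positive=1):
    """Return {cluster id: percentage of documents predicted positive}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for cid, label in zip(cluster_ids, sentiments):
        totals[cid] += 1
        if label == positive:
            positives[cid] += 1
    return {cid: 100.0 * positives[cid] / totals[cid] for cid in totals}

# Hypothetical cluster assignments and classifier predictions per document
cluster_ids = [0, 0, 1, 1, 1, 2]
sentiments = [1, 0, 1, 1, 1, 0]
print(cluster_polarity(cluster_ids, sentiments))
# prints {0: 50.0, 1: 100.0, 2: 0.0}
```

A cluster with a high percentage can then be read as a topic that people mostly talk about positively.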