Applying Text Mining
with Python
Andreas Chandra
linkedin.com/in/chandraandreas
Contents
● Introduction to Text Mining
● Working with Regular Expressions
● Fundamentals of Natural Language Processing
● Text Classification
● Topic Modeling
Introduction to Text Mining
Background
What you can do with text
● Parse text
● Extract information from text
● Classify text documents
● Search for relevant text documents
● Sentiment analysis
Hierarchy
Document
Sentence
Word
Character
Python - string functions
● len(text)
● word.istitle()
● word.startswith(<letter>)
● word.endswith(<letter>)
● set(text_list)
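The functions listed above can be tried on a short sentence. This is a minimal sketch; the sample text and variable names are assumptions chosen for illustration.

```python
# Hypothetical sample text, chosen only for illustration.
text = "Text mining turns raw Text into insight"
words = text.split(" ")

char_count = len(text)                                   # number of characters
word_count = len(words)                                  # number of words
title_words = [w for w in words if w.istitle()]          # capitalized words
t_words = [w for w in words if w.startswith("t")]        # case-sensitive prefix test
ends_with_t = [w for w in words if w.endswith("t")]      # suffix test
unique_words = set(words)                                # duplicates removed

print(word_count, title_words, t_words, ends_with_t, len(unique_words))
```

Note that startswith() and endswith() are case-sensitive, so "Text" does not match the prefix "t".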
Python - String Operations
● string.lower(); string.upper(); string.title()
● string.split(<separator>)
● <separator>.join(<list>)
● string.strip()
● string.find(<word>)
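A short sketch of these operations chained together, assuming a made-up input string with stray whitespace:

```python
# Hypothetical sample string with leading and trailing spaces.
raw = "  Text Mining with Python  "

clean = raw.strip()                  # drop leading/trailing whitespace
lowered = clean.lower()              # all lowercase
titled = clean.title()               # first letter of every word capitalized
tokens = clean.split(" ")            # split on single spaces
joined = "-".join(tokens)            # join is called on the separator
position = clean.find("Mining")      # index of first match
missing = clean.find("Java")         # find() returns -1 when nothing matches

print(lowered, titled, tokens, joined, position, missing)
```

Python strings are immutable, so each call returns a new string and leaves the original unchanged.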
Working with RegEx
Regular Expression
Exercise
string1 = "#indonesia hebat"
string2 = "@DKIJakarta macet banget nih"
string4 = ""
Regular Expression
. : matches any character (except a newline, by default)
^ : start of the string
$ : end of the string
[] : matches any one character from the set inside the brackets
[a-z] : matches any one character in the range a, b, c, d, …, z
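The metacharacters above can be tried with Python's built-in re module. This sketch reuses the two exercise strings from the slides; the patterns and the "+" quantifier (one or more repetitions) are additions made here for illustration.

```python
import re

tweets = ["#indonesia hebat", "@DKIJakarta macet banget nih"]

# Character sets: a marker followed by one or more letters, digits, or underscores.
hashtag_pattern = re.compile(r"#[A-Za-z0-9_]+")
mention_pattern = re.compile(r"@[A-Za-z0-9_]+")

hashtags = [m for t in tweets for m in hashtag_pattern.findall(t)]
mentions = [m for t in tweets for m in mention_pattern.findall(t)]

# Anchors: ^ ties the match to the start, $ to the end of the string.
starts_with_hash = [t for t in tweets if re.search(r"^#", t)]
ends_with_nih = [t for t in tweets if re.search(r"nih$", t)]

# The dot stands for any single character.
dot_matches = re.findall(r"h.bat", tweets[0])

print(hashtags, mentions, starts_with_hash, ends_with_nih, dot_matches)
```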
Fundamentals of Natural Language Processing
Definition
● The ability to understand human language
● Understanding human language in order to extract information about words and about
how the language is structured
NLP Goals
● Counting words
● Finding sentence boundaries
● POS tagging
● Parsing sentence structure (subject + predicate + object + adverbial)
● Identifying semantic roles
● Identifying entities in a sentence
● Determining which entity a pronoun or possessive refers to
NLTK
● Toolkit
● Open source
● Includes a wrapper for scikit-learn classifiers (SklearnClassifier)
● Ships with several popular corpora
Python
● import nltk
● ...
NLP in Action
● nltk.word_tokenize()
● nltk.FreqDist()
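The two calls above tokenize a text and count how often each token occurs. As a dependency-free sketch of the same idea, the block below uses the standard library's re and collections.Counter as stand-ins; the regular-expression tokenizer is a simplifying assumption and is much cruder than nltk.word_tokenize(), which also needs NLTK's tokenizer data to be downloaded.

```python
import re
from collections import Counter

# Hypothetical sample text.
text = "Text mining is fun. Text mining with Python is practical."

# Crude tokenizer: lowercase the text, then keep runs of letters as tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# Frequency distribution: token -> number of occurrences.
freq = Counter(tokens)

print(len(tokens))
print(freq.most_common(3))
```

nltk.FreqDist offers a comparable most_common() method, so the counting step carries over with little change once NLTK is installed.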
Lemma vs. Stem
What is a lemma?
What is a stem?
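In short, a stem is what is left after affixes are stripped by rule, and it need not be a real word; a lemma is the dictionary form of the word. The toy sketch below only illustrates that difference: the suffix list and the lookup table are hypothetical and far too small for real use, where tools such as NLTK's PorterStemmer and WordNetLemmatizer would normally be used.

```python
# Toy rule list and lookup table, invented for this illustration.
SUFFIXES = ("ing", "ies", "es", "ed", "s")
LEMMAS = {"studies": "study", "studying": "study", "went": "go"}

def toy_stem(word):
    """Strip the first matching suffix; the result may not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["studies", "studying", "went"]:
    print(w, "stem:", toy_stem(w), "lemma:", toy_lemma(w))
```

The rule-based stem of "studies" here is "stud", while its lemma is "study"; an irregular form such as "went" is untouched by suffix rules but maps to "go" through the lookup.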
POS Tagging
In a nutshell: identifying whether a word is a verb, a noun, an adverb,
etc.
Python
import nltk
text = "You should go now"
text2 = nltk.word_tokenize(text)  # split the sentence into word tokens
nltk.pos_tag(text2)  # returns a list of (token, tag) pairs
Issues in POS Tagging
---> Go to blackboard
Text Classification
Which category best fits the text below?
http://ekonomi.kompas.com/read/2017/10/25/102555326/apbn-2018-diharapkan-bisa-menjadi-sentimen-positif
● Development
● Finance
● Politics
Uses of Text Classification
● Sentiment analysis: is this movie review negative or positive?
● Spam detection: is this email spam or not?
● Topic identification: is this news article about technology, sports, or health?
● Spelling correction: "bener" or "benar"?
Supervised Learning
● Machine learning task
● Labeled training data
● Learning from labeled data
Classification Basics
● Binary Classification
● Multi-class Classification
● Multi-label Classification
Text Features
● Words
○ Most common words
○ Stop words
○ Normalization
○ Stemming / Lemmatization
● Sentences
○ POS tagging
○ Grammatical structure
○ Similar words
Algorithms
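The word-level features above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the stop-word list is a tiny made-up one, and normalization is reduced to lowercasing and keeping letters only.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "is", "a", "and", "of", "this"}

def extract_features(text):
    """Turn a text into a bag of words: token -> count."""
    tokens = re.findall(r"[a-z]+", text.lower())        # normalization
    kept = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return Counter(kept)

features = extract_features("This movie is great, and the cast is great")
print(features.most_common())
```

The resulting counts are the kind of numeric input a classifier can be trained on.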
1. Naive Bayes Classifier
2. Support Vector Machine
Naive Bayes
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
Naive Bayes
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
P(Sunny | Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes) = 9/14 = 0.64
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is higher than
P(No | Sunny) = 1 - 0.60 = 0.40, so the predicted class is Yes.
https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/
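The arithmetic in the worked example above can be checked with exact fractions. The counts (3 of 9, 9 of 14, 5 of 14) are taken from the slide; the variable names are chosen here for illustration.

```python
from fractions import Fraction

# Counts from the worked example on the slide.
p_sunny_given_yes = Fraction(3, 9)
p_yes = Fraction(9, 14)
p_sunny = Fraction(5, 14)

# Bayes' rule: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = 1 - p_yes_given_sunny

print(p_yes_given_sunny, float(p_yes_given_sunny))
print(p_no_given_sunny, float(p_no_given_sunny))
```

Working with exact fractions gives 3/5, which shows that the 0.60 on the slide is the exact value rather than a product of the rounded figures.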
Support Vector Machine
Python
Jupyter Notebook
Install Scikit-learn
from sklearn import <algorithm>
Python - Naive Bayes
# train_data, train_labels, test_data, test_labels are assumed to be prepared beforehand
from sklearn import metrics, naive_bayes
nb = naive_bayes.MultinomialNB()
model = nb.fit(train_data, train_labels)
predict = model.predict(test_data)
metrics.f1_score(test_labels, predict, average="micro")
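Python - SVM
# train_data, train_labels, test_data, test_labels are assumed to be prepared beforehand
from sklearn import metrics, svm
clf = svm.SVC(kernel="linear", C=0.1)  # named clf so the svm module is not shadowed
model = clf.fit(train_data, train_labels)
predict = model.predict(test_data)
metrics.f1_score(test_labels, predict, average="micro")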
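The two snippets above assume that train_data and test_data already exist as numeric features. One possible way to get there, sketched below, is to turn raw texts into word counts with scikit-learn's CountVectorizer. The six-document spam/ham dataset is made up for illustration and is far too small to say anything about real performance.

```python
from sklearn import metrics, naive_bayes, svm
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up dataset, for illustration only.
train_texts = [
    "win free prize now", "free money claim prize", "cheap prize win money",
    "meeting agenda for monday", "project report attached", "lunch meeting tomorrow",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_texts = ["claim free prize", "monday project meeting"]
test_labels = ["spam", "ham"]

# Learn the vocabulary on the training texts only, then reuse it for the test texts.
vectorizer = CountVectorizer()
train_data = vectorizer.fit_transform(train_texts)
test_data = vectorizer.transform(test_texts)

scores = {}
classifiers = [
    ("naive_bayes", naive_bayes.MultinomialNB()),
    ("svm", svm.SVC(kernel="linear", C=0.1)),
]
for name, clf in classifiers:
    clf.fit(train_data, train_labels)
    predicted = clf.predict(test_data)
    scores[name] = metrics.f1_score(test_labels, predicted, average="micro")
    print(name, list(predicted), scores[name])
```

Fitting the vectorizer on the training texts only keeps test-set vocabulary from leaking into the model.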
Supervised algorithms in NLTK
● NaiveBayesClassifier
● DecisionTreeClassifier
● ConditionalExponentialClassifier
● MaxentClassifier
● WekaClassifier
● SklearnClassifier
Topic Modeling
Topic Modeling
● Semantic text similarity
● Topic Modeling
Uses of Semantic Similarity
● Semantic similarity is a practical, widely used way to approach
natural language understanding in many core NLP tasks, such
as paraphrase identification, question answering, natural language
generation, and intelligent tutoring systems
Semantic Similarity
Wordnet
https://www.coursera.org/learn/python-text-mining
Similarity Measures
● Path Similarity
● Lowest Common Subsumer
● Lin Similarity
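The first two measures can be illustrated on a small is-a hierarchy. The sketch below uses a hypothetical toy taxonomy rather than WordNet data, and scores path similarity as 1 / (1 + length of the shortest path between the two concepts). Lin similarity is left out because it also needs information-content values estimated from a corpus.

```python
# Toy is-a taxonomy (child -> parent); hypothetical, not WordNet data.
PARENT = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
}

def ancestors(node):
    """Return the chain [node, parent, grandparent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lowest_common_subsumer(a, b):
    """Return the deepest concept that is an ancestor of both a and b."""
    chain_b = set(ancestors(b))
    for node in ancestors(a):      # walks upward, so the first hit is the lowest
        if node in chain_b:
            return node
    return None

def path_similarity(a, b):
    """Score 1 / (1 + shortest path length) through the common subsumer."""
    lcs = lowest_common_subsumer(a, b)
    if lcs is None:
        return 0.0
    distance = ancestors(a).index(lcs) + ancestors(b).index(lcs)
    return 1 / (1 + distance)

print(lowest_common_subsumer("dog", "cat"))
print(path_similarity("dog", "wolf"), path_similarity("dog", "cat"))
```

Concepts that share a nearby ancestor ("dog" and "wolf") score higher than concepts that only meet further up the hierarchy ("dog" and "cat").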
Topic Modeling
● Discovering hidden topical patterns that are present across the
collection
● Annotating documents according to these topics
● Using these annotations to organize, search and summarize texts
Implementation
● Latent Dirichlet Allocation
● TextRank
https://www.kdnuggets.com/2016/07/text-mining-101-topic-modeling.html
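As a rough sketch of Latent Dirichlet Allocation in practice, the block below uses scikit-learn's LatentDirichletAllocation on word counts. The four documents, the choice of two topics, and the fixed random seed are assumptions made for illustration; a corpus this small will not produce meaningful topics.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus, for illustration only.
docs = [
    "the striker scored a goal in the football match",
    "the team won the football league match",
    "the bank raised interest rates on loans",
    "investors watch interest rates and bank stocks",
]

# LDA works on raw word counts.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # one row per document, one column per topic

# Map column indices back to words and show the heaviest words per topic.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for topic_id, weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in weights.argsort()[::-1][:3]]
    print(topic_id, top_words)

print(doc_topics.round(2))
```

Each row of doc_topics is a distribution over topics for one document, which is what makes it usable for annotating, organizing, and searching a collection.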
