In this workshop, we covered:
- How to code in Python using Jupyter Notebook
- What NLP (Natural Language Processing) is and how it is used to analyze the sentiment of a given text
- Hands-on coding exercises using Python and its libraries to perform tokenization, stopword removal, stemming and lemmatization, and POS tagging
2. Pre-Workshop Checklist
⬡ 1. This is pretty obvious… but do you have your laptop with you? If you don't… perhaps go grab it?
⬡ 2. Did you download Anaconda?
⬡ 3. Do you have access to the TD WS 2 Shared Folder?
⬡ 4. If you said "no" to questions 2 or 3 → go to the AIS website for instructions!
3. AIS Upcoming Events
⬡ No Speaker Series next Monday, 10/21
⬡ EY Office Visit - next Thursday, 10/24, 9:00AM - 12:00PM
∙ Find the signup in the newsletter
⬡ PD Meeting: Friday, 10/25, 12:00 – 12:50
∙ Talking Tech with Ilya Rogov
4. Hello!
I am Michelle Purnama
I hope you're all excited to learn Python with us! Don't be scared - this Python won't bite :)
6. Python
⬡ Python is an interpreted, high-level, general-purpose programming language
⬡ It supports the use of modules and packages
⬡ Code can be reused in a variety of projects by importing and exporting these modules
This Python? Or this Python?
9. Anaconda
⬡ Free and open-source distribution of the Python and R programming languages that aims to simplify package management & deployment
⬡ In this workshop, we are using Anaconda to install Python and Jupyter Notebook
10. Jupyter Notebook
⬡ Open-source web application that allows you to create & share documents that contain live code, equations, visualizations, and narrative text
⬡ A powerful way to iterate on our Python code, writing lines of code and running them one at a time
12. Text Analytics & NLP
⬡ Most text generated day-to-day is unstructured
⬡ NLP - Natural Language Processing
⬡ NLP enables computers to interact with humans in a natural manner
⬡ Example: analyzing movie reviews
13. Text Analytics Operations using NLTK
⬡ NLTK - Natural Language Toolkit
⬡ Python package that provides a diverse set of natural language algorithms
⬡ Free, open source, easy to use, well documented
⬡ Helps computers analyze, preprocess, and understand written text
16. Tokenization
⬡ First step in text analytics
⬡ The process of breaking down a text paragraph into smaller chunks such as words or sentences
⬡ Token - a single unit that serves as a building block of a sentence or paragraph
⬡ nltk.tokenize - a module inside the NLTK package
17. Sentence & Word Tokenization
Sentence Tokenization
⬡ Breaks a text paragraph into sentences
⬡ Import sent_tokenize
Word Tokenization
⬡ Breaks a text paragraph into words
⬡ Import word_tokenize
18. Frequency Distribution
⬡ Frequency of occurrence of each word in a text
⬡ Import FreqDist from the nltk.probability module
⬡ Import the matplotlib package to plot the chart
19. Do It Yourself!
Choose any story from the Funny Halloween Stories link and plot a frequency distribution using Python! Boo!
21. Stopwords
⬡ Noise in the text
⬡ Examples: is, am, are, this, a, an, the
⬡ We need to create a list of stopwords and filter these words out of our list of tokens
Wow, that's a mouthful
22. Do It Yourself!
Use the same story you picked in Phase 1 and remove the stopwords from that text. Let's do it!
24. Lexicon Normalization
⬡ Reduces derivationally related forms of a word to a common root word
⬡ For example, connection, connected, and connecting all reduce to the common word "connect"
25. Stemming & Lemmatization
Stemming
⬡ Reduces words to their root word / chops off the derivational affixes
⬡ Does not consider the context of the word
Lemmatization
⬡ More sophisticated
⬡ Reduces words to their base word - linguistically correct lemmas
⬡ Considers the context of the word
27. POS Tagging
⬡ Part-of-Speech (POS) tagging identifies the grammatical group of a given word based on its context
⬡ For example: noun, pronoun, adjective, verb, adverb, etc.
28. Do It Yourself!
Choose a sentence from the Halloween Story and apply POS tags to the tokenized sentence!
30. Text Classification
⬡ Important task in text mining
⬡ Identifying the category/class of a given text, such as a blog, book, web page, or tweet
⬡ Various applications: spam detection, classifying website content for a search engine, analyzing the sentiment of customer feedback, etc.
34. What We’ve Learned Today..
⬡ Break down paragraphs into smaller chunks
⬡ Remove punctuation and stopwords to eliminate noise
⬡ Use Stemming & Lemmatization to reduce words to their base words
⬡ Understand Part-of-Speech tagging
⬡ Create simple graphs in Python
⬡ Scratch a bit of the surface of Sentiment Text Analysis!
37. Additional Learning Resources
⬡ To read more about Text Analysis
∙ https://monkeylearn.com/text-analysis/
⬡ More advanced Text Analysis tutorial
∙ https://www.dataquest.io/blog/tutorial-text-analysis-python-test-hypothesis
⬡ Bootcamp course on Python
∙ https://www.udemy.com/course/complete-python-bootcamp/
Python is an interpreted, high-level, general-purpose programming language.
Python supports the use of modules and packages, which means that programs can be designed in a modular style and code can be reused across a variety of projects. Once you've developed a module or package you need, it can be scaled for use in other projects, and it's easy to import or export these modules.
What is Anaconda?
Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing that aims to simplify package management and deployment.
In this workshop, we will use Anaconda to install Python and Jupyter Notebook, as Anaconda also includes other commonly used packages for scientific computing and data science (and in this case, for text analytics!).
What is Jupyter Notebook?
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.
Jupyter Notebooks are a powerful way to write and iterate on your Python code for data analysis. Rather than writing and re-writing an entire program, you can write lines of code and run them one at a time.
NLP enables the computer to interact with humans in a natural manner.
It helps the computer to understand the human language and derive meaning from it.
Analyzing movie reviews is one of the classic examples used to demonstrate a simple NLP bag-of-words model.
NLTK is a powerful Python package that provides a diverse set of natural language algorithms. It is free, open source, easy to use, well documented, and has a large community. NLTK helps the computer analyze, preprocess, and understand written text.
Going back to the phase slide, NLTK consists of the most common algorithms such as tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition.
Talk about package → module → class (draw Venn diagram on whiteboard maybe?)
# Frequency Distribution Plot
# text holds the raw story string from the earlier steps
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt
fdist = FreqDist(word_tokenize(text))
fdist.plot(30, cumulative=False)
plt.show()
https://matplotlib.org/tutorials/introductory/pyplot.html#pyplot-tutorial
Lexicon normalization addresses another type of noise in the text.
It reduces derivationally related forms of a word to a common root word.
For example, connection, connected, and connecting all reduce to the common word "connect".
Stemming
A process of linguistic normalization that reduces words to their root word, or chops off the derivational affixes.
Lemmatization
More sophisticated than stemming; reduces words to their base word, which is a linguistically correct lemma.
A lemma is a word that stands at the head of a definition in a dictionary. All the head words in a dictionary are lemmas. Technically, it is "a base word and its inflections".
A stemmer works on an individual word without knowledge of its context. For example, the word "better" has "good" as its lemma. Stemming misses this, because finding the lemma requires a dictionary look-up.
The primary target of Part-of-Speech (POS) tagging is to identify the grammatical group of a given word: whether it is a noun, pronoun, adjective, verb, adverb, etc., based on the context.
POS tagging looks for relationships within the sentence and assigns a corresponding tag to each word.
List of POS tags:
https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b
Text classification is one of the important tasks of text mining.
It means identifying the category or class of a given text, such as a blog, book, web page, news article, or tweet.
It has various applications in today's computing world, such as spam detection, task categorization in CRM services, categorizing products on e-retailer websites, classifying website content for a search engine, analyzing the sentiment of customer feedback, etc.
What do users and the general public think about the latest feature?
You can quantify such information with reasonable accuracy using sentiment analysis.
Quantifying users' content, ideas, beliefs, and opinions is known as sentiment analysis.
Human communication is not limited to words; it is more than words. Sentiment is a combination of words, tone, and writing style.
Two approaches:
Lexicon-based: Count the number of positive and negative words in the given text; the larger count determines the sentiment of the text.
Machine learning based: Develop a classification model, trained on a pre-labeled dataset of positive, negative, and neutral examples.
In this tutorial, you will use the second (machine learning based) approach. This is how you learn sentiment and text classification with a single example.
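Although the tutorial uses the machine learning approach, the lexicon-based approach is simple enough to sketch in a few lines (the tiny word lists here are illustrative, not a real sentiment lexicon):

```python
# Lexicon-based sentiment: count positive vs. negative words.
# These word sets are toy examples for illustration only.
POSITIVE = {'good', 'great', 'excellent', 'love'}
NEGATIVE = {'bad', 'terrible', 'boring', 'hate'}

def lexicon_sentiment(tokens):
    """Return the sentiment label with the larger word count."""
    pos = sum(1 for w in tokens if w.lower() in POSITIVE)
    neg = sum(1 for w in tokens if w.lower() in NEGATIVE)
    if pos > neg:
        return 'positive'
    if neg > pos:
        return 'negative'
    return 'neutral'

print(lexicon_sentiment(['this', 'movie', 'was', 'great']))      # positive
print(lexicon_sentiment(['boring', 'and', 'terrible', 'plot']))  # negative
```

A real system would use a curated lexicon (and handle negation, intensity, etc.), which is part of why the machine learning approach tends to work better.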
Break down paragraphs into smaller chunks like sentences or words.
Remove punctuation and stopwords to increase the accuracy of our analysis.
Use Stemming or Lemmatization to reduce words to their base words.
Understand Part-of-Speech tagging.
Create simple graphs in Python.
Scratch a bit of the surface of Sentiment Text Analysis!