Natural Language Processing Basics
Tobias Deußer, Dr. Rafet Sifa
tobias.deusser@iais.fraunhofer.de
10/11/2022
Advanced Methods for Text Mining
Agenda
1. What is Natural Language Processing?
2. Preprocessing – Theory
3. Getting Started
Natural Language Processing – The Wikipedia Definition
From en.wikipedia.org/wiki/Natural_language_processing
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial
intelligence concerned with the interactions between computers and human language, in
particular how to program computers to process and analyze large amounts of natural language
data. The goal is a computer capable of “understanding” the contents of documents, including
the contextual nuances of the language within them. The technology can then accurately extract
information and insights contained in the documents as well as categorize and organize the
documents themselves.
Modern NLP in a nutshell
The usual NLP workflow / pipeline
- Data gathering
- Text parsing
- Text preprocessing
- Vectorization / featurization / embedding
- A downstream task
Downstream tasks
- Classification, e.g. sentiment or ratings
- Information extraction
  - Named Entity Recognition
  - Relation Extraction
- Natural Language Inference (NLI)
- Text generation
- Image generation
- …
Downstream tasks – Sentiment Analysis
Figure 1: Classifying text into sentiment, figure taken from Socher et al. 2013
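To get a feel for the task, here is a minimal sketch using the Hugging Face transformers pipeline; this is not the recursive model from Socher et al. 2013, just an off-the-shelf pretrained sentiment classifier:

    from transformers import pipeline

    # Downloads a default pretrained English sentiment model on first use.
    classifier = pipeline("sentiment-analysis")
    classifier("This movie was surprisingly good!")
    # e.g. [{'label': 'POSITIVE', 'score': 0.9998}]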
Downstream tasks – Information Extraction
Example from Deußer et al. 2022
“In 2021 and 2020 the total net revenue [kpi] was $100 [cy] million and $80 [py] million, respectively.”
Extracted relation pairs: (total net revenue [kpi], 100 [cy]), (total net revenue [kpi], 80 [py])
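The KPI extraction model itself is beyond the scope of a short snippet, but the underlying Named Entity Recognition step can be sketched with spaCy. A minimal sketch using a generic pretrained NER model (not the system from Deußer et al. 2022), assuming the small English model en_core_web_sm has been downloaded:

    import spacy

    # Assumes: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("In 2021 and 2020 the total net revenue was "
              "$100 million and $80 million, respectively.")
    [(ent.text, ent.label_) for ent in doc.ents]
    # e.g. [('2021', 'DATE'), ('2020', 'DATE'),
    #       ('$100 million', 'MONEY'), ('$80 million', 'MONEY')]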
Downstream tasks – Natural Language Inference
Example from Pielka et al. 2021
Premise: “The man is wearing an orange and black polo shirt and is kneeling with his lunch box in one hand while holding a banana in his other hand.”
Hypothesis: “A man is wearing a green t-shirt while holding a banana.”
Here the hypothesis contradicts the premise: the shirt colors disagree.
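As a rough illustration (again, not the system from Pielka et al. 2021), the publicly available roberta-large-mnli checkpoint classifies such sentence pairs as entailment, neutral, or contradiction:

    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
    model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

    premise = ("The man is wearing an orange and black polo shirt and is kneeling "
               "with his lunch box in one hand while holding a banana in his other hand.")
    hypothesis = "A man is wearing a green t-shirt while holding a banana."

    # Encode the sentence pair and pick the highest-scoring label.
    inputs = tok(premise, hypothesis, return_tensors="pt")
    label_id = model(**inputs).logits.argmax().item()
    print(model.config.id2label[label_id])  # e.g. 'CONTRADICTION'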
Downstream tasks – Text Generation
Figure 2: Generating text from a prompt by leveraging GPT-3 (Brown et al. 2020) on
https://beta.openai.com/playground
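GPT-3 itself is only reachable through OpenAI's API, but the task can be sketched locally with a smaller open model such as GPT-2 (a stand-in for illustration, not what the figure used):

    from transformers import pipeline

    # GPT-2 is small enough to download and run locally.
    generator = pipeline("text-generation", model="gpt2")
    generator("Natural language processing is", max_new_tokens=20)
    # e.g. [{'generated_text': 'Natural language processing is a ...'}]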
Downstream tasks – Image Generation
Figure 3: Generating images from a prompt by leveraging DALL·E mini (Dayma et al. 2021) on
https://www.craiyon.com/
The usual NLP workflow / pipeline
- Data gathering
- Text parsing
- Text preprocessing
- Vectorization / featurization / embedding
- A downstream task
Generating text embeddings 1
Generating text embeddings 2
- Text, i.e. strings, cannot be used in ML models without vectorization.
- We need an embedding model to convert text to a numerical representation.
- Examples of such embedding models (a tf-idf sketch follows below):
  - tf-idf
  - GloVe
  - Word2Vec
  - BERT
  - …
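As a first example, here is a minimal tf-idf sketch using scikit-learn's TfidfVectorizer; the three-document corpus is a made-up toy example:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["the bird sings", "the bird flies", "a plane flies"]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)       # sparse matrix: one row per document

    print(vectorizer.get_feature_names_out())  # vocabulary learnt from the corpus
    print(X.toarray().round(2))                # tf-idf weight of each word per document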
The usual NLP workflow / pipeline
- Data gathering
- Text parsing
- Text preprocessing
- Vectorization / featurization / embedding
- A downstream task
Text preprocessing – Sentence Tokenization
- Sentence tokenization describes the process of splitting a text corpus into sentences.
- Obviously, the key character to look for is the period (.), along with other punctuation marks, i.e. ;:!?
- However, simply splitting whenever one of them occurs might lead to wrong results, because:
  - The decimal separator occurs frequently in written text, e.g. 3.14.
  - Abbreviations are shortened with a period, e.g. abbr. for abbreviation.
  - Many more, see en.wikipedia.org/wiki/Full_stop
- A lot of implementations exist (see the sketch below):
  - Regex approach: github.com/mediacloud/sentence-splitter
  - Unsupervised approach: github.com/nltk/nltk
  - …
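A short sketch of the unsupervised approach via NLTK's pretrained Punkt tokenizer; the example text is made up, and the exact splits depend on the shipped Punkt model:

    from nltk.tokenize import sent_tokenize  # requires: nltk.download('punkt')

    text = "Dr. Smith measured pi as approx. 3.14. The class applauded."
    sent_tokenize(text)
    # e.g. ['Dr. Smith measured pi as approx. 3.14.', 'The class applauded.']

Note that a naive split on every period would have broken the abbreviations and the decimal number apart.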
Text preprocessing – Word Tokenization
- Word tokenization describes the process of splitting a sentence (or any text) into words.
- Here, we mainly look for spaces and have to take care of punctuation marks.

    from nltk.tokenize import word_tokenize

    s = '''Good muffins cost $3.88\nin New York. Please buy me two of them.\n\nThanks.'''
    word_tokenize(s)
    ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please',
     'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

Listing 1: Example from the nltk documentation
Text preprocessing – Subword Tokenization
- Modern NLP approaches (e.g. Transformers like BERT) tokenize the input a step further, splitting unknown (i.e., not in the vocabulary) words into meaningful subwords while preserving frequently used words.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    tokenizer.tokenize("Don't you love wasting GPU capacity by training "
                       "transformer models for your Advanced Method in Text Mining course?")
    ['don', "'", 't', 'you', 'love', 'wasting', 'gp', '##u', 'capacity', 'by',
     'training', 'transform', '##er', 'models', 'for', 'your', 'advanced',
     'method', 'in', 'text', 'mining', 'course', '?']

Listing 2: Example using the BertTokenizer from the transformers package
Text preprocessing – Part-Of-Speech Tagging
- Part-Of-Speech, or POS, tagging refers to the process of assigning “word classes” to each token.
- These “word classes” are the ones you probably learnt in elementary school: nouns, verbs, adjectives, adverbs, …
- Methods:
  - Dictionary lookup, e.g. with the Penn Treebank tagset (Marcinkiewicz 1994)
  - Hidden Markov models
  - Supervised models
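A quick sketch using NLTK's pretrained tagger; the tags follow the Penn Treebank tagset and may vary slightly across tagger versions:

    # requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
    from nltk import pos_tag, word_tokenize

    pos_tag(word_tokenize("She saw a bird."))
    # e.g. [('She', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('bird', 'NN'), ('.', '.')]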
Text preprocessing – Stop Words Removal 1
- Stop words are the most common words in a language (articles, prepositions, pronouns, conjunctions, etc.) and often do not add much information to a sentence.

    from nltk.corpus import stopwords

    stopwords.words('english')
    ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're",
     "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he',
     'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it',
     "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
     'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those',
     'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
     'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',
     'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with',
     'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
     'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off',
     'over', 'under', 'again', 'further', 'then', 'once', 'here', ...]

Listing 3: Stop words from the NLTK package
Text preprocessing – Stop Words Removal 2
- But there are issues with this: removing stop words can also delete negations and thereby flip the meaning of a sentence …

    from nltk.corpus import stopwords

    sentence = 'The dish was not tasty at all'
    [word for word in sentence.split() if word.lower() not in stopwords.words('english')]
    ['dish', 'tasty']

Listing 4: Removing “unimportant” words
Text preprocessing – Stemming 1
- Stemming normalizes words to their base stem.
- Examples:
  - “liked” becomes “like”
  - “birds” becomes “bird”
  - “itemization” becomes “item”
- Thus, these words are treated similarly.
- However, there are problems with this …
Text preprocessing – Stemming 2
- Overstemming – when the algorithm stems unrelated words to the same root
- Understemming – the opposite: related words are not reduced to a common root

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    words_to_be_stemmed = ['liked', 'birds', 'itemization', 'universal',
                           'university', 'universe', 'alumnus', 'alumni']
    [stemmer.stem(word) for word in words_to_be_stemmed]
    ['like', 'bird', 'item', 'univers', 'univers', 'univers', 'alumnu', 'alumni']

Listing 5: Examples of correct and incorrect stemming
Text preprocessing – Lemmatization
- Lemmatization is very similar to stemming, but tries to incorporate context and aligns words with their lemma.
- Often leverages POS tags.
- Usually preferred over stemming (contextual analysis vs. hard-coded rules).
- Examples (see the sketch below):
  - “went” becomes “go”
  - “better” becomes “good”
  - “meeting” becomes either “meet” or “meeting”, based on the POS tag
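A minimal sketch with NLTK's WordNet lemmatizer; here the POS tag is supplied by hand, whereas a full pipeline would take it from a POS tagger:

    from nltk.stem import WordNetLemmatizer  # requires: nltk.download('wordnet')

    lem = WordNetLemmatizer()
    lem.lemmatize('better', pos='a')   # 'good'    (adjective)
    lem.lemmatize('meeting', pos='v')  # 'meet'    (verb)
    lem.lemmatize('meeting', pos='n')  # 'meeting' (noun)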
The usual NLP workflow / pipeline
- Data gathering
- Text parsing
- Text preprocessing
- Vectorization / featurization / embedding
- A downstream task
Data Gathering & Text Parsing
- Data sources:
  - Web scraping
  - PDFs, Word files, Excel files, scanned documents, …
  - Public datasets
  - Your customers
  - …
- Parse the raw data into your preferred data structure.
Getting Started – A few things first
- We will use Python for examples & assignments.
- You are free to either use Jupyter Notebooks or a proper IDE like PyCharm or VSCode.
- b-it has some GPU pools that we can use for assignments.
Getting Started – Jupyter Notebooks
- It is essential that your chosen provider offers GPU support.
- There are a few “free” providers out there:
  - Google Colab
  - Kaggle
  - More on the web?
Getting Started – Your preferred IDE
- Implement your code with your own IDE.
- Access GPUs from b-it or your own “sick gaming rig”?
- Example IDEs:
  - PyCharm Pro (free for students)
  - VSCode
- Check https://www.jetbrains.com/help/pycharm/configuring-remote-interpreters-via-ssh.html for how to configure a remote interpreter with PyCharm Pro.
Getting Started – Assignments
- Assignments should be delivered as a .pdf file containing the written answers and/or code.
- Format your code in a dedicated environment, see here for an example: https://www.overleaf.com/learn/latex/Code_listing
- Email your PDF file to bit.am4tm@gmail.com
Contact
Try this email address first: bit.am4tm@gmail.com
Tobias Deußer
tobias.deusser@iais.fraunhofer.de
Rafet Sifa
rafet.sifa@iais.fraunhofer.de
References I
Socher, Richard, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning,
Andrew Y Ng, and Christopher Potts (2013). “Recursive deep models for
semantic compositionality over a sentiment treebank”. In: Proc. EMNLP,
pp. 1631–1642.
Deußer, Tobias, Syed Musharraf Ali, Lars Hillebrand, Desiana Nurchalifah,
Basil Jacob, Christian Bauckhage, and Rafet Sifa (2022). “KPI-EDGAR: A Novel
Dataset and Accompanying Metric for Relation Extraction from Financial
Documents”. In: Proc. ICMLA (to be published). doi:
10.48550/arXiv.2210.09163.
Pielka, Maren, Rafet Sifa, Lars Patrick Hillebrand, David Biesner,
Rajkumar Ramamurthy, Anna Ladi, and Christian Bauckhage (2021). “Tackling
contradiction detection in German using machine translation and end-to-end
recurrent neural networks”. In: Proc. ICPR, pp. 6696–6701.
References II
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan,
Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, et al. (2020). “Language models are few-shot learners”. In: Proc.
NIPS, pp. 1877–1901.
Dayma, Boris, Suraj Patil, Pedro Cuenca, Khalid Saifullah, Tanishq Abraham,
Phúc Lê Khắc, Luke Melas, and Ritobrata Ghosh (July 2021). DALL·E Mini. url:
https://github.com/borisdayma/dalle-mini.
Marcinkiewicz, Mary Ann (1994). “Building a large annotated corpus of English: The
Penn Treebank”. In: Using Large Corpora 273.