Big Data and Automated Content Analysis
Week 5 – Wednesday
»Regular expressions and NLP«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
29 April 2014
Regular expressions Some more Natural Language Processing Take-home message & next steps
Today
1 ACA using regular expressions
What is a regexp?
Using a regexp in Python
2 Some more Natural Language Processing
Stemming
Parsing sentences
3 Take-home message & next steps
Big Data and Automated Content Analysis Damian Trilling
Automated content analysis using regular expressions
What is a regexp?
Regular Expressions: What and why?
What is a regexp?
• a very widespread way to describe patterns in strings
• Think of wildcards like * or operators like OR, AND or NOT in
search strings: a regexp does the same, but is much more
powerful
• You can use them in many text editors (!), in the terminal, in
Stata, ... and in Python
An example
From last week’s task
• We wanted to remove everything but words from a tweet
• We did so by calling the .replace() method
• We could do this with a regular expression as well:
[^a-zA-Z] would match any character that is not a letter
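A minimal sketch of how such a pattern could be applied, using Python's re module. The example tweet and the variable names are made up for illustration, and a space is added to the character class (an assumption on my side) so that the words stay separated.

```python
import re

# Hypothetical example tweet (not taken from the course data)
tweet = "RT @user: Regexps are gr8!!! #python"

# Replace every character that is neither a letter nor a space with nothing
cleaned = re.sub("[^a-zA-Z ]", "", tweet)
print(cleaned)
```

Compared with chaining several .replace() calls, one pattern covers all unwanted characters at once.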
Basic regexp elements
Alternatives
[TtFf] matches either T or t or F or f
Twitter|Facebook matches either Twitter or Facebook
. matches any character (except a newline, by default)
Repetition
* the expression before occurs 0 or more times
+ the expression before occurs 1 or more times
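The elements above can be tried out directly in Python. This is only a sketch: the test strings are invented, and re.findall is used simply because it lists every match.

```python
import re

sentence = "I use Twitter and Facebook, not twitter"

# Alternation: one whole word or the other
alternatives = re.findall("Twitter|Facebook", sentence)
# Character class: exactly one of the listed characters, then "witter"
classes = re.findall("[Tt]witter", sentence)
# "." stands for any single character
dots = re.findall("b..k", "book back bk beak")
# "*" allows zero or more repetitions of the preceding "o"
star = re.findall("lo*l", "ll lol loool")
# "+" requires at least one repetition of the preceding "o"
plus = re.findall("lo+l", "ll lol loool")

print(alternatives, classes, dots, star, plus)
```

Note how "ll" is matched by lo*l but not by lo+l: that is the whole difference between the two repetition operators.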
Regexp quiz
Which words would be matched?
1 [Pp]ython
2 [A-Z]+
3 RT :* @[a-zA-Z0-9]*
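One way to check your answers is to let Python try each pattern on a few candidate strings. The candidates below are invented for this sketch; swap in your own guesses.

```python
import re

# Hypothetical candidate strings to test the three quiz patterns against
candidates = ["Python", "python", "PYTHON", "RT : @damian0604"]
patterns = ["[Pp]ython", "[A-Z]+", "RT :* @[a-zA-Z0-9]*"]

results = {}
for pattern in patterns:
    for text in candidates:
        # findall returns a (possibly empty) list of everything that matched
        results[(pattern, text)] = re.findall(pattern, text)
        print(repr(pattern), repr(text), results[(pattern, text)])
```

An empty list means the pattern did not match anywhere in that string.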
What else is possible?
If you search the web for regexp or regular expression, you’ll find plenty
of useful overviews. The Wikipedia page is not bad, either.
Using a regexp in Python
How to use regular expressions in Python
The module re
re.findall("[Tt]witter|[Ff]acebook",testo) returns a list
with all occurrences of Twitter or Facebook (capitalized
or not) in the string called testo
re.findall("[0-9]+[a-zA-Z]+",testo) returns a list with all
sequences of one or more digits followed directly
by one or more letters in the string called testo
re.sub("[Tt]witter|[Ff]acebook","a social medium",testo)
returns a string in which all occurrences of Twitter
or Facebook are replaced by "a social medium"
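A short sketch of these three calls on one string. The content of testo is invented for illustration.

```python
import re

# Hypothetical example string
testo = "Twitter and facebook had 12million and 3k users"

# All mentions of either platform, capitalized or not
mentions = re.findall("[Tt]witter|[Ff]acebook", testo)
# All runs of digits that are directly followed by letters
numbered = re.findall("[0-9]+[a-zA-Z]+", testo)
# A new string with every mention replaced; testo itself is unchanged
replaced = re.sub("[Tt]witter|[Ff]acebook", "a social medium", testo)

print(mentions)
print(numbered)
print(replaced)
```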
re.match(" +([0-9]+) of ([0-9]+) points",line) returns
None unless the pattern matches at the beginning of the
string line. If it does, you can access the parts between ()
with the .group() method.
Example:
    import re

    line = " 2 of 25 points"
    result = re.match(" +([0-9]+) of ([0-9]+) points", line)
    if result:
        # group(1) and group(2) hold the text captured by the two () pairs
        print("Your points:", result.group(1))
        print("Maximum points:", result.group(2))
Output:
    Your points: 2
    Maximum points: 25
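Because re.match only anchors the pattern at the start of the string, it helps to see it next to re.search and re.fullmatch, two related functions of the same module. The example strings are invented; this is a sketch of the difference, not part of the original exercise.

```python
import re

pattern = " +([0-9]+) of ([0-9]+) points"

# re.match only looks at the beginning of the string ...
at_start = re.match(pattern, " 2 of 25 points in total")
# ... so it returns None when the pattern occurs later in the string
not_at_start = re.match(pattern, "Score: 2 of 25 points")
# re.search finds the pattern anywhere in the string
anywhere = re.search(pattern, "Score: 2 of 25 points")
# re.fullmatch requires the whole string to match
whole = re.fullmatch(pattern, " 2 of 25 points in total")

print(at_start.groups(), not_at_start, anywhere.groups(), whole)
```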
Possible applications
Data preprocessing
• Remove unwanted characters, words, ...
• Identify meaningful bits of text: usernames, headlines, where
an article starts, ...
• Filter (distinguish relevant from irrelevant cases)
Data analysis: Automated coding
• Actors
• Brands
• Links or other markers that follow a regular pattern
• Numbers (!)
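As an illustration of such coding tasks, the sketch below pulls usernames, links and numbers out of a single post. The post and all three patterns are simplified assumptions made for this example; real usernames and URLs need more careful patterns.

```python
import re

# Hypothetical example post
post = "RT @damian0604: 3 new slides at http://www.damiantrilling.net, 40 pages"

# "@" followed by letters, digits or underscores
usernames = re.findall("@[a-zA-Z0-9_]+", post)
# "http://" or "https://" followed by anything up to a space or comma
links = re.findall("https?://[^ ,]+", post)
# Stand-alone numbers; \b keeps the digits inside "damian0604" out
numbers = re.findall(r"\b[0-9]+\b", post)

print(usernames, links, numbers)
```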
Example 1: Counting actors
    import csv
    import re
    from os import listdir, path

    mypath = "/home/damian/artikelen"
    filename_list = []
    matchcount54_list = []
    matchcount10_list = []

    # Keep only regular files, skip subdirectories
    onlyfiles = [f for f in listdir(mypath) if path.isfile(path.join(mypath, f))]
    for f in onlyfiles:
        matchcount54 = 0
        matchcount10 = 0
        with open(path.join(mypath, f), mode="r", encoding="utf-8") as fi:
            artikel = fi.readlines()
        # Count the matches of both patterns line by line
        for line in artikel:
            matches54 = re.findall(r"Israel.*(minister|politician.*|[Aa]uthorit)", line)
            matches10 = re.findall(r"[Pp]alest", line)
            matchcount54 += len(matches54)
            matchcount10 += len(matches10)
        # One entry per file in each list
        filename_list.append(f)
        matchcount54_list.append(matchcount54)
        matchcount10_list.append(matchcount10)

    # One row per file: filename, count for pattern 10, count for pattern 54
    output = zip(filename_list, matchcount10_list, matchcount54_list)
    # newline="" is what the csv module recommends for files it writes to
    with open("overzichtstabel.csv", mode="w", encoding="utf-8", newline="") as fo:
        writer = csv.writer(fo)
        writer.writerows(output)
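One detail of re.findall is worth knowing when counting this way: if the pattern contains a capturing group, findall returns the text of the group rather than the whole match. The sketch below uses an invented sentence; the non-capturing group (?:...) is standard re syntax, not something used in the original example.

```python
import re

# Hypothetical example line
line = "The Israeli minister met Israeli authorities and a Palestinian delegation"

# With a capturing group, findall returns only the group's text
with_group = re.findall("Israel.*(minister|politician.*|[Aa]uthorit)", line)
# With a non-capturing group (?:...), findall returns the whole match
without_group = re.findall("Israel.*(?:minister|politician.*|[Aa]uthorit)", line)

print(with_group)
print(without_group)
```

Counting with len() works either way, but in this sentence the greedy .* produces a single match even though two actors are mentioned, so the count is per matching stretch of text rather than per mention.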
Example 2: Check the number of a LexisNexis article
    All Rights Reserved

    2 of 200 DOCUMENTS

    De Telegraaf

    21 maart 2014 vrijdag

    Brussel bereikt akkoord aanpak probleembanken;
    ECB krijgt meer in melk te brokkelen

    SECTION: Finance; Blz. 24
    LENGTH: 660 woorden

    BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt
    over een saneringsfonds voor banken. Daarmee staat de laatste

    import re

    # tekst holds the lines of the file shown above
    for line in tekst:
        # The pattern expects the line to start with one or more spaces
        matchObj = re.match(r" +([0-9]+) of ([0-9]+) DOCUMENTS", line)
        if matchObj:
            numberofarticle = int(matchObj.group(1))
            totalnumberofarticles = int(matchObj.group(2))
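A self-contained sketch of the same loop that can be run without a LexisNexis file. The list tekst is a hand-made stand-in for the file's lines, and the leading spaces before "2 of 200 DOCUMENTS" are an assumption, made because the pattern requires them.

```python
import re

# Hypothetical stand-in for the result of reading the file with readlines()
tekst = [
    "All Rights Reserved\n",
    "\n",
    "    2 of 200 DOCUMENTS\n",
    "\n",
    "De Telegraaf\n",
]

numberofarticle = None
totalnumberofarticles = None
for line in tekst:
    matchObj = re.match(r" +([0-9]+) of ([0-9]+) DOCUMENTS", line)
    if matchObj:
        # group() returns strings, so convert them to integers
        numberofarticle = int(matchObj.group(1))
        totalnumberofarticles = int(matchObj.group(2))

print(numberofarticle, totalnumberofarticles)
```

Initializing both variables to None makes it easy to notice afterwards that no line matched.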
Practice yourself!
http://www.pyregex.com/
Some more Natural Language Processing
Some more NLP: What and why?
What can we do?
• Remove stopwords (last week)
• Stemming
• Parse sentences (advanced)
Stemming
NLP: What and why?
Why do stemming?
• Because we do not want to distinguish between smoke,
smoked, smoking, . . .
• Typical preprocessing step (like stopword removal)
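To make the idea concrete, here is a deliberately naive sketch that strips a few common English suffixes. The function toy_stem and its rule list are invented for this illustration; real stemmers such as Snowball apply far more careful rules.

```python
def toy_stem(word):
    """Strip one common English suffix (a toy rule set, not a real stemmer)."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s", "e"):
        # Only strip the suffix if at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


stems = [toy_stem(w) for w in ["smoke", "smoked", "smoking", "smokes"]]
print(stems)
```

All four forms end up with the same stem, which is the point of the exercise; note that a stem does not have to be a real word.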
Stemming
(with NLTK, see Bird, S., Loper, E., & Klein, E. (2009). Natural language processing
with Python. Sebastopol, CA: O’Reilly.)
    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("english")
    frase = "I am running while generously greeting my neighbors"
    frasenuevo = ""
    for palabra in frase.split():
        # Append the stem of each word, followed by a space
        frasenuevo = frasenuevo + stemmer.stem(palabra) + " "
print(frasenuevo) would now print:
    i am run while generous greet my neighbor
Stemming and stopword removal - let’s combine them!
    from nltk.stem.snowball import SnowballStemmer
    from nltk.corpus import stopwords

    stemmer = SnowballStemmer("english")
    # Note: this rebinds the name stopwords from the corpus reader to a plain list
    stopwords = stopwords.words("english")
    frase = "I am running while generously greeting my neighbors"
    frasenuevo = ""
    for palabra in frase.lower().split():
        # Skip stopwords, stem everything else
        if palabra not in stopwords:
            frasenuevo = frasenuevo + stemmer.stem(palabra) + " "
Now, print(frasenuevo) prints:
    run generous greet neighbor
Perfect!
To use nltk.corpus.stopwords, you have to download the stopwords corpus once. You can do so by typing the
following in the Python console and selecting the appropriate package from the menu that pops up:
    import nltk
    nltk.download()
Calling nltk.download("stopwords") instead fetches just that package, without the menu.
NB: Don’t download everything, that’s several GB.
Parsing sentences
NLP: What and why?
Why parse sentences?
• To find out what grammatical function words have
• and to get closer to the meaning.
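Before words can be assigned a grammatical function, a sentence has to be split into tokens. The sketch below compares plain str.split() with a simple regular-expression tokenizer; the pattern is my own simplification, and dedicated tokenizers such as the one in NLTK handle more cases (contractions, for instance).

```python
import re

sentence = "At eight o'clock on Thursday morning, Arthur didn't feel very good."

# str.split() only cuts at whitespace, so punctuation stays glued to words
by_split = sentence.split()
# Runs of word characters and apostrophes, or single punctuation marks
by_regexp = re.findall(r"[\w']+|[^\w\s]", sentence)

print(by_split)
print(by_regexp)
```

With split(), "morning," and "good." keep their punctuation; the regexp version turns the comma and the full stop into tokens of their own.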
Parsing a sentence
    import nltk

    sentence = "At eight o'clock on Thursday morning, Arthur didn't feel very good."
    tokens = nltk.word_tokenize(sentence)
    print(tokens)
nltk.word_tokenize(sentence) is similar to sentence.split(),
but compare the handling of punctuation and of didn’t in the
output:
    ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', ',', 'Arthur', 'did',
     "n't", 'feel', 'very', 'good', '.']
Now, as the next step, you can “tag” the tokenized sentence:
    tagged = nltk.pos_tag(tokens)
    print(tagged[0:6])
gives you the following:
    [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
     ('Thursday', 'NNP'), ('morning', 'NN')]
And you could get the part-of-speech tag of "morning" with
tagged[5][1]!
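Since the tagger returns an ordinary list of (word, tag) tuples, the usual list tools apply. The sketch below hard-codes the six pairs shown above so that it runs without NLTK; filtering on tags that start with "NN" is one possible way to keep only nouns in the Penn Treebank tag set.

```python
# The first six (word, tag) pairs from the output above, typed in by hand
tagged = [("At", "IN"), ("eight", "CD"), ("o'clock", "JJ"), ("on", "IN"),
          ("Thursday", "NNP"), ("morning", "NN")]

# Each element is a tuple: index 0 is the word, index 1 is the tag
word, tag = tagged[5]
print(word, tag)

# Keep only the words whose tag starts with "NN" (noun tags)
nouns = [w for w, t in tagged if t.startswith("NN")]
print(nouns)
```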
More NLP
Look at http://nltk.org
Take-home message
Take-home exam
Next meetings
Take-home messages
What you should be familiar with:
• Steps for preprocessing the data
• Regular expressions
• Word counts and co-occurrences
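As a compact recap, the sketch below chains these pieces: lowercasing, tokenizing with a regular expression, removing stopwords, and counting words. The text and the three-word stopword set are invented for illustration; in practice you would use a full stopword list.

```python
import re
from collections import Counter

# Hypothetical example text
text = "Regular expressions are powerful. Regular expressions are everywhere!"
# Tiny hand-made stopword set, for illustration only
stopwords = {"are", "the", "a"}

# Lowercase, then keep only runs of letters as tokens
tokens = re.findall("[a-z]+", text.lower())
# Drop the stopwords
tokens = [t for t in tokens if t not in stopwords]
# Count how often each remaining word occurs
counts = Counter(tokens)

print(counts.most_common(2))
```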
Take-home exam
You will receive detailed instructions NOW.
Next meetings: Week 6
No meeting on Monday (http://4en5mei.nl)
Wednesday
Scraping and parsing
Class 11th Physics NEET formula sheet pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 

BD-ACA week5

  • 1. Big Data and Automated Content Analysis Week 5 – Wednesday »Regular expressions and NLP« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 29 April 2014
  • 2. Regular expressions Some more Natural Language Processing Take-home message & next steps Today 1 ACA using regular expressions What is a regexp? Using a regexp in Python 2 Some more Natural Language Processing Stemming Parsing sentences 3 Take-home message & next steps Big Data and Automated Content Analysis Damian Trilling
  • 3. Automated content analysis using regular expressions
  • 4-6. What is a regexp? Regular expressions: what and why? • A very widespread way to describe patterns in strings • Think of wildcards like * or operators like OR, AND or NOT in search strings: a regexp does the same, but is much more powerful • You can use them in many editors (!), in the Terminal, in STATA . . . and in Python
  • 7. What is a regexp? An example from last week's task • We wanted to remove everything but words from a tweet • We did so by calling the .replace() method • We could do this with a regular expression as well: [^a-zA-Z] matches any single character that is not a letter
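A minimal sketch of how the pattern from this slide could replace a chain of .replace() calls. The tweet is a made-up example, and adding a space to the character class is an assumption made here so that the words stay separated; the slide's pattern itself is [^a-zA-Z].

```python
import re

# Hypothetical example tweet, not taken from the course data.
tweet = "RT @user123: Python is gr8!!! #bigdata http://example.com"

# Delete every character that is neither a letter nor a space.
# The space is kept (an addition to the slide's pattern) so that
# word boundaries survive the clean-up.
letters_only = re.sub(r"[^a-zA-Z ]", "", tweet)

# Split on whitespace to get the remaining "words".
words = letters_only.split()
print(words)
```

Note that digits inside words and the punctuation of the URL are removed as well, so "gr8" becomes "gr" and the URL collapses into a single letter string.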
  • 8-9. What is a regexp? Basic regexp elements. Alternatives: [TtFf] matches either T or t or F or f • Twitter|Facebook matches either Twitter or Facebook • . matches any character (except a newline, by default). Repetition: * the expression before occurs 0 or more times • + the expression before occurs 1 or more times
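The elements above can be tried out with Python's re module. This is only an illustrative sketch; both test strings are invented for the purpose.

```python
import re

# Hypothetical test strings, made up for illustration.
text = "Twitter, twitter, Facebook and Fcebook"
repeats = "gl gol gool"

# Character class: either T or t in the first position.
char_class = re.findall(r"[Tt]witter", text)

# Alternation: one whole word or the other.
alternation = re.findall(r"Twitter|Facebook", text)

# The dot stands for exactly one arbitrary character.
dot = re.findall(r"F.cebook", text)

# * allows zero or more o's, + requires at least one.
star = re.findall(r"go*l", repeats)
plus = re.findall(r"go+l", repeats)

print(char_class, alternation, dot, star, plus)
```

The misspelled "Fcebook" is not found by F.cebook because the dot has to consume one character, and "gl" is found by go*l but not by go+l.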
  • 10-12. What is a regexp? Regexp quiz: which words would be matched? 1 [Pp]ython 2 [A-Z]+ 3 RT :* @[a-zA-Z0-9]*
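One way to check your quiz answers is to run the three patterns against a test string. The sketch below takes the third pattern exactly as printed on the slide (including its two spaces); the sample sentence is invented.

```python
import re

# Hypothetical test sentence, made up for illustration.
sample = "RT : @damian0604 I love Python and python, says NASA"

# The three quiz patterns, copied as printed on the slide.
patterns = [r"[Pp]ython", r"[A-Z]+", r"RT :* @[a-zA-Z0-9]*"]

# Collect all matches per pattern.
results = {p: re.findall(p, sample) for p in patterns}
for pattern, matches in results.items():
    print(pattern, "->", matches)

# As printed, the third pattern needs a space before AND after the
# optional colons, so a retweet with a single space does not match.
single_space = re.findall(patterns[2], "RT @damian0604")
print(single_space)
```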
  • 13. What is a regexp? What else is possible? If you google regexp or regular expression, you'll get a bunch of useful overviews. The Wikipedia page is not too bad, either.
  • 14-15. Using a regexp in Python: How to use regular expressions in Python. The module re • re.findall("[Tt]witter|[Ff]acebook",testo) returns a list with all occurrences of Twitter or Facebook in the string called testo • re.findall("[0-9]+[a-zA-Z]+",testo) returns a list with all substrings that consist of one or more digits followed by one or more letters in the string called testo • re.sub("[Tt]witter|[Ff]acebook","a social medium",testo) returns a string in which all occurrences of Twitter or Facebook are replaced by "a social medium"
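The three calls above, put together in a small runnable sketch. The variable name testo follows the slide; its content is a made-up example.

```python
import re

# Hypothetical input string; only the name testo comes from the slide.
testo = "I use Twitter and facebook, and I own 2 phones and 3DS consoles since 2014."

# All occurrences of Twitter or Facebook, upper- or lower-case initial.
platforms = re.findall(r"[Tt]witter|[Ff]acebook", testo)

# One or more digits immediately followed by one or more letters.
digits_then_letters = re.findall(r"[0-9]+[a-zA-Z]+", testo)

# Same pattern as the first call, but replacing instead of listing.
replaced = re.sub(r"[Tt]witter|[Ff]acebook", "a social medium", testo)

print(platforms)
print(digits_then_letters)
print(replaced)
```

Neither "2 phones" nor "2014." shows up in the second list, because in both cases the digits are not directly followed by a letter.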
  • 16. Using a regexp in Python: How to use regular expressions in Python. The module re • re.match(" +([0-9]+) of ([0-9]+) points",line) returns None unless the pattern matches at the beginning of the string line. If it does, you can access the parts between () with the .group() method. Example:

        import re

        line = " 2 of 25 points"
        result = re.match(" +([0-9]+) of ([0-9]+) points", line)
        if result:
            # group(1) and group(2) hold the text captured by the two (...) groups
            print("Your points:", result.group(1))
            print("Maximum points:", result.group(2))

    Output:

        Your points: 2
        Maximum points: 25
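Because re.match only looks at the beginning of a string, it can help to compare it with re.search and re.fullmatch, which are also part of the standard re module. The sketch below reuses the pattern from the slide; the test strings are invented.

```python
import re

pattern = r" +([0-9]+) of ([0-9]+) points"

# re.match: the pattern must match at the start; trailing text is fine.
at_start = re.match(pattern, " 2 of 25 points (provisional)")

# The same pattern fails if the string starts with something else ...
not_at_start = re.match(pattern, "Score: 2 of 25 points")

# ... whereas re.search scans the whole string for the first match.
anywhere = re.search(pattern, "Score: 2 of 25 points")

# re.fullmatch requires the pattern to cover the entire string.
whole_string = re.fullmatch(pattern, " 2 of 25 points (provisional)")

print(at_start.groups(), not_at_start, anywhere.groups(), whole_string)
```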
  • 17. Using a regexp in Python: Possible applications. Data preprocessing • Remove unwanted characters, words, . . . • Identify meaningful bits of text: usernames, headlines, where an article starts, . . . • Filter (distinguish relevant from irrelevant cases)
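As a rough illustration of two of these preprocessing uses, the sketch below extracts usernames and filters for relevant cases. The tweets, the username pattern, and the topic word "election" are all assumptions made for this example.

```python
import re

# Hypothetical tweets, made up for illustration.
tweets = [
    "@alice have you seen what @bob_2014 wrote about the election?",
    "Nice weather today!!!",
    "Election results are in, says @newsdesk",
]

# Identify meaningful bits: an @ followed by letters, digits or underscores.
usernames = [re.findall(r"@[A-Za-z0-9_]+", t) for t in tweets]

# Filter: keep only the tweets that mention the (hypothetical) topic word.
relevant = [t for t in tweets if re.search(r"[Ee]lection", t)]

print(usernames)
print(relevant)
```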
  • 18. Using a regexp in Python: Possible applications. Data analysis: automated coding • Actors • Brands • Links or other markers that follow a regular pattern • Numbers (!)
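A small sketch of what such automated coding could look like for brands, links and numbers. The sentence, the brand names and the deliberately simple URL pattern are assumptions for illustration, not a robust URL parser.

```python
import re

# Hypothetical sentence, made up for illustration.
text = ("Apple sold 3 million phones, see http://example.com/report "
        "and https://example.org for 2014 figures.")

# Links: http or https, then everything up to whitespace or a comma.
links = re.findall(r"https?://[^\s,]+", text)

# Numbers: stand-alone runs of digits, converted to integers.
numbers = [int(n) for n in re.findall(r"\b[0-9]+\b", text)]

# Brands: count how often one of the (hypothetical) brand names occurs.
brand_count = len(re.findall(r"Apple|Samsung", text))

print(links, numbers, brand_count)
```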
  • 19. Example 1: Counting actors

        import csv
        import re
        from os import listdir, path

        mypath = "/home/damian/artikelen"
        filename_list = []
        matchcount54_list = []
        matchcount10_list = []

        # All regular files in the folder (subfolders are skipped)
        onlyfiles = [f for f in listdir(mypath) if path.isfile(path.join(mypath, f))]
        for f in onlyfiles:
            matchcount54 = 0
            matchcount10 = 0
            with open(path.join(mypath, f), mode="r", encoding="utf-8") as fi:
                artikel = fi.readlines()
            # Count the matches of both patterns line by line
            for line in artikel:
                matches54 = re.findall("Israel.*(minister|politician.*|[Aa]uthorit)", line)
                matches10 = re.findall("[Pp]alest", line)
                matchcount54 += len(matches54)
                matchcount10 += len(matches10)
            # One entry per file
            filename_list.append(f)
            matchcount54_list.append(matchcount54)
            matchcount10_list.append(matchcount10)

        # One row per file: filename, count for pattern 10, count for pattern 54
        output = zip(filename_list, matchcount10_list, matchcount54_list)
        with open("overzichtstabel.csv", mode="w", encoding="utf-8", newline="") as fo:
            writer = csv.writer(fo)
            writer.writerows(output)
  • 20-21. Using a regexp in Python: Example 2: Which number has this Lexis Nexis article? (Check the number of a Lexis Nexis article.) The file starts like this:

        All Rights Reserved

            2 of 200 DOCUMENTS

        De Telegraaf

        21 maart 2014 vrijdag

        Brussel bereikt akkoord aanpak probleembanken;
        ECB krijgt meer in melk te brokkelen

        SECTION: Finance; Blz. 24
        LENGTH: 660 woorden

        BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt
        over een saneringsfonds voor banken. Daarmee staat de laatste

    The pattern below expects the "2 of 200 DOCUMENTS" line to start with one or more spaces:

        import re

        # tekst is the list of lines of the Lexis Nexis file
        for line in tekst:
            matchObj = re.match(r" +([0-9]+) of ([0-9]+) DOCUMENTS", line)
            if matchObj:
                numberofarticle = int(matchObj.group(1))
                totalnumberofarticles = int(matchObj.group(2))
  • 22. Using a regexp in Python: Practice yourself! http://www.pyregex.com/
  • 23. Some more Natural Language Processing
  • 24-26. Some more NLP: What and why? What can we do? • Remove stopwords (last week) • Stemming • Parse sentences (advanced)
  • 27. Stemming. NLP: What and why? Why do stemming? • Because we do not want to distinguish between smoke, smoked, smoking, . . . • Typical preprocessing step (like stopword removal)
  • 28. Stemming (with NLTK, see Bird, S., Loper, E., & Klein, E. (2009). Natural language processing with Python. Sebastopol, CA: O'Reilly.)

        from nltk.stem.snowball import SnowballStemmer

        stemmer = SnowballStemmer("english")
        frase = "I am running while generously greeting my neighbors"
        # Stem each word and glue the stems back together into one string
        frasenuevo = ""
        for palabra in frase.split():
            frasenuevo = frasenuevo + stemmer.stem(palabra) + " "

    If we now did print(frasenuevo), it would print:

        i am run while generous greet my neighbor
  • 29-30. Stemming and stopword removal: let's combine them!

        from nltk.stem.snowball import SnowballStemmer
        from nltk.corpus import stopwords

        stemmer = SnowballStemmer("english")
        # Separate name, so that the imported stopwords module is not overwritten
        stopword_list = stopwords.words("english")
        frase = "I am running while generously greeting my neighbors"
        frasenuevo = ""
        for palabra in frase.lower().split():
            # Keep (and stem) only the words that are not stopwords
            if palabra not in stopword_list:
                frasenuevo = frasenuevo + stemmer.stem(palabra) + " "

    Now, print(frasenuevo) prints:

        run generous greet neighbor

    Perfect! In order to use nltk.corpus.stopwords, you have to download that corpus once. You can do so by typing the following in the Python console and selecting the appropriate package from the menu that pops up:

        import nltk
        nltk.download()

    NB: Don't download everything, that's several GB.
  • 31.
  • 32. Parsing sentences. NLP: What and why? Why parse sentences? • To find out what grammatical function words have • and to get closer to the meaning.
  • 33. Parsing sentences: Parsing a sentence

        import nltk

        sentence = "At eight o'clock on Thursday morning, Arthur didn't feel very good."
        tokens = nltk.word_tokenize(sentence)
        print(tokens)

    nltk.word_tokenize(sentence) is similar to sentence.split(), but compare the handling of punctuation and of the didn't in the output:

        ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', ',', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
  • 34-35. Parsing sentences: Parsing a sentence. Now, as the next step, you can "tag" the tokenized sentence:

        tagged = nltk.pos_tag(tokens)
        print(tagged[0:6])

    gives you the following:

        [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
         ('Thursday', 'NNP'), ('morning', 'NN')]

    And you could get the word type of "morning" with tagged[5][1]!
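Once a sentence is tagged, the (word, tag) pairs can be processed like any other Python list. In the sketch below the tagged list is copied from the slide output rather than computed, so it runs without NLTK; selecting nouns via the "NN" prefix is an assumption based on the Penn Treebank tag names (NN, NNS, NNP, NNPS).

```python
from collections import Counter

# Copied from the slide output, so NLTK is not needed to run this sketch.
tagged = [("At", "IN"), ("eight", "CD"), ("o'clock", "JJ"), ("on", "IN"),
          ("Thursday", "NNP"), ("morning", "NN")]

# The word type of "morning", as described on the slide.
morning_tag = tagged[5][1]

# Keep only the words whose tag starts with "NN" (the noun tags).
nouns = [word for word, tag in tagged if tag.startswith("NN")]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

print(morning_tag, nouns, tag_counts)
```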
  • 36. Parsing sentences: More NLP. Look at http://nltk.org
  • 37. Take-home message • Take-home exam • Next meetings
  • 38. Take-home messages. What you should be familiar with: • Steps to preprocess the data • Regular expressions • Word counts and co-occurrences
  • 39. Take-home exam: You will receive detailed instructions NOW.
  • 40. Next meetings: Week 6. Monday: no meeting (http://4en5mei.nl). Wednesday: scraping and parsing.