SlideShare a Scribd company logo
1 of 41
Download to read offline
Big Data and Automated Content Analysis
Week 5 – Monday
»Automated content analysis with NLP and
regular expressions«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
6 March 2017
Regular expressions Some more Natural Language Processing Take-home message & next steps
Today
1 ACA using regular expressions
Bottom-up vs. top-down
What is a regexp?
Using a regexp in Python
2 Some more Natural Language Processing
Stemming
Parsing sentences
3 Take-home message & next steps
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Automated content analysis using regular expressions
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Bottom-up vs. top-down
Bottom-up vs. top-down
Bottom-up
• Count most frequently occurring words (Week 2)
• Maybe better: Count combinations of words ⇒ Which words
co-occur together? (Chapter 9, not obligatory)
We don’t specify what to look for in advance
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Bottom-up vs. top-down
Bottom-up vs. top-down
Bottom-up
• Count most frequently occurring words (Week 2)
• Maybe better: Count combinations of words ⇒ Which words
co-occur together? (Chapter 9, not obligatory)
We don’t specify what to look for in advance
Top-down
• Count frequencies of pre-defined words (like in
BOW-sentiment analysis)
• Maybe better: patterns instead of words (regular
expressions, today!)
We do specify what to look for in advance
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
Regular Expressions: What and why?
What is a regexp?
• a very widespread way to describe patterns in strings
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
Regular Expressions: What and why?
What is a regexp?
• a very widespread way to describe patterns in strings
• Think of wildcards like * or operators like OR, AND or NOT in
search strings: a regexp does the same, but is much more
powerful
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
Regular Expressions: What and why?
What is a regexp?
• a very widespread way to describe patterns in strings
• Think of wildcards like * or operators like OR, AND or NOT in
search strings: a regexp does the same, but is much more
powerful
• You can use them in many editors (!), in the Terminal, in
STATA . . . and in Python
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
An example
From last week’s task
• We wanted to remove everything but words from a tweet
• We did so by calling the .replace() method
• We could do this with a regular expression as well:
[ˆa-zA-Z] would match anything that is not a letter
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
Basic regexp elements
Alternatives
[TtFf] matches either T or t or F or f
Twitter|Facebook matches either Twitter or Facebook
. matches any character
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
Basic regexp elements
Alternatives
[TtFf] matches either T or t or F or f
Twitter|Facebook matches either Twitter or Facebook
. matches any character
Repetition
* the expression before occurs 0 or more times
+ the expression before occurs 1 or more times
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
regexp quizz
Which words would be matched?
1 [Pp]ython
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
regexp quizz
Which words would be matched?
1 [Pp]ython
2 [A-Z]+
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
regexp quizz
Which words would be matched?
1 [Pp]ython
2 [A-Z]+
3 RT ?:? @[a-zA-Z0-9]*
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
What is a regexp?
What else is possible?
If you google regexp or regular expression, you’ll get a bunch
of useful overviews. The wikipedia page is not too bad, either.
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
How to use regular expressions in Python
The module re
re.findall("[Tt]witter|[Ff]acebook",testo) returns a list
with all occurances of Twitter or Facebook in the
string called testo
re.findall("[0-9]+[a-zA-Z]+",testo) returns a list with all
words that start with one or more numbers followed
by one or more letters in the string called testo
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
How to use regular expressions in Python
The module re
re.findall("[Tt]witter|[Ff]acebook",testo) returns a list
with all occurances of Twitter or Facebook in the
string called testo
re.findall("[0-9]+[a-zA-Z]+",testo) returns a list with all
words that start with one or more numbers followed
by one or more letters in the string called testo
re.sub("[Tt]witter|[Ff]acebook","a social medium",testo)
returns a string in which all all occurances of Twitter
or Facebook are replaced by "a social medium"
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
How to use regular expressions in Python
The module re
re.match(" +([0-9]+) of ([0-9]+) points",line) returns
None unless it exactly matches the string line. If it
does, you can access the part between () with the
.group() method.
Example:
1 line=" 2 of 25 points"
2 result=re.match(" +([0-9]+) of ([0-9]+) points",line)
3 if result:
4 print ("Your points:",result.group(1))
5 print ("Maximum points:",result.group(2))
Your points: 2
Maximum points: 25
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
Possible applications
Data preprocessing
• Remove unwanted characters, words, . . .
• Identify meaningful bits of text: usernames, headlines, where
an article starts, . . .
• filter (distinguish relevant from irrelevant cases)
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
Possible applications
Data analysis: Automated coding
• Actors
• Brands
• links or other markers that follow a regular pattern
• Numbers (!)
Big Data and Automated Content Analysis Damian Trilling
Example 1: Counting actors
1 import re, csv
2 from os import listdir, path
3 mypath ="/home/damian/artikelen"
4 matchcount54_list=[]
5 matchcount10_list=[]
6 filename_list = [f for f in listdir(mypath) if path.isfile(path.join(
mypath,f))]
7 for f in filename_list:
8 matchcount54=0
9 matchcount10=0
10 with open(path.join(mypath,f),mode="r",encoding="utf-8") as fi:
11 artikel=fi.readlines()
12 for line in artikel:
13 matches54 = re.findall(’Israel.*(minister|politician.*|[Aa]
uthorit)’,line)
14 matches10 = re.findall(’[Pp]alest’,line)
15 matchcount54+=len(matches54)
16 matchcount10+=len(matches10)
17 matchcount54_list.append(matchcount54)
18 matchcount10_list.append(matchcount10)
19 output=zip(filename_list,matchcount10_list,matchcount54_list)
20 with open("overzichtstabel.csv", mode=’w’,encoding="utf-8") as fo:
21 writer = csv.writer(fo)
22 writer.writerows(output)
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
Example 2: Which number has this Lexis Nexis article?
1 All Rights Reserved
2
3 2 of 200 DOCUMENTS
4
5 De Telegraaf
6
7 21 maart 2014 vrijdag
8
9 Brussel bereikt akkoord aanpak probleembanken;
10 ECB krijgt meer in melk te brokkelen
11
12 SECTION: Finance; Blz. 24
13 LENGTH: 660 woorden
14
15 BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt
16 over een saneringsfonds voor banken. Daarmee staat de laatste
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
Example 2: Check the number of a lexis nexis article
1 All Rights Reserved
2
3 2 of 200 DOCUMENTS
4
5 De Telegraaf
6
7 21 maart 2014 vrijdag
8
9 Brussel bereikt akkoord aanpak probleembanken;
10 ECB krijgt meer in melk te brokkelen
11
12 SECTION: Finance; Blz. 24
13 LENGTH: 660 woorden
14
15 BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt
16 over een saneringsfonds voor banken. Daarmee staat de laatste
1 for line in tekst:
2 matchObj=re.match(r" +([0-9]+) of ([0-9]+) DOCUMENTS",line)
3 if matchObj:
4 numberofarticle= int(matchObj.group(1))
5 totalnumberofarticles= int(matchObj.group(2))
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Using a regexp in Python
Practice yourself!
http://www.pyregex.com/
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Some more Natural Language Processing
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Some more NLP: What and why?
What can we do?
• remove stopwords (last week)
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Some more NLP: What and why?
What can we do?
• remove stopwords (last week)
• stemming
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Some more NLP: What and why?
What can we do?
• remove stopwords (last week)
• stemming
• Parse sentences (advanced)
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Stemming
NLP: What and why?
Why do stemming?
• Because we do not want to distinguish between smoke,
smoked, smoking, . . .
• Typical preprocessing step (like stopword removal)
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Stemming
Stemming
(with NLTK, see Bird, S., Loper, E., & Klein, E. (2009). Natural language processing
with Python. Sebastopol, CA: O’Reilly.)
1 from nltk.stem.snowball import SnowballStemmer
2 stemmer=SnowballStemmer("english")
3 frase="I am running while generously greeting my neighbors"
4 frasenuevo=""
5 for palabra in frase.split():
6 frasenuevo=frasenuevo + stemmer.stem(palabra) + " "
If we now did print(frasenuevo), it would return:
1 i am run while generous greet my neighbor
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Stemming
Stemming and stopword removal - let’s combine them!
1 from nltk.stem.snowball import SnowballStemmer
2 from nltk.corpus import stopwords
3 stemmer=SnowballStemmer("english")
4 stopwords = stopwords.words("english")
5 frase="I am running while generously greeting my neighbors"
6 frasenuevo=""
7 for palabra in frase.lower().split():
8 if palabra not in stopwords:
9 frasenuevo=frasenuevo + stemmer.stem(palabra) + " "
Now, print(frasenuevo) returns:
1 run generous greet neighbor
Perfect!
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Stemming
Stemming and stopword removal - let’s combine them!
1 from nltk.stem.snowball import SnowballStemmer
2 from nltk.corpus import stopwords
3 stemmer=SnowballStemmer("english")
4 stopwords = stopwords.words("english")
5 frase="I am running while generously greeting my neighbors"
6 frasenuevo=""
7 for palabra in frase.lower().split():
8 if palabra not in stopwords:
9 frasenuevo=frasenuevo + stemmer.stem(palabra) + " "
Now, print(frasenuevo) returns:
1 run generous greet neighbor
Perfect!
In order to use nltk.corpus.stopwords, you have to download that module once. You can do so by typing the
following in the Python console and selecting the appropriate package from the menu that pops up:
import nltk
nltk.download()
NB: Don’t download everything, that’s several GB.
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Parsing sentences
NLP: What and why?
Why parse sentences?
• To find out what grammatical function words have
• and to get closer to the meaning.
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Parsing sentences
Parsing a sentence
1 import nltk
2 sentence = "At eight o’clock on Thursday morning, Arthur didn’t feel
very good."
3 tokens = nltk.word_tokenize(sentence)
4 print (tokens)
nltk.word_tokenize(sentence) is similar to sentence.split(),
but compare handling of punctuation and the didn’t in the
output:
1 [’At’, ’eight’, "o’clock", ’on’, ’Thursday’, ’morning’,’Arthur’, ’did’,
"n’t", ’feel’, ’very’, ’good’, ’.’]
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Parsing sentences
Parsing a sentence
Now, as the next step, you can “tag” the tokenized sentence:
1 tagged = nltk.pos_tag(tokens)
2 print (tagged[0:6])
gives you the following:
1 [(’At’, ’IN’), (’eight’, ’CD’), ("o’clock", ’JJ’), (’on’, ’IN’),
2 (’Thursday’, ’NNP’), (’morning’, ’NN’)]
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Parsing sentences
Parsing a sentence
Now, as the next step, you can “tag” the tokenized sentence:
1 tagged = nltk.pos_tag(tokens)
2 print (tagged[0:6])
gives you the following:
1 [(’At’, ’IN’), (’eight’, ’CD’), ("o’clock", ’JJ’), (’on’, ’IN’),
2 (’Thursday’, ’NNP’), (’morning’, ’NN’)]
And you could get the word type of "morning" with
tagged[5][1]!
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Parsing sentences
More NLP
Look at http://nltk.org
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Take-home message
Take-home exam
Next meetings
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Take-home messages
What you should be familiar with:
• Possible steps to preprocess the data
• Regular expressions
• Word counts
Big Data and Automated Content Analysis Damian Trilling
Regular expressions Some more Natural Language Processing Take-home message & next steps
Next meetings
Wednesday
Statistics with pandas and Jupyter Notebook
Take-home exam
Big Data and Automated Content Analysis Damian Trilling

More Related Content

What's hot

What's hot (20)

BDACA1516s2 - Lecture7
BDACA1516s2 - Lecture7BDACA1516s2 - Lecture7
BDACA1516s2 - Lecture7
 
BDACA - Lecture2
BDACA - Lecture2BDACA - Lecture2
BDACA - Lecture2
 
BDACA - Lecture5
BDACA - Lecture5BDACA - Lecture5
BDACA - Lecture5
 
BDACA - Tutorial5
BDACA - Tutorial5BDACA - Tutorial5
BDACA - Tutorial5
 
BD-ACA week5
BD-ACA week5BD-ACA week5
BD-ACA week5
 
BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3
 
BDACA - Lecture4
BDACA - Lecture4BDACA - Lecture4
BDACA - Lecture4
 
BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2
 
Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)Analyzing social media with Python and other tools (1/4)
Analyzing social media with Python and other tools (1/4)
 
BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1
 
BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4
 
BDACA - Lecture6
BDACA - Lecture6BDACA - Lecture6
BDACA - Lecture6
 
BDACA - Lecture7
BDACA - Lecture7BDACA - Lecture7
BDACA - Lecture7
 
BDACA - Lecture8
BDACA - Lecture8BDACA - Lecture8
BDACA - Lecture8
 
BD-ACA week2
BD-ACA week2BD-ACA week2
BD-ACA week2
 
BD-ACA week1a
BD-ACA week1aBD-ACA week1a
BD-ACA week1a
 
Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4) Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (2/4)
 
BD-ACA Week6
BD-ACA Week6BD-ACA Week6
BD-ACA Week6
 
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
Big Data LDN 2017: Machine Learning on Structured Data. Why Is Learning Rules...
 
Data Visualization at Twitter
Data Visualization at TwitterData Visualization at Twitter
Data Visualization at Twitter
 

Viewers also liked

Viewers also liked (15)

Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"Data Science: Case "Political Communication 1/2"
Data Science: Case "Political Communication 1/2"
 
Guest lecture at Coding Culture, Utrecht
Guest lecture at Coding Culture, UtrechtGuest lecture at Coding Culture, Utrecht
Guest lecture at Coding Culture, Utrecht
 
Social TV Ratings: A Multi-Case Analysis From Turkish Television Industry
Social TV Ratings: A Multi-Case Analysis From Turkish Television IndustrySocial TV Ratings: A Multi-Case Analysis From Turkish Television Industry
Social TV Ratings: A Multi-Case Analysis From Turkish Television Industry
 
NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2NOVA Data Science Meetup 1/19/2017 - Presentation 2
NOVA Data Science Meetup 1/19/2017 - Presentation 2
 
Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"Data Science: Case "Political Communication 2/2"
Data Science: Case "Political Communication 2/2"
 
HELLO
HELLOHELLO
HELLO
 
Comm 125-Draft
Comm 125-DraftComm 125-Draft
Comm 125-Draft
 
HC0026.01.INTR
HC0026.01.INTRHC0026.01.INTR
HC0026.01.INTR
 
Konversi satuan
Konversi satuanKonversi satuan
Konversi satuan
 
Proceso administrativo unicor
Proceso administrativo unicorProceso administrativo unicor
Proceso administrativo unicor
 
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
SAMOA: A Platform for Mining Big Data Streams (Apache BigData North America 2...
 
シークレット・オブ・CSSシークレット改訂版
シークレット・オブ・CSSシークレット改訂版シークレット・オブ・CSSシークレット改訂版
シークレット・オブ・CSSシークレット改訂版
 
NLP from scratch
NLP from scratch NLP from scratch
NLP from scratch
 
GlusterFS as a DFS
GlusterFS as a DFSGlusterFS as a DFS
GlusterFS as a DFS
 
NLP_session-1
NLP_session-1NLP_session-1
NLP_session-1
 

Similar to BDACA1617s2 - Lecture5

Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
sagaroceanic11
 
Designing A Syntax Based Retrieval System03
Designing A Syntax Based Retrieval System03Designing A Syntax Based Retrieval System03
Designing A Syntax Based Retrieval System03
Avelin Huo
 

Similar to BDACA1617s2 - Lecture5 (20)

Python cheat-sheet
Python cheat-sheetPython cheat-sheet
Python cheat-sheet
 
Programming with Python
Programming with PythonProgramming with Python
Programming with Python
 
Text Mining Analytics 101
Text Mining Analytics 101Text Mining Analytics 101
Text Mining Analytics 101
 
The Python Programming Language and HDF5: H5Py
The Python Programming Language and HDF5: H5PyThe Python Programming Language and HDF5: H5Py
The Python Programming Language and HDF5: H5Py
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talk
 
Introduction to Python
Introduction to Python Introduction to Python
Introduction to Python
 
CPPDS Slide.pdf
CPPDS Slide.pdfCPPDS Slide.pdf
CPPDS Slide.pdf
 
Exam 1 Review CS 1803
Exam 1 Review CS 1803Exam 1 Review CS 1803
Exam 1 Review CS 1803
 
BD-ACA week3a
BD-ACA week3aBD-ACA week3a
BD-ACA week3a
 
What we can learn from Rebol?
What we can learn from Rebol?What we can learn from Rebol?
What we can learn from Rebol?
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
 
Crash-course in Natural Language Processing
Crash-course in Natural Language ProcessingCrash-course in Natural Language Processing
Crash-course in Natural Language Processing
 
Python slide
Python slidePython slide
Python slide
 
Designing A Syntax Based Retrieval System03
Designing A Syntax Based Retrieval System03Designing A Syntax Based Retrieval System03
Designing A Syntax Based Retrieval System03
 
Python introduction
Python introductionPython introduction
Python introduction
 
Let’s Learn Python An introduction to Python
Let’s Learn Python An introduction to Python Let’s Learn Python An introduction to Python
Let’s Learn Python An introduction to Python
 
Introduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information RetrievalIntroduction into Search Engines and Information Retrieval
Introduction into Search Engines and Information Retrieval
 
A Brief Overview of (Static) Program Query Languages
A Brief Overview of (Static) Program Query LanguagesA Brief Overview of (Static) Program Query Languages
A Brief Overview of (Static) Program Query Languages
 
Python For SEO specialists and Content Marketing - Hand in Hand
Python For SEO specialists and Content Marketing - Hand in HandPython For SEO specialists and Content Marketing - Hand in Hand
Python For SEO specialists and Content Marketing - Hand in Hand
 
Pa1 session 2
Pa1 session 2 Pa1 session 2
Pa1 session 2
 

More from Department of Communication Science, University of Amsterdam

More from Department of Communication Science, University of Amsterdam (8)

BDACA - Tutorial1
BDACA - Tutorial1BDACA - Tutorial1
BDACA - Tutorial1
 
BDACA - Lecture1
BDACA - Lecture1BDACA - Lecture1
BDACA - Lecture1
 
BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1
 
Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...
 
Conceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news itemsConceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news items
 
BDACA1516s2 - Lecture4
 BDACA1516s2 - Lecture4 BDACA1516s2 - Lecture4
BDACA1516s2 - Lecture4
 
BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1BDACA1516s2 - Lecture1
BDACA1516s2 - Lecture1
 
Should we worry about filter bubbles?
Should we worry about filter bubbles?Should we worry about filter bubbles?
Should we worry about filter bubbles?
 

Recently uploaded

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 

Recently uploaded (20)

microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
On National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan FellowsOn National Teacher Day, meet the 2024-25 Kenan Fellows
On National Teacher Day, meet the 2024-25 Kenan Fellows
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Role Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptxRole Of Transgenic Animal In Target Validation-1.pptx
Role Of Transgenic Animal In Target Validation-1.pptx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
 
ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701ComPTIA Overview | Comptia Security+ Book SY0-701
ComPTIA Overview | Comptia Security+ Book SY0-701
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 

BDACA1617s2 - Lecture5

  • 1. Big Data and Automated Content Analysis Week 5 – Monday »Automated content analysis with NLP and regular expressions« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 6 March 2017
  • 2. Regular expressions Some more Natural Language Processing Take-home message & next steps Today 1 ACA using regular expressions Bottom-up vs. top-down What is a regexp? Using a regexp in Python 2 Some more Natural Language Processing Stemming Parsing sentences 3 Take-home message & next steps Big Data and Automated Content Analysis Damian Trilling
  • 3. Regular expressions Some more Natural Language Processing Take-home message & next steps Automated content analysis using regular expressions Big Data and Automated Content Analysis Damian Trilling
  • 4. Regular expressions Some more Natural Language Processing Take-home message & next steps Bottom-up vs. top-down Bottom-up vs. top-down Bottom-up • Count most frequently occurring words (Week 2) • Maybe better: Count combinations of words ⇒ Which words co-occur together? (Chapter 9, not obligatory) We don’t specify what to look for in advance Big Data and Automated Content Analysis Damian Trilling
  • 5. Regular expressions Some more Natural Language Processing Take-home message & next steps Bottom-up vs. top-down Bottom-up vs. top-down Bottom-up • Count most frequently occurring words (Week 2) • Maybe better: Count combinations of words ⇒ Which words co-occur together? (Chapter 9, not obligatory) We don’t specify what to look for in advance Top-down • Count frequencies of pre-defined words (like in BOW-sentiment analysis) • Maybe better: patterns instead of words (regular expressions, today!) We do specify what to look for in advance Big Data and Automated Content Analysis Damian Trilling
  • 6. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? Regular Expressions: What and why? What is a regexp? • a very widespread way to describe patterns in strings Big Data and Automated Content Analysis Damian Trilling
  • 7. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? Regular Expressions: What and why? What is a regexp? • a very widespread way to describe patterns in strings • Think of wildcards like * or operators like OR, AND or NOT in search strings: a regexp does the same, but is much more powerful Big Data and Automated Content Analysis Damian Trilling
  • 8. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? Regular Expressions: What and why? What is a regexp? • a very widespread way to describe patterns in strings • Think of wildcards like * or operators like OR, AND or NOT in search strings: a regexp does the same, but is much more powerful • You can use them in many editors (!), in the Terminal, in STATA . . . and in Python Big Data and Automated Content Analysis Damian Trilling
  • 9. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? An example From last week’s task • We wanted to remove everything but words from a tweet • We did so by calling the .replace() method • We could do this with a regular expression as well: [ˆa-zA-Z] would match anything that is not a letter Big Data and Automated Content Analysis Damian Trilling
  • 10. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? Basic regexp elements Alternatives [TtFf] matches either T or t or F or f Twitter|Facebook matches either Twitter or Facebook . matches any character Big Data and Automated Content Analysis Damian Trilling
  • 11. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? Basic regexp elements Alternatives [TtFf] matches either T or t or F or f Twitter|Facebook matches either Twitter or Facebook . matches any character Repetition * the expression before occurs 0 or more times + the expression before occurs 1 or more times Big Data and Automated Content Analysis Damian Trilling
  • 12. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? regexp quizz Which words would be matched? 1 [Pp]ython Big Data and Automated Content Analysis Damian Trilling
  • 13. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? regexp quizz Which words would be matched? 1 [Pp]ython 2 [A-Z]+ Big Data and Automated Content Analysis Damian Trilling
  • 14. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? regexp quizz Which words would be matched? 1 [Pp]ython 2 [A-Z]+ 3 RT ?:? @[a-zA-Z0-9]* Big Data and Automated Content Analysis Damian Trilling
  • 15. Regular expressions Some more Natural Language Processing Take-home message & next steps What is a regexp? What else is possible? If you google regexp or regular expression, you’ll get a bunch of useful overviews. The wikipedia page is not too bad, either. Big Data and Automated Content Analysis Damian Trilling
  • 16. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python How to use regular expressions in Python The module re re.findall("[Tt]witter|[Ff]acebook",testo) returns a list with all occurances of Twitter or Facebook in the string called testo re.findall("[0-9]+[a-zA-Z]+",testo) returns a list with all words that start with one or more numbers followed by one or more letters in the string called testo Big Data and Automated Content Analysis Damian Trilling
  • 17. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python How to use regular expressions in Python The module re re.findall("[Tt]witter|[Ff]acebook",testo) returns a list with all occurances of Twitter or Facebook in the string called testo re.findall("[0-9]+[a-zA-Z]+",testo) returns a list with all words that start with one or more numbers followed by one or more letters in the string called testo re.sub("[Tt]witter|[Ff]acebook","a social medium",testo) returns a string in which all all occurances of Twitter or Facebook are replaced by "a social medium" Big Data and Automated Content Analysis Damian Trilling
  • 18. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python How to use regular expressions in Python The module re re.match(" +([0-9]+) of ([0-9]+) points",line) returns None unless it exactly matches the string line. If it does, you can access the part between () with the .group() method. Example: 1 line=" 2 of 25 points" 2 result=re.match(" +([0-9]+) of ([0-9]+) points",line) 3 if result: 4 print ("Your points:",result.group(1)) 5 print ("Maximum points:",result.group(2)) Your points: 2 Maximum points: 25 Big Data and Automated Content Analysis Damian Trilling
  • 19. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python Possible applications Data preprocessing • Remove unwanted characters, words, . . . • Identify meaningful bits of text: usernames, headlines, where an article starts, . . . • filter (distinguish relevant from irrelevant cases) Big Data and Automated Content Analysis Damian Trilling
  • 20. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python Possible applications Data analysis: Automated coding • Actors • Brands • links or other markers that follow a regular pattern • Numbers (!) Big Data and Automated Content Analysis Damian Trilling
  • 21. Example 1: Counting actors 1 import re, csv 2 from os import listdir, path 3 mypath ="/home/damian/artikelen" 4 matchcount54_list=[] 5 matchcount10_list=[] 6 filename_list = [f for f in listdir(mypath) if path.isfile(path.join( mypath,f))] 7 for f in filename_list: 8 matchcount54=0 9 matchcount10=0 10 with open(path.join(mypath,f),mode="r",encoding="utf-8") as fi: 11 artikel=fi.readlines() 12 for line in artikel: 13 matches54 = re.findall(’Israel.*(minister|politician.*|[Aa] uthorit)’,line) 14 matches10 = re.findall(’[Pp]alest’,line) 15 matchcount54+=len(matches54) 16 matchcount10+=len(matches10) 17 matchcount54_list.append(matchcount54) 18 matchcount10_list.append(matchcount10) 19 output=zip(filename_list,matchcount10_list,matchcount54_list) 20 with open("overzichtstabel.csv", mode=’w’,encoding="utf-8") as fo: 21 writer = csv.writer(fo) 22 writer.writerows(output)
  • 22. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python Example 2: Which number has this Lexis Nexis article? 1 All Rights Reserved 2 3 2 of 200 DOCUMENTS 4 5 De Telegraaf 6 7 21 maart 2014 vrijdag 8 9 Brussel bereikt akkoord aanpak probleembanken; 10 ECB krijgt meer in melk te brokkelen 11 12 SECTION: Finance; Blz. 24 13 LENGTH: 660 woorden 14 15 BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt 16 over een saneringsfonds voor banken. Daarmee staat de laatste Big Data and Automated Content Analysis Damian Trilling
  • 23. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python Example 2: Check the number of a lexis nexis article 1 All Rights Reserved 2 3 2 of 200 DOCUMENTS 4 5 De Telegraaf 6 7 21 maart 2014 vrijdag 8 9 Brussel bereikt akkoord aanpak probleembanken; 10 ECB krijgt meer in melk te brokkelen 11 12 SECTION: Finance; Blz. 24 13 LENGTH: 660 woorden 14 15 BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt 16 over een saneringsfonds voor banken. Daarmee staat de laatste 1 for line in tekst: 2 matchObj=re.match(r" +([0-9]+) of ([0-9]+) DOCUMENTS",line) 3 if matchObj: 4 numberofarticle= int(matchObj.group(1)) 5 totalnumberofarticles= int(matchObj.group(2)) Big Data and Automated Content Analysis Damian Trilling
  • 24. Regular expressions Some more Natural Language Processing Take-home message & next steps Using a regexp in Python Practice yourself! http://www.pyregex.com/ Big Data and Automated Content Analysis Damian Trilling
  • 25. Regular expressions Some more Natural Language Processing Take-home message & next steps Some more Natural Language Processing Big Data and Automated Content Analysis Damian Trilling
  • 26. Regular expressions Some more Natural Language Processing Take-home message & next steps Some more NLP: What and why? What can we do? • remove stopwords (last week) Big Data and Automated Content Analysis Damian Trilling
  • 27. Regular expressions Some more Natural Language Processing Take-home message & next steps Some more NLP: What and why? What can we do? • remove stopwords (last week) • stemming Big Data and Automated Content Analysis Damian Trilling
  • 28. Regular expressions Some more Natural Language Processing Take-home message & next steps Some more NLP: What and why? What can we do? • remove stopwords (last week) • stemming • Parse sentences (advanced) Big Data and Automated Content Analysis Damian Trilling
  • 29. Regular expressions Some more Natural Language Processing Take-home message & next steps Stemming NLP: What and why? Why do stemming? • Because we do not want to distinguish between smoke, smoked, smoking, . . . • Typical preprocessing step (like stopword removal) Big Data and Automated Content Analysis Damian Trilling
  • 30. Regular expressions Some more Natural Language Processing Take-home message & next steps Stemming Stemming (with NLTK, see Bird, S., Loper, E., & Klein, E. (2009). Natural language processing with Python. Sebastopol, CA: O’Reilly.) 1 from nltk.stem.snowball import SnowballStemmer 2 stemmer=SnowballStemmer("english") 3 frase="I am running while generously greeting my neighbors" 4 frasenuevo="" 5 for palabra in frase.split(): 6 frasenuevo=frasenuevo + stemmer.stem(palabra) + " " If we now did print(frasenuevo), it would return: 1 i am run while generous greet my neighbor Big Data and Automated Content Analysis Damian Trilling
  • 31. Regular expressions Some more Natural Language Processing Take-home message & next steps Stemming Stemming and stopword removal - let’s combine them! 1 from nltk.stem.snowball import SnowballStemmer 2 from nltk.corpus import stopwords 3 stemmer=SnowballStemmer("english") 4 stopwords = stopwords.words("english") 5 frase="I am running while generously greeting my neighbors" 6 frasenuevo="" 7 for palabra in frase.lower().split(): 8 if palabra not in stopwords: 9 frasenuevo=frasenuevo + stemmer.stem(palabra) + " " Now, print(frasenuevo) returns: 1 run generous greet neighbor Perfect! Big Data and Automated Content Analysis Damian Trilling
  • 32. Regular expressions Some more Natural Language Processing Take-home message & next steps Stemming Stemming and stopword removal - let’s combine them! 1 from nltk.stem.snowball import SnowballStemmer 2 from nltk.corpus import stopwords 3 stemmer=SnowballStemmer("english") 4 stopwords = stopwords.words("english") 5 frase="I am running while generously greeting my neighbors" 6 frasenuevo="" 7 for palabra in frase.lower().split(): 8 if palabra not in stopwords: 9 frasenuevo=frasenuevo + stemmer.stem(palabra) + " " Now, print(frasenuevo) returns: 1 run generous greet neighbor Perfect! In order to use nltk.corpus.stopwords, you have to download that module once. You can do so by typing the following in the Python console and selecting the appropriate package from the menu that pops up: import nltk nltk.download() NB: Don’t download everything, that’s several GB. Big Data and Automated Content Analysis Damian Trilling
  • 33.
  • 34. Regular expressions Some more Natural Language Processing Take-home message & next steps Parsing sentences NLP: What and why? Why parse sentences? • To find out what grammatical function words have • and to get closer to the meaning. Big Data and Automated Content Analysis Damian Trilling
  • 35. Regular expressions Some more Natural Language Processing Take-home message & next steps Parsing sentences Parsing a sentence 1 import nltk 2 sentence = "At eight o’clock on Thursday morning, Arthur didn’t feel very good." 3 tokens = nltk.word_tokenize(sentence) 4 print (tokens) nltk.word_tokenize(sentence) is similar to sentence.split(), but compare handling of punctuation and the didn’t in the output: 1 [’At’, ’eight’, "o’clock", ’on’, ’Thursday’, ’morning’,’Arthur’, ’did’, "n’t", ’feel’, ’very’, ’good’, ’.’] Big Data and Automated Content Analysis Damian Trilling
  • 36. Regular expressions Some more Natural Language Processing Take-home message & next steps Parsing sentences Parsing a sentence Now, as the next step, you can “tag” the tokenized sentence: 1 tagged = nltk.pos_tag(tokens) 2 print (tagged[0:6]) gives you the following: 1 [(’At’, ’IN’), (’eight’, ’CD’), ("o’clock", ’JJ’), (’on’, ’IN’), 2 (’Thursday’, ’NNP’), (’morning’, ’NN’)] Big Data and Automated Content Analysis Damian Trilling
  • 37. Regular expressions Some more Natural Language Processing Take-home message & next steps Parsing sentences Parsing a sentence Now, as the next step, you can “tag” the tokenized sentence: 1 tagged = nltk.pos_tag(tokens) 2 print (tagged[0:6]) gives you the following: 1 [(’At’, ’IN’), (’eight’, ’CD’), ("o’clock", ’JJ’), (’on’, ’IN’), 2 (’Thursday’, ’NNP’), (’morning’, ’NN’)] And you could get the word type of "morning" with tagged[5][1]! Big Data and Automated Content Analysis Damian Trilling
  • 38. Regular expressions Some more Natural Language Processing Take-home message & next steps Parsing sentences More NLP Look at http://nltk.org Big Data and Automated Content Analysis Damian Trilling
  • 39. Regular expressions Some more Natural Language Processing Take-home message & next steps Take-home message Take-home exam Next meetings Big Data and Automated Content Analysis Damian Trilling
  • 40. Regular expressions Some more Natural Language Processing Take-home message & next steps Take-home messages What you should be familiar with: • Possible steps to preprocess the data • Regular expressions • Word counts Big Data and Automated Content Analysis Damian Trilling
  • 41. Regular expressions Some more Natural Language Processing Take-home message & next steps Next meetings Wednesday Statistics with pandas and Jupyter Notebook Take-home exam Big Data and Automated Content Analysis Damian Trilling