1. Big Data and Automated Content Analysis
Week 5 – Monday
»Automated content analysis with NLP and
regular expressions«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
25 April 2014
2. Regular expressions Some more Natural Language Processing Take-home message & next steps
Today
1 ACA using regular expressions
What is a regexp?
Using a regexp in Python
2 Some more Natural Language Processing
Stemming
Parsing sentences
3 Take-home message & next steps
Big Data and Automated Content Analysis Damian Trilling
3.
Automated content analysis using regular expressions
6.
Regular Expressions: What and why?
What is a regexp?
• a very widespread way to describe patterns in strings
• Think of wildcards like * or operators like OR, AND or NOT in search strings: a regexp does the same, but is much more powerful
• You can use them in many editors (!), in the Terminal, in STATA . . . and in Python
7.
An example
From last week’s task
• We wanted to remove everything but words from a tweet
• We did so by calling the .replace() method
• We could do this with a regular expression as well: [^a-zA-Z] would match anything that is not a letter
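As a sketch of how this could look in code (the tweet text below is invented for illustration):

```python
import re

tweet = "Check this out: http://t.co/xyz #bigdata @damian0604!"
# [^a-zA-Z] matches any character that is NOT a letter;
# re.sub() replaces every such character with a space
cleaned = re.sub("[^a-zA-Z]", " ", tweet)
print(cleaned)
```

Note that this also replaces digits and punctuation inside words; whether that is desirable depends on the research question.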
9.
Basic regexp elements
Alternatives
[TtFf] matches either T or t or F or f
Twitter|Facebook matches either Twitter or Facebook
. matches any character
Repetition
* the expression before occurs 0 or more times
+ the expression before occurs 1 or more times
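A quick sketch of these elements with Python’s re module (all test strings are invented for illustration):

```python
import re

# [TtFf] matches a single character: T, t, F, or f
print(re.findall("[TtFf]", "Tweet from Facebook"))
# ['T', 't', 'f', 'F']

# Twitter|Facebook matches either complete word
print(re.findall("Twitter|Facebook", "on Twitter and Facebook"))
# ['Twitter', 'Facebook']

# . matches any single character
print(re.findall("b.t", "bat bit but"))
# ['bat', 'bit', 'but']

# * = the preceding expression occurs 0 or more times, + = 1 or more times
print(re.findall("ab*", "a ab abb"))   # ['a', 'ab', 'abb']
print(re.findall("ab+", "a ab abb"))   # ['ab', 'abb']
```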
12.
regexp quiz
Which words would be matched?
1 [Pp]ython
2 [A-Z]+
3 RT :* @[a-zA-Z0-9]*
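One way to check your answers to the quiz is to try the patterns on some invented test strings:

```python
import re

# 1: [Pp]ython matches both capitalizations, but not e.g. Jython
print(re.findall(r"[Pp]ython", "Python, python and Jython"))
# ['Python', 'python']

# 2: [A-Z]+ matches runs of one or more capital letters
print(re.findall(r"[A-Z]+", "the USA and the EU"))
# ['USA', 'EU']

# 3: matches retweet markers like "RT : @username"
print(re.findall(r"RT :* @[a-zA-Z0-9]*", "RT : @damian0604 nice slides"))
# ['RT : @damian0604']
```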
13.
What else is possible?
If you google regexp or regular expression, you’ll get a bunch of useful overviews. The Wikipedia page is not too bad, either.
15.
How to use regular expressions in Python
The module re
re.findall("[Tt]witter|[Ff]acebook",testo) returns a list with all occurrences of Twitter or Facebook in the string called testo
re.findall("[0-9]+[a-zA-Z]+",testo) returns a list with all words that start with one or more numbers followed by one or more letters in the string called testo
re.sub("[Tt]witter|[Ff]acebook","a social medium",testo) returns a string in which all occurrences of Twitter or Facebook are replaced by "a social medium"
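A minimal, runnable sketch of these three calls (the content of testo is made up):

```python
import re

testo = "I saw it on Twitter, then on facebook, in the 24hours liveblog."

print(re.findall("[Tt]witter|[Ff]acebook", testo))
# ['Twitter', 'facebook']

print(re.findall("[0-9]+[a-zA-Z]+", testo))
# ['24hours']

print(re.sub("[Tt]witter|[Ff]acebook", "a social medium", testo))
# I saw it on a social medium, then on a social medium, in the 24hours liveblog.
```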
16.
How to use regular expressions in Python
The module re
re.match(" +([0-9]+) of ([0-9]+) points",line) returns None unless the pattern matches at the beginning of the string line. If it does, you can access the parts between () with the .group() method.
Example:
line=" 2 of 25 points"
result=re.match(" +([0-9]+) of ([0-9]+) points",line)
if result:
    print ("Your points:",result.group(1))
    print ("Maximum points:",result.group(2))
This prints:
Your points: 2
Maximum points: 25
17.
Possible applications
Data preprocessing
• Remove unwanted characters, words, . . .
• Identify meaningful bits of text: usernames, headlines, where an article starts, . . .
• Filter (distinguish relevant from irrelevant cases)
18.
Possible applications
Data analysis: Automated coding
• Actors
• Brands
• Links or other markers that follow a regular pattern
• Numbers (!)
19. Example 1: Counting actors
import re, csv
from os import listdir, path

mypath = "/home/damian/artikelen"
filename_list = []
matchcount54_list = []
matchcount10_list = []
onlyfiles = [f for f in listdir(mypath) if path.isfile(path.join(mypath, f))]
for f in onlyfiles:
    matchcount54 = 0
    matchcount10 = 0
    with open(path.join(mypath, f), mode="r", encoding="utf-8") as fi:
        artikel = fi.readlines()
    for line in artikel:
        matches54 = re.findall('Israel.*(minister|politician.*|[Aa]uthorit)', line)
        matches10 = re.findall('[Pp]alest', line)
        matchcount54 += len(matches54)
        matchcount10 += len(matches10)
    filename_list.append(f)
    matchcount54_list.append(matchcount54)
    matchcount10_list.append(matchcount10)
output = zip(filename_list, matchcount10_list, matchcount54_list)
with open("overzichtstabel.csv", mode='w', encoding="utf-8", newline="") as fo:
    writer = csv.writer(fo)
    writer.writerows(output)
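The counting logic of Example 1 can be tried without any files on disk; the mini-articles below are invented for illustration. Note a subtlety: because the first pattern on the slide contains a group, re.findall() returns the group’s content rather than the whole match, but the number of items still equals the number of matches, which is all the counting needs.

```python
import re

# Two invented mini-articles, standing in for the files read from disk
articles = {
    "a1.txt": ["The Israeli minister spoke.", "Palestinian authorities reacted."],
    "a2.txt": ["No relevant actors here."],
}

counts = {}
for name, lines in articles.items():
    n = 0
    for line in lines:
        # same counting approach as in the example above
        n += len(re.findall("[Pp]alest", line))
    counts[name] = n

print(counts)
# {'a1.txt': 1, 'a2.txt': 0}
```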
21.
Example 2: Check the number of a Lexis Nexis article

All Rights Reserved

    2 of 200 DOCUMENTS

De Telegraaf

21 maart 2014 vrijdag

Brussel bereikt akkoord aanpak probleembanken;
ECB krijgt meer in melk te brokkelen

SECTION: Finance; Blz. 24
LENGTH: 660 woorden

BRUSSEL Europa heeft gisteren op de valreep een akkoord bereikt
over een saneringsfonds voor banken. Daarmee staat de laatste

for line in tekst:
    matchObj = re.match(r" +([0-9]+) of ([0-9]+) DOCUMENTS", line)
    if matchObj:
        numberofarticle = int(matchObj.group(1))
        totalnumberofarticles = int(matchObj.group(2))
22.
Practice yourself!
http://www.pyregex.com/
23.
Some more Natural Language Processing
26.
Some more NLP: What and why?
What can we do?
• Remove stopwords (last week)
• Stemming
• Parse sentences (advanced)
27.
NLP: What and why?
Why do stemming?
• Because we do not want to distinguish between smoke, smoked, smoking, . . .
• Typical preprocessing step (like stopword removal)
28.
Stemming
(with NLTK, see Bird, S., Loper, E., & Klein, E. (2009). Natural language processing with Python. Sebastopol, CA: O’Reilly.)
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
frase = "I am running while generously greeting my neighbors"
frasenuevo = ""
for palabra in frase.split():
    frasenuevo = frasenuevo + stemmer.stem(palabra) + " "
If we now did print(frasenuevo), it would return:
i am run while generous greet my neighbor
30.
Stemming and stopword removal - let’s combine them!
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
stemmer = SnowballStemmer("english")
stopwords = stopwords.words("english")
frase = "I am running while generously greeting my neighbors"
frasenuevo = ""
for palabra in frase.lower().split():
    if palabra not in stopwords:
        frasenuevo = frasenuevo + stemmer.stem(palabra) + " "
Now, print(frasenuevo) returns:
run generous greet neighbor
Perfect!
In order to use nltk.corpus.stopwords, you have to download that corpus once. You can do so by typing the following in the Python console and selecting the appropriate package from the menu that pops up:
import nltk
nltk.download()
NB: Don’t download everything, that’s several GB.
32.
NLP: What and why?
Why parse sentences?
• To find out what grammatical function words have
• and to get closer to the meaning.
33.
Parsing a sentence
import nltk
sentence = "At eight o'clock on Thursday morning, Arthur didn't feel very good."
tokens = nltk.word_tokenize(sentence)
print (tokens)
nltk.word_tokenize(sentence) is similar to sentence.split(), but compare the handling of punctuation and of didn’t in the output:
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', ',', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
35.
Parsing a sentence
Now, as the next step, you can “tag” the tokenized sentence:
tagged = nltk.pos_tag(tokens)
print (tagged[0:6])
gives you the following:
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN')]
And you could get the word type of "morning" with tagged[5][1]!
36.
More NLP
Look at http://nltk.org
37.
Take-home message
Take-home exam
Next meetings
38.
Take-home messages
What you should be familiar with:
• Steps for preprocessing the data
• Regular expressions
• Word counts and co-occurrences
39.
Next meetings: Week 6
This Wednesday is Koningsdag; no meeting. But take some time this week for recapping today’s topics (= Chapter 6).
Monday
Scraping and parsing (lecture)
Wednesday
Scraping and parsing (exercises)
40.
Take-home exam
You will receive detailed instructions NOW.
Grading criteria
• Task 1: correctness, comprehensiveness, depth, argumentation
• Task 2: comprehensiveness, feasibility, usefulness, argumentation, creativity, detail/reproducibility
• Task 3: does it run?; does it make sense?; advancedness of analys[ie]s