Requirements Engineering: focus on Natural Language Processing, Lecture 2

Requirements Engineering:
Focus on Natural Language
Requirements Processing
Alessio Ferrari1
1ISTI-CNR, Pisa, Italy
January 15, 2016
A. Ferrari (ISTI) Requirements Engineering 1 / 88
cf. https://docplayer.net/16051752-Introduction-to-programming-languages-and-
techniques-xkcd-com-full-python-tutorial.html

Objectives of the Course
Objectives: lesson 1
Understand what are requirements, speciﬁcations
and domain assertions
Understand the problems behind requirements elicitation
and SRS deﬁnition
Learn how to write a decent SRS
Objectives: lesson 2
Learn the basics of Natual Language Processing (NLP)
Learn the relation between NLP and Requirements
Learn the basics of GATE to detect NL ambiguities in
requirements
Learn the basics of Python NLTK (for further manipulation of NL
requirements)

NLP and Requirements Engineering

Natural Language Processing
Technologies enabling extraction and manipulation of information from
Natural Language (NL)
Dan$Jurafsky$
Language(Technology(
Coreference$resoluIon$
QuesIon$answering$(QA)$
PartOofOspeech$(POS)$tagging$
Word$sense$disambiguaIon$(WSD)$
Paraphrase$
Named$enIty$recogniIon$(NER)$
Parsing$
SummarizaIon$
InformaIon$extracIon$(IE)$
Machine$translaIon$(MT)$
Dialog$
SenIment$analysis$
$$$
mostly$solved$
making$good$progress$
sIll$really$hard$
Spam$detecIon$
Let’s$go$to$Agra!$
Buy$V1AGRA$…$
✓
✗
Colorless$$$green$$$ideas$$$sleep$$$furiously.$
$$$$$ADJ$$$$$$$$$ADJ$$$$NOUN$$VERB$$$$$$ADV$
Einstein$met$with$UN$oﬃcials$in$Princeton$
PERSON$$$$$$$$$$$$$$ORG$$$$$$$$$$$$$$$$$$$$$$LOC$
You’re$invited$to$our$dinner$
party,$Friday$May$27$at$8:30$
Party$
May$27$
add$
Best$roast$chicken$in$San$Francisco!$
The$waiter$ignored$us$for$20$minutes.$
Carter$told$Mubarak$he$shouldn’t$run$again.$
I$need$new$baWeries$for$my$mouse.$
The$13th$Shanghai$InternaIonal$Film$FesIval…$
13 …
The$Dow$Jones$is$up$
Housing$prices$rose$
Economy$is$
good$
Q.$How$eﬀecIve$is$ibuprofen$in$reducing$
fever$in$paIents$with$acute$febrile$illness?$
I$can$see$Alcatraz$from$the$window!$
XYZ$acquired$ABC$yesterday$
ABC$has$been$taken$over$by$XYZ$
Where$is$CiIzen$Kane$playing$in$SF?$$
Castro$Theatre$at$7:30.$Do$
you$want$a$Icket?$
The$S&P500$jumped$
Dan$Jurafsky$
Paraphrase$
Parsing$
SummarizaIon$
Dialog$
SenIment$analysis$
$$$
mostly$solved$
sIll$really$hard$
Spam$detecIon$
Buy$V1AGRA$…$
✓
✗
PERSON$$$$$$$$$$$$$$ORG$$$$$$$$$$$$$$$$$$$$$$LOC$
Party$
May$27$
add$
13 …
Economy$is$
good$
you$want$a$Icket?$
The$S&P500$jumped$
Dan$Jurafsky$
Paraphrase$
Parsing$
SummarizaIon$
Dialog$
SenIment$analysis$
$$$
mostly$solved$
sIll$really$hard$
Spam$detecIon$
Buy$V1AGRA$…$
✓
✗
PERSON$$$$$$$$$$$$$$ORG$$$$$$$$$$$$$$$$$$$$$$LOC$
Party$
May$27$
add$
13 …
Economy$is$
good$
you$want$a$Icket?$
The$S&P500$jumped$

Natural Language Processing
Language is inherently ambiguous and hard to be handled
I Imagine a well-defined syntax and semantics for an input message:
you can write software to process it
I With NL you do not have a well defined syntax and semantics:
there is always some exception
Rule-based approaches are hard to build
Luckily, there’s machine-learning! . . . but
I :( Large amounts of data are needed for training/test/validation
I :( Tagging data requires domain expertise
I :( Requirements use domain-specific jargon (features based on
terms might work only for a specific company)
I :( Requirements structure vary from company to company (features
based on syntax might need adjustments)
I :( Some tasks (e.g., ambiguity) have too many negative cases (i.e.,
well-written), and too few positive (i.e., ambiguous)

4 Dimensions of Natural Language Requirements Processing
Discipline: requirements have to be non-ambiguous but
readable, no equivalent requirement
Dynamism: requirements evolve, and have to be traced and
categorised
Domain Knowledge: domain knowledge has to be gathered to
understand requirements, proper stakeholders have to be
identiﬁed
Data-sets: tagged data-sets are required, requirements are
private

4-D in the Requirements Process
Elicitation Analysis
ValidationManagement
Traceability
Domain Term
Identiﬁcation
Stakeholder
Identiﬁcation
Domain Scoping
Ambiguity
Readability
ComplianceCategorisation
Equivalence
D
I
S
C
I
P
L
I
N
E
DOMAIN KNOWLEDGE
DYNAMISM
DATA-SETS
Retrieval of Internal
Requirements
Market Analysis

Focus on Discipline
An ambiguity in a requirement might cause severe consequences
along the development
False negatives are NOT tolerable (i.e., I have to discover ALL
potential ambiguities)
It is preferable to have a certain amount of false positive cases,
which the engineer can easily discard
Rec = | true ambiguities items retrieved | / | true ambiguities |
Prec = | true ambiguities items retrieved | / | items retrieved |
Hence, a system for ambiguity detection has to provide
100% recall!
Rule-based approaches are preferred
However, for some tasks, rule-based approaches give really too many
false positives.

What you’ll learn
Rule-based approaches
Basic tools that you can use to implement rule-based approaches
These tools can be used as a basis if you want to move to
machine learning

NLP Minimal Concepts and Terminology

NLP Minimal Concepts
Every task in NLP needs to perform Text Normalization:
Segmenting/Tokenizing words in the text
Normalizing word formats (Stemming and Lemmatization)
Segmenting/Tokenizing sentences
Besides that, many tasks need Part-Of-Speech (POS) Tagging

NLP Minimal Concepts - Tokenization (or Word Segmentation)
A token is an occurrence of a WORD in a stream of text
Tokenization split a stream of text into a list of WORDS
Dan(Jurafsky(
How(many(words?(
they(lay(back(on(the(San(Francisco(grass(and(looked(at(the(stars(and(their(
•  Type:(an(element(of(the(vocabulary.(
•  Token:(an(instance(of(that(type(in(running(text.(
•  How(many?(
•  15(tokens((or(14)(
•  13(types((or(12)((or(11?)(

NLP Minimal Concepts - Multi-word Terms
Multi-word Terms
Example: “The wayside equipment shall respond within 5
seconds”
wayside equipment is a multi-word term like San Francisco
We do not discuss about multi-word terms, but keep them in mind
because they are frequent in requirements

NLP Minimal Concepts - Lemmatization
Dan(Jurafsky(
Lemma4za4on(
•  Reduce(inflecGons(or(variant(forms(to(base(form(
•  am,"are,(is"→(be(
•  car,"cars,"car's,(cars'(→(car"
•  the"boy's"cars"are"different"colors(→(the"boy"car"be"different"color"
•  LemmaGzaGon:(have(to(find(correct(dicGonary(headword(form(
•  Machine(translaGon(
•  Spanish(quiero((‘I(want’),(quieres((‘you(want’)(same(lemma(as(querer(‘want’(

NLP Minimal Concepts - Stemming
Dan(Jurafsky(
Stemming(
•  Reduce(terms(to(their(stems(in(informaGon(retrieval(
•  Stemming(is(crude(chopping(of(aﬃxes(
•  language(dependent(
•  e.g.,(automate(s),+automaGc,+automaGon(all(reduced(to(automat.(
for"example"compressed""
and"compression"are"both""
accepted"as"equivalent"to""
compress.(
for(exampl(compress(and(
compress(ar(both(accept(
as(equival(to(compress(

NLP Minimal Concepts - Sentence SegmentationDan(Jurafsky(
Sentence(Segmenta4on(
•  !,(?(are(relaGvely(unambiguous(
•  Period(“.”(is(quite(ambiguous(
•  Sentence(boundary(
•  AbbreviaGons(like(Inc.(or(Dr.(
•  Numbers(like(.02%(or(4.3(
•  Build(a(binary(classiﬁer(
•  Looks(at(a(“.”(
•  Decides(EndOfSentence/NotEndOfSentence(
•  Classiﬁers:(handkwri@en(rules,(regular(expressions,(or(machineklearning(

NLP Minimal Concepts - POS Tagging
Open class (lexical) words
Closed class (functional)
Nouns Verbs
Proper Common
Modals
Main
Adjectives
Adverbs
Prepositions
Particles
Determiners
Conjunctions
Pronouns
… more
… more
IBM
Italy
cat / cats
snow
see
registered
can
had
old older oldest
slowly
to with
off up
the some
and or
he its
Numbers
122,312
one
Interjections Ow Eh

NLP Minimal Concepts - POS TaggingChristopher"Manning"
POS-Tagging-
•  Words"oYen"have"more"than"one"POS:"back%
•  The"back"door"="JJ"
•  On"my"back"="NN"
•  Win"the"voters"back"="RB"
•  Promised"to"back"the"bill"="VB"
•  The"POS"tagging"problem"is"to"determine"the"POS"tag"for"a"
par1cular"instance"of"a"word."
https://www.ling.upenn.edu/courses/Fall_2003/
ling001/penn_treebank_pos.html

GATE Basic Concepts

IR vs IE
Information Retrieval (IR): pulls documents from large corpora
Information Extraction (IE): retrieves facts and structured
information from large corpora
IR returns documents containing the relevant information
somewhere (normally fast)
IE returns precise and structured information (can be slow)
GATE (General Architecture for Text Engineering) supports IE
Potential Usage
Entity, Events, Relation extraction
Enrich features for retrieval
Annotate documents for ML
Ambiguity identiﬁcation with rule-based approaches

Document, Corpus, Annotation, Feature
Document = text + annotations + features
Corpus = collection of documents
Linguistic information in documents is encoded in the form of
annotations (like colored mark-ups)
Annotations have features with relative types and values
Example
Annotation: Sentence Length
Feature 1: Length in Characters, Value 1 = 100
Feature 2: Length in Token, Value 2 = 15

Document, Corpus, Annotation, Feature

Processing Resources (PR)
PR are algorithms that make NLP easy
Document Reset PR - delete Annotations
ANNIE English Tokeniser - identify tokens (words, numbers, etc.)
ANNIE Sentence Splitter - identify sentence boundaries
ANNIE Gazetteer - identify speciﬁc tokens in a list
ANNIE POS Tagger - identify part-of-speech (POS), like name,
adjective, etc.
JAPE Transducers - user-deﬁned annotations based on regular
expressions over annotations
ANNIE collects all the algorithms and run them in PIPELINE

Using Processing Resources in GATE
Open GATE
Language Resources > New > GATE Document
Load Source Document: exercises/requirements.txt
Right-click> New Corpus from this document
Right-click>Processing Resources (then select the resources
reported in the next slides)

ANNIE English Tokenizer
Produces Token annotations

ANNIE Sentence Splitter
Produces Sentence annotations

Gazetteer
Produces Lookup annotations

ANNIE POS Tagger
Modiﬁes the Token annotations with a new feature category = NN, JJ,
VB, etc.

JAPE Transducers
Gazetteer lists are designed for annotating simple, regular
features
Even identifying simple patterns like e-mails is impossible with a
Gazetteer
What is JAPE?
JAPE provides pattern matching in GATE
Each JAPE rule consists of:
I LHS which contains patterns to match
I RHS which details the annotations (and optionally features) to be
created

Example
I want to ﬁnd all the occurrences of the term “level” followed by a
number (level 1, level 2, etc.)

Example (continue)
Making it case-insensitive

Example (continue)
Adding features and values

GATE for Ambiguity Detection

Steps - Detecting Vagueness
Open GATE
Language Resources > New > GATE Document
Load Source Document: exercises/requirements.txt
Right-click> New Corpus from this document
Copy file exercises/vagueness.lst to
/GATE_Developer_8.0/plugins/ANNIE/resources/gazetteer
Open file lists.def (here you have all the file pointers to the lists
checked by the Look-up)
Add the line: vagueness.lst:vague_term

Steps - Detecting Vagueness (continue)
Go back to GATE
Processing Resources>New>ANNIE Gazetteer, and give it a
name
Check: case sensitive = false
Click on the Gazetteer, and check the presence of vagueness.lst
Applications>ANNIE
Move the Gazetteer to the right panel
Run!

Steps - Annotating Vagueness
Processing Resources>New>Jape Transducer
Load annotate_vagueness.jape
Applications>ANNIE
Move the JAPE rule to the right panel
Run!

Coordination Ambiguities
Detect all sentences that include a potential coordination
ambiguity
Coordination ambiguities are of two types:
I Example: The system shall produce the plot or the data-log and the
legend.
I Example: New employees and directors shall be provided with a
user-name and password.

Coordination Ambiguities - Jape Rule 0
Retrieve all occurrences of And / Or
Phase: MatchAndOr
Input: Token
Options: control = appelt
//Operator ==~ returns only whole string matches
//Operator (?i) tells that the matching is case insensitive
//Operator | is the logical "or" operator
Rule: checkAndOr
Priority: 1
(
{Token.string ==~ "(?i)and"} |
{Token.string ==~ "(?i)or"}
):coord
-->
:coord.AndOr = {}

Annotate all the sequences of And/Or in the same sentence
Phase: MatchCoordinationSequences
Input: Split AndOr
//Note that, having Split among the input Annotations allows
//us to identify sequences of And/Or in the same sentence
//The ’+’ operator means "one or more occurrences"
//The ’*’ operator means "zero or more occurrences"
//The ’?’ operator means "zero or one occurrences"
Rule: checkCoordinationSequences
Priority: 1
(
{AndOr}
({AndOr})+
):coordSequence
-->
:coordSequence.AndOrSequence = {}

Annotate all the sentences with And/Or sequences
Phase: MatchCoordinationSentences
Input: Sentence AndOrSequence
//A "contains" B: searches for
//annotations A that contain annotations B
Rule: checkCoordinationSentences
Priority: 1
(
{Sentence contains AndOrSequence}
):coordSentence
-->
:coordSentence.AndOrSentence = {}

Your Turn! - Detect Coordination Ambiguities of the Second Type
Add the Example: new employees and directors shall be
provided with a user-name and password
Remember that ANNIE POS Tagger gives you the category of the
Token
JJ = Adjective, NN = Noun, NNS = Plural Noun
More information on the tag-set, and many other things, can be
found in: https://gate.ac.uk/sale/tao/tao.pdf

Python and NLTK

Python Basic Concepts
Python Online! http://www.learnpython.org/
Tutorial: https://docs.python.org/2/tutorial/

Python - Main Features
Interpreted language
Dynamically typed (variables do not have a predeﬁned type!)
Rich, built-in collection types
I Lists
I Tuples
I Dictionaries
I Sets
Concise
My Humble Opinion
It is perfect if you want to develop something quickly
It is a mess if you want to develop a structured software

Which Python?
Python 2.7: that’s the one we’ll use here
I Last stable release before version 3
I Implements parts of the version 3 features
I Fully backwards compatible
Python 3.0: that’s the one you should use in the future
I Released on Dec. 3, 2008
I Many changes (also incompatible ones)
I . . . But: few important third-party libraries are not yet compatible
with Python 3.0
I Version compatible with NLTK 3.0: either 2.7, or 3.2+
Hint: if you want to use NLTK, go to:
http://www.nltk.org/install.html, and check the links
to the additional libraries and python recommended releases,
before downloading Python

Python - Example
A Code Sample (in IDLE)
x = 34 - 23 # A comment.
y = “Hello” # Another one.
z = 3.45
if z == 3.45 or y == “Hello”:
x = x + 1
y = y + “ World” # String concat.
print x
print y

Python - Enough to Understand the Code
Indentation matters to the meaning of the code
The ﬁrst assignment to a variable creates it (x = 3)
I Variable types do not need to be declared
I Python ﬁgures out the variable types on its own
Logical operators are words (and, or, not), NOT symbols
Simple printing can be done with print
Use a newline to end a line of code
Use when must go to next line prematurely

Python - Assignment
Binding a variable in Python means setting a name to hold a
reference to some object
Hence, assignment creates references, not copies (like in Java)
An object is deleted (by the garbage collector) once it becomes
unreachable

Python - Language Features
Indentation instead of braces
Several sequence types
I Strings (immutable):
’This is a string’, and “Also this is a string”
I Tuples (immutable):
(’the’, ’system’, ’shall’, ’measure’, ’the’, ’heartbeat’)
I Lists (mutable):
[’the’, ’system’, ’shall’, ’measure’, ’the’, ’heartbeat’]
Lists and Tuples can contain anything:
I ((’REQ’, 1), ’the system shall measure the heartbeat’)
I [((’REQ’, 1), ’the system shall measure the heartbeat’), ((’REQ’, 2),
’the system shall alert the nurse’))]

Python - Sequence Types
Tuples are defined using parentheses (and commas).
tu = (23, ‘abc’, 4.56, (2,3), ‘def’)
Lists are defined using square brackets (and commas)
li = [“abc”, 34, 4.34, 23]
Strings are defined using quotes (“, ‘, or “““)
st = “Hello World”
st = ‘Hello World’
st = “““Hello World ”””

Python - Sequence Types
We can access individual members of a tuple, list, or string using
square bracket “array” notation.
We can access individual members of a tuple, lis
using square bracket “array” notation.
Note that all are 0 based…
>>> tu = (23, ‘abc’, 4.56, (2,3), ‘def’)
>>> tu[1] # Second item in the tuple.
‘abc’
>>> li = [“abc”, 34, 4.34, 23]
>>> li[1] # Second item in the list.
34
>>> st = “Hello World”
>>> st[1] # Second character in string.
‘e’

Python - Negative IndicesNegative indices
>>> t = (23, ‘abc’, 4.56, (2,3), ‘def’)
Positive index: count from the left, starting with 0.
>>> t[1]
‘abc’
Negative lookup: count from right, starting with –1.
>>> t[-3]
4.56

Python - Slicing
30
Slicing: Return Copy of a Subset 1
>>> t = (23, ‘abc’, 4.56, (2,3), ‘def’)
Return a copy of the container with a subset of the original
members. Start copying at the first index, and stop copying
before the second index.
>>> t[1:4]
(‘abc’, 4.56, (2,3))
You can also use negative indices when slicing.
>>> t[1:-1]
(‘abc’, 4.56, (2,3))
Optional argument allows selection of every nth item.
>>> t[1:-1:2]
(‘abc’, (2,3))

Python - ‘in’ OperatorThe ‘in’ Operator
Boolean test whether a value is inside a collection (often
called a container in Python:
>>> t = [1, 2, 4, 5]
>>> 3 in t
False
>>> 4 in t
True
>>> 4 not in t
False
For strings, tests for substrings
>>> a = 'abcde'
>>> 'c' in a
True
>>> 'cd' in a
True
>>> 'ac' in a
False
Be careful: the in keyword is also used in the syntax of
for loops and list comprehensions.

Python - ‘+’ Operator
34
The + Operator
The + operator produces a new tuple, list, or string whose
value is the concatenation of its arguments.
Extends concatenation from strings to other types
>>> (1, 2, 3) + (4, 5, 6)
(1, 2, 3, 4, 5, 6)
>>> [1, 2, 3] + [4, 5, 6]
[1, 2, 3, 4, 5, 6]
>>> “Hello” + “ ” + “World”
‘Hello World’

Lists - Mutable
36
Lists: Mutable
>>> li = [‘abc’, 23, 4.34, 23]
>>> li[1] = 45
>>> li
[‘abc’, 45, 4.34, 23]
We can change lists in place.
Name li still points to the same memory reference when
we’re done.

Tuples - Immutable
37
Tuples: Immutable
>>> t = (23, ‘abc’, 4.56, (2,3), ‘def’)
>>> t[2] = 3.14
Traceback (most recent call last):
File "<pyshell#75>", line 1, in -toplevel-
tu[2] = 3.14
TypeError: object doesn't support item assignment
You can’t change a tuple.
You can make a fresh tuple and assign its reference to a previously
used name.
>>> t = (23, ‘abc’, 3.14, (2,3), ‘def’)
The immutability of tuples means they’re faster than lists.

Operations on Lists
38
Operations on Lists Only 1
>>> li = [1, 11, 3, 4, 5]
>>> li.append(‘a’) # Note the method syntax
>>> li
[1, 11, 3, 4, 5, ‘a’]
>>> li.insert(2, ‘i’)
>>>li
[1, 11, ‘i’, 3, 4, 5, ‘a’]

Dictionaries
Dictionaries store a mapping between a set of keys and a set of
values
Keys can be any immutable type
Values can be any type

Dictionaries - Creating and AccessingCreating and accessing
dictionaries
>>> d = {‘user’:‘bozo’, ‘pswd’:1234}
>>> d[‘user’]
‘bozo’
>>> d[‘pswd’]
1234
>>> d[‘bozo’]
Traceback (innermost last):
File ‘<interactive input>’ line 1, in ?
KeyError: bozo

Dictionaries - Updating
Updating Dictionaries
>>> d = {‘user’:‘bozo’, ‘pswd’:1234}
>>> d[‘user’] = ‘clown’
>>> d
{‘user’:‘clown’, ‘pswd’:1234}
Keys must be unique.
Assigning to an existing key replaces its value.
>>> d[‘id’] = 45
>>> d
{‘user’:‘clown’, ‘id’:45, ‘pswd’:1234}
Dictionaries are unordered
New entry might appear anywhere in the output.
(Dictionaries work by hashing)

Dictionaries - Removing
Removing dictionary entries
>>> d = {‘user’:‘bozo’, ‘p’:1234, ‘i’:34}
>>> del d[‘user’] # Remove one. Note that del is
# a function.
>>> d
{‘p’:1234, ‘i’:34}
>>> d.clear() # Remove all.
>>> d
{}
>>> a=[1,2]
>>> del a[1] # (del also works on lists)
>>> a
[1]
47

Dictionaries - Useful Methods
Useful Accessor Methods
>>> d = {‘user’:‘bozo’, ‘p’:1234, ‘i’:34}
>>> d.keys() # List of current keys
[‘user’, ‘p’, ‘i’]
>>> d.values() # List of current values.
[‘bozo’, 1234, 34]
>>> d.items() # List of item tuples.
[(‘user’,‘bozo’), (‘p’,1234), (‘i’,34)]

If Statements (as expected)
if Statements (as expected)
if x == 3:
print "X equals 3."
elif x == 2:
print "X equals 2."
else:
print "X equals something else."
print "This is outside the ‘if’."
Note:
Use of indentation for blocks
Colon (:) after boolean expression
54

While Loops (as expected)
while Loops (as expected)
>>> x = 3
>>> while x < 5:
print x, "still in the loop"
x = x + 1
3 still in the loop
4 still in the loop
>>> x = 6
>>> while x < 5:
print x, "still in the loop"
>>>
55

For Loops (very cool)For Loops 1
For-each is Python’s only form of for loop
A for loop steps through each of the items in a collection
type, or any other type of object which is “iterable”
for <item> in <collection>:
<statements>
If <collection> is a list or a tuple, then the loop steps
through each element of the sequence.
If <collection> is a string, then the loop steps through each
character of the string.
for someChar in “Hello World”:
print someChar
59

For Loops
Items can be more complex than a simple variable name
<statements>
<item> can be more complex than a single
variable name.
If the elements of <collection> are themselves collections,
then <item> can match the structure of the elements. (We
saw something similar with list comprehensions and with
ordinary assignments.)
for (x, y) in [(a,1), (b,2), (c,3), (d,4)]:
print x
60

For Loops and range() Function
For loops and the range() function
We often want to write a loop where the variables ranges
over some sequence of numbers. The range() function
returns a list of numbers from 0 up to but not including the
number we pass to it.
range(5) returns [0,1,2,3,4]
So we can say:
for x in range(5):
print x
(There are several other forms of range() that provide
variants of this functionality…)
xrange() returns an iterator that provides the same
functionality more efficiently
61

For Loops and range() Function
Abuse of the range() function
Don't use range() to iterate over a sequence solely to have
the index and elements available at the same time
Avoid:
for i in range(len(mylist)):
print i, mylist[i]
Instead:
for (i, item) in enumerate(mylist):
print i, item
This is an example of an anti-pattern in Python
For more, see:
http://www.seas.upenn.edu/~lignos/py_antipatterns.html
http://stackoverflow.com/questions/576988/python-specific-antipatterns-and-bad-
practices

List Comprehensions (remarkably cool)
List Comprehensions 1
A powerful feature of the Python language.
Generate a new list by applying a function to every member
of an original list.
Python programmers use list comprehensions extensively.
You’ll see many of them in real code.
[ expression for name in list ]
64

>>> li = [3, 6, 2, 7]
>>> [elem*2 for elem in li]
[6, 12, 4, 14]
Where expression is some calculation or operation
acting upon the variable name.
For each member of the list, the list comprehension
1. sets name equal to that member, and
2. calculates a new value using expression,
It then collects these new values into a list which is the
return value of the list comprehension.
65

If the elements of list are other collections, then
name can be replaced by a collection of names
that match the “shape” of the list members.
>>> li = [(‘a’, 1), (‘b’, 2), (‘c’, 7)]
>>> [ n * 3 for (x, n) in li]
[3, 6, 21]
66

Filtered List Comprehensions (remarkably cool)
Filtered List Comprehension 1
Filter determines whether expression is performed
on each member of the list.
When processing each element of list, first check if
it satisfies the filter condition.
If the filter condition returns False, that element is
omitted from the list before the list comprehension
is evaluated.
[ expression for name in list if filter]
67

Filtered List Comprehensions (remarkably cool)
>>> li = [3, 6, 2, 7, 1, 9]
>>> [elem * 2 for elem in li if elem > 4]
[12, 14, 18]
Only 6, 7, and 9 satisfy the filter condition.
So, only 12, 14, and 18 are produced.
Filtered List Comprehension 2
[ expression for name in list if filter]

Functions
First line with less
indentation is considered to be
outside of the function definition.
Defining Functions
No declaration of types of arguments or result
def get_final_answer(filename):
"""Documentation String"""
line1
line2
return total_counter
...
Function definition begins with def Function name and its arguments.
‘return’ indicates the
value to be sent back to the caller.
Colon.
72

Function Call
Calling a Function
>>> def myfun(x, y):
return x * y
>>> myfun(3, 4)
12
73

Import
87
Import I
import somefile
Everything in somefile.py can be referred to by:
somefile.className.method(“abc”)
somefile.myFunction(34)
from somefile import *
Everything in somefile.py can be referred to by:
className.method(“abc”)
myFunction(34)
Careful! This can overwrite the definition of an existing
function or variable!

Python NLTK Basic Concepts
http://textminingonline.com/
dive-into-nltk-part-i-getting-started-with-nltk

Python NLTK Main Features
Sentence and word tokenization
POS tagging
Chunking and Named Entity Recognition (NER)
Text Classiﬁcation
Many included corpora

Python NLTK - sentence tokenization (splitting)
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("hello this is nltk")
[’hello this is nltk’]
>>> sent_tokenize("hello. this is nltk")
[’hello.’, ’this is nltk’]

Python NLTK - word tokenization
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("this is nltk")
[’this’, ’is’, ’nltk’]

Python NLTK - word tokenization with punctuation
>>> word_tokenize("The woman’s car")
[’The’, ’woman’, "’s", ’car’]
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize("The woman’s car")
[’The’, ’woman’, "’", ’s’, ’car’]

Python NLTK - stemming
>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem(’requirements’)
’requir’

Python NLTK - lemmatizer
>>> from nltk.stem import WordNetLemmatizer
>>> wordnet_lemmatizer = WordNetLemmatizer()
>>> wordnet_lemmatizer.lemmatize(’requirements’)
’requirement’
>>> wordnet_lemmatizer.lemmatize(’is’)
’is’
>>> wordnet_lemmatizer.lemmatize(’is’, pos=’v’)
’be’

Python NLTK - POS tagging
>>> t = word_tokenize("The system shall be nice")
>>> from nltk import pos_tag
>>> pos_tag(t)
[(’The’, ’DT’), (’system’, ’NN’),
(’shall’, ’MD’), (’be’, ’VB’), (’nice’, ’JJ’)]

Python NLTK - POS tagging
How can I know the meaning of the tags?
>>> import nltk
>>> nltk.help.upenn_tagset(’DT’)
DT: determiner
all an another any both del each either every half la many much nary
neither no some such that the them these this those
>>> nltk.help.upenn_tagset(’NN’)
NN: noun, common, singular or mass
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humour falloff slick wind hyena override subhumanity
machinist ...
>>> nltk.help.upenn_tagset(’NNP’)
NNP: noun, proper, singular
Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
Shannon A.K.C. Meltex Liverpool ...

Python NLTK - stopword removal
>>> from nltk.corpus import stopwords
>>> stoplist = stopwords.words(’english’)
>>> sentence = ’The system shall be usable’
>>> print [i for i in word_tokenize(sentence) if i not in stoplist]
[’The’, ’system’, ’shall’, ’usable’]
>>> print [i for i in word_tokenize(sentence.lower())
if i not in stoplist]
[’system’, ’shall’, ’usable’]

When to use What
GATE is good to annotate documents, and ﬁnd patterns
Python NLTK is good for fast prototyping of analysis tools
If you have to extract information from text: use GATE
If you have to do more ﬁne-grained analyses, and you do not need
to build a well-structured program: use Python NLTK
If you have to build a structured program: study the OpenNLP
library (Java)

Requirements Engineering: focus on Natural Language Processing, Lecture 2

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Requirements Engineering: focus on Natural Language Processing, Lecture 2

Similar to Requirements Engineering: focus on Natural Language Processing, Lecture 2 (20)

More from alessio_ferrari

More from alessio_ferrari (8)

Recently uploaded

Recently uploaded (20)

Requirements Engineering: focus on Natural Language Processing, Lecture 2