Requirements Engineering:
Focus on Natural Language
Requirements Processing
Alessio Ferrari1
1ISTI-CNR, Pisa, Italy
January 15, 2016
A. Ferrari (ISTI) Requirements Engineering 1 / 88
cf. https://docplayer.net/16051752-Introduction-to-programming-languages-and-
techniques-xkcd-com-full-python-tutorial.html
Objectives of the Course
Objectives: lesson 1
Understand what are requirements, specifications
and domain assertions
Understand the problems behind requirements elicitation
and SRS definition
Learn how to write a decent SRS
Objectives: lesson 2
Learn the basics of Natual Language Processing (NLP)
Learn the relation between NLP and Requirements
Learn the basics of GATE to detect NL ambiguities in
requirements
Learn the basics of Python NLTK (for further manipulation of NL
requirements)
A. Ferrari (ISTI) Requirements Engineering 2 / 88
NLP and Requirements Engineering
A. Ferrari (ISTI) Requirements Engineering 3 / 88
Natural Language Processing
Technologies enabling extraction and manipulation of information from
Natural Language (NL)
Dan$Jurafsky$
Language(Technology(
Coreference$resoluIon$
QuesIon$answering$(QA)$
PartOofOspeech$(POS)$tagging$
Word$sense$disambiguaIon$(WSD)$
Paraphrase$
Named$enIty$recogniIon$(NER)$
Parsing$
SummarizaIon$
InformaIon$extracIon$(IE)$
Machine$translaIon$(MT)$
Dialog$
SenIment$analysis$
$$$
mostly$solved$
making$good$progress$
sIll$really$hard$
Spam$detecIon$
Let’s$go$to$Agra!$
Buy$V1AGRA$…$
✓
✗
Colorless$$$green$$$ideas$$$sleep$$$furiously.$
$$$$$ADJ$$$$$$$$$ADJ$$$$NOUN$$VERB$$$$$$ADV$
Einstein$met$with$UN$officials$in$Princeton$
PERSON$$$$$$$$$$$$$$ORG$$$$$$$$$$$$$$$$$$$$$$LOC$
You’re$invited$to$our$dinner$
party,$Friday$May$27$at$8:30$
Party$
May$27$
add$
Best$roast$chicken$in$San$Francisco!$
The$waiter$ignored$us$for$20$minutes.$
Carter$told$Mubarak$he$shouldn’t$run$again.$
I$need$new$baWeries$for$my$mouse.$
The$13th$Shanghai$InternaIonal$Film$FesIval…$
13 …
The$Dow$Jones$is$up$
Housing$prices$rose$
Economy$is$
good$
Q.$How$effecIve$is$ibuprofen$in$reducing$
fever$in$paIents$with$acute$febrile$illness?$
I$can$see$Alcatraz$from$the$window!$
XYZ$acquired$ABC$yesterday$
ABC$has$been$taken$over$by$XYZ$
Where$is$CiIzen$Kane$playing$in$SF?$$
Castro$Theatre$at$7:30.$Do$
you$want$a$Icket?$
The$S&P500$jumped$
Dan$Jurafsky$
Language(Technology(
Coreference$resoluIon$
QuesIon$answering$(QA)$
PartOofOspeech$(POS)$tagging$
Word$sense$disambiguaIon$(WSD)$
Paraphrase$
Named$enIty$recogniIon$(NER)$
Parsing$
SummarizaIon$
InformaIon$extracIon$(IE)$
Machine$translaIon$(MT)$
Dialog$
SenIment$analysis$
$$$
mostly$solved$
making$good$progress$
sIll$really$hard$
Spam$detecIon$
Let’s$go$to$Agra!$
Buy$V1AGRA$…$
✓
✗
Colorless$$$green$$$ideas$$$sleep$$$furiously.$
$$$$$ADJ$$$$$$$$$ADJ$$$$NOUN$$VERB$$$$$$ADV$
Einstein$met$with$UN$officials$in$Princeton$
PERSON$$$$$$$$$$$$$$ORG$$$$$$$$$$$$$$$$$$$$$$LOC$
You’re$invited$to$our$dinner$
party,$Friday$May$27$at$8:30$
Party$
May$27$
add$
Best$roast$chicken$in$San$Francisco!$
The$waiter$ignored$us$for$20$minutes.$
Carter$told$Mubarak$he$shouldn’t$run$again.$
I$need$new$baWeries$for$my$mouse.$
The$13th$Shanghai$InternaIonal$Film$FesIval…$
13 …
The$Dow$Jones$is$up$
Housing$prices$rose$
Economy$is$
good$
Q.$How$effecIve$is$ibuprofen$in$reducing$
fever$in$paIents$with$acute$febrile$illness?$
I$can$see$Alcatraz$from$the$window!$
XYZ$acquired$ABC$yesterday$
ABC$has$been$taken$over$by$XYZ$
Where$is$CiIzen$Kane$playing$in$SF?$$
Castro$Theatre$at$7:30.$Do$
you$want$a$Icket?$
The$S&P500$jumped$
Dan$Jurafsky$
Language(Technology(
Coreference$resoluIon$
QuesIon$answering$(QA)$
PartOofOspeech$(POS)$tagging$
Word$sense$disambiguaIon$(WSD)$
Paraphrase$
Named$enIty$recogniIon$(NER)$
Parsing$
SummarizaIon$
InformaIon$extracIon$(IE)$
Machine$translaIon$(MT)$
Dialog$
SenIment$analysis$
$$$
mostly$solved$
making$good$progress$
sIll$really$hard$
Spam$detecIon$
Let’s$go$to$Agra!$
Buy$V1AGRA$…$
✓
✗
Colorless$$$green$$$ideas$$$sleep$$$furiously.$
$$$$$ADJ$$$$$$$$$ADJ$$$$NOUN$$VERB$$$$$$ADV$
Einstein$met$with$UN$officials$in$Princeton$
PERSON$$$$$$$$$$$$$$ORG$$$$$$$$$$$$$$$$$$$$$$LOC$
You’re$invited$to$our$dinner$
party,$Friday$May$27$at$8:30$
Party$
May$27$
add$
Best$roast$chicken$in$San$Francisco!$
The$waiter$ignored$us$for$20$minutes.$
Carter$told$Mubarak$he$shouldn’t$run$again.$
I$need$new$baWeries$for$my$mouse.$
The$13th$Shanghai$InternaIonal$Film$FesIval…$
13 …
The$Dow$Jones$is$up$
Housing$prices$rose$
Economy$is$
good$
Q.$How$effecIve$is$ibuprofen$in$reducing$
fever$in$paIents$with$acute$febrile$illness?$
I$can$see$Alcatraz$from$the$window!$
XYZ$acquired$ABC$yesterday$
ABC$has$been$taken$over$by$XYZ$
Where$is$CiIzen$Kane$playing$in$SF?$$
Castro$Theatre$at$7:30.$Do$
you$want$a$Icket?$
The$S&P500$jumped$
A. Ferrari (ISTI) Requirements Engineering 4 / 88
Natural Language Processing
Language is inherently ambiguous and hard to be handled
I Imagine a well-defined syntax and semantics for an input message:
you can write software to process it
I With NL you do not have a well defined syntax and semantics:
there is always some exception
Rule-based approaches are hard to build
Luckily, there’s machine-learning! . . . but
I :( Large amounts of data are needed for training/test/validation
I :( Tagging data requires domain expertise
I :( Requirements use domain-specific jargon (features based on
terms might work only for a specific company)
I :( Requirements structure vary from company to company (features
based on syntax might need adjustments)
I :( Some tasks (e.g., ambiguity) have too many negative cases (i.e.,
well-written), and too few positive (i.e., ambiguous)
A. Ferrari (ISTI) Requirements Engineering 5 / 88
4 Dimensions of Natural Language Requirements Processing
Discipline: requirements have to be non-ambiguous but
readable, no equivalent requirement
Dynamism: requirements evolve, and have to be traced and
categorised
Domain Knowledge: domain knowledge has to be gathered to
understand requirements, proper stakeholders have to be
identified
Data-sets: tagged data-sets are required, requirements are
private
A. Ferrari (ISTI) Requirements Engineering 6 / 88
4-D in the Requirements Process
Elicitation Analysis
ValidationManagement
Traceability
Domain Term
Identification
Stakeholder
Identification
Domain Scoping
Ambiguity
Readability
ComplianceCategorisation
Equivalence
D
I
S
C
I
P
L
I
N
E
DOMAIN KNOWLEDGE
DYNAMISM
DATA-SETS
Retrieval of Internal
Requirements
Market Analysis
A. Ferrari (ISTI) Requirements Engineering 7 / 88
Focus on Discipline
An ambiguity in a requirement might cause severe consequences
along the development
False negatives are NOT tolerable (i.e., I have to discover ALL
potential ambiguities)
It is preferable to have a certain amount of false positive cases,
which the engineer can easily discard
Rec = | true ambiguities  items retrieved | / | true ambiguities |
Prec = | true ambiguities  items retrieved | / | items retrieved |
Hence, a system for ambiguity detection has to provide
100% recall!
Rule-based approaches are preferred
However, for some tasks, rule-based approaches give really too many
false positives.
A. Ferrari (ISTI) Requirements Engineering 8 / 88
What you’ll learn
Rule-based approaches
Basic tools that you can use to implement rule-based approaches
These tools can be used as a basis if you want to move to
machine learning
A. Ferrari (ISTI) Requirements Engineering 9 / 88
NLP Minimal Concepts and Terminology
A. Ferrari (ISTI) Requirements Engineering 10 / 88
NLP Minimal Concepts
Every task in NLP needs to perform Text Normalization:
Segmenting/Tokenizing words in the text
Normalizing word formats (Stemming and Lemmatization)
Segmenting/Tokenizing sentences
Besides that, many tasks need Part-Of-Speech (POS) Tagging
A. Ferrari (ISTI) Requirements Engineering 11 / 88
NLP Minimal Concepts - Tokenization (or Word Segmentation)
A token is an occurrence of a WORD in a stream of text
Tokenization split a stream of text into a list of WORDS
Dan(Jurafsky(
How(many(words?(
they(lay(back(on(the(San(Francisco(grass(and(looked(at(the(stars(and(their(
•  Type:(an(element(of(the(vocabulary.(
•  Token:(an(instance(of(that(type(in(running(text.(
•  How(many?(
•  15(tokens((or(14)(
•  13(types((or(12)((or(11?)(
A. Ferrari (ISTI) Requirements Engineering 12 / 88
NLP Minimal Concepts - Multi-word Terms
Multi-word Terms
Example: “The wayside equipment shall respond within 5
seconds”
wayside equipment is a multi-word term like San Francisco
We do not discuss about multi-word terms, but keep them in mind
because they are frequent in requirements
A. Ferrari (ISTI) Requirements Engineering 13 / 88
NLP Minimal Concepts - Lemmatization
Dan(Jurafsky(
Lemma4za4on(
•  Reduce(inflecGons(or(variant(forms(to(base(form(
•  am,"are,(is"→(be(
•  car,"cars,"car's,(cars'(→(car"
•  the"boy's"cars"are"different"colors(→(the"boy"car"be"different"color"
•  LemmaGzaGon:(have(to(find(correct(dicGonary(headword(form(
•  Machine(translaGon(
•  Spanish(quiero((‘I(want’),(quieres((‘you(want’)(same(lemma(as(querer(‘want’(
A. Ferrari (ISTI) Requirements Engineering 14 / 88
NLP Minimal Concepts - Stemming
Dan(Jurafsky(
Stemming(
•  Reduce(terms(to(their(stems(in(informaGon(retrieval(
•  Stemming(is(crude(chopping(of(affixes(
•  language(dependent(
•  e.g.,(automate(s),+automaGc,+automaGon(all(reduced(to(automat.(
for"example"compressed""
and"compression"are"both""
accepted"as"equivalent"to""
compress.(
for(exampl(compress(and(
compress(ar(both(accept(
as(equival(to(compress(
A. Ferrari (ISTI) Requirements Engineering 15 / 88
NLP Minimal Concepts - Sentence SegmentationDan(Jurafsky(
Sentence(Segmenta4on(
•  !,(?(are(relaGvely(unambiguous(
•  Period(“.”(is(quite(ambiguous(
•  Sentence(boundary(
•  AbbreviaGons(like(Inc.(or(Dr.(
•  Numbers(like(.02%(or(4.3(
•  Build(a(binary(classifier(
•  Looks(at(a(“.”(
•  Decides(EndOfSentence/NotEndOfSentence(
•  Classifiers:(handkwri@en(rules,(regular(expressions,(or(machineklearning(
A. Ferrari (ISTI) Requirements Engineering 16 / 88
NLP Minimal Concepts - POS Tagging
Open class (lexical) words
Closed class (functional)
Nouns Verbs
Proper Common
Modals
Main
Adjectives
Adverbs
Prepositions
Particles
Determiners
Conjunctions
Pronouns
… more
… more
IBM
Italy
cat / cats
snow
see
registered
can
had
old older oldest
slowly
to with
off up
the some
and or
he its
Numbers
122,312
one
Interjections Ow Eh
A. Ferrari (ISTI) Requirements Engineering 17 / 88
NLP Minimal Concepts - POS TaggingChristopher"Manning"
POS-Tagging-
•  Words"oYen"have"more"than"one"POS:"back%
•  The"back"door"="JJ"
•  On"my"back"="NN"
•  Win"the"voters"back"="RB"
•  Promised"to"back"the"bill"="VB"
•  The"POS"tagging"problem"is"to"determine"the"POS"tag"for"a"
par1cular"instance"of"a"word."
https://www.ling.upenn.edu/courses/Fall_2003/
ling001/penn_treebank_pos.html
A. Ferrari (ISTI) Requirements Engineering 18 / 88
GATE Basic Concepts
A. Ferrari (ISTI) Requirements Engineering 19 / 88
IR vs IE
Information Retrieval (IR): pulls documents from large corpora
Information Extraction (IE): retrieves facts and structured
information from large corpora
IR returns documents containing the relevant information
somewhere (normally fast)
IE returns precise and structured information (can be slow)
GATE (General Architecture for Text Engineering) supports IE
Potential Usage
Entity, Events, Relation extraction
Enrich features for retrieval
Annotate documents for ML
Ambiguity identification with rule-based approaches
A. Ferrari (ISTI) Requirements Engineering 20 / 88
Document, Corpus, Annotation, Feature
Document = text + annotations + features
Corpus = collection of documents
Linguistic information in documents is encoded in the form of
annotations (like colored mark-ups)
Annotations have features with relative types and values
Example
Annotation: Sentence Length
Feature 1: Length in Characters, Value 1 = 100
Feature 2: Length in Token, Value 2 = 15
A. Ferrari (ISTI) Requirements Engineering 21 / 88
Document, Corpus, Annotation, Feature
A. Ferrari (ISTI) Requirements Engineering 22 / 88
Processing Resources (PR)
PR are algorithms that make NLP easy
Document Reset PR - delete Annotations
ANNIE English Tokeniser - identify tokens (words, numbers, etc.)
ANNIE Sentence Splitter - identify sentence boundaries
ANNIE Gazetteer - identify specific tokens in a list
ANNIE POS Tagger - identify part-of-speech (POS), like name,
adjective, etc.
JAPE Transducers - user-defined annotations based on regular
expressions over annotations
ANNIE collects all the algorithms and run them in PIPELINE
A. Ferrari (ISTI) Requirements Engineering 23 / 88
Using Processing Resources in GATE
Open GATE
Language Resources > New > GATE Document
Load Source Document: exercises/requirements.txt
Right-click> New Corpus from this document
Right-click>Processing Resources (then select the resources
reported in the next slides)
A. Ferrari (ISTI) Requirements Engineering 24 / 88
ANNIE English Tokenizer
Produces Token annotations
A. Ferrari (ISTI) Requirements Engineering 25 / 88
ANNIE Sentence Splitter
Produces Sentence annotations
A. Ferrari (ISTI) Requirements Engineering 26 / 88
Gazetteer
Produces Lookup annotations
A. Ferrari (ISTI) Requirements Engineering 27 / 88
ANNIE POS Tagger
Modifies the Token annotations with a new feature category = NN, JJ,
VB, etc.
A. Ferrari (ISTI) Requirements Engineering 28 / 88
JAPE Transducers
Gazetteer lists are designed for annotating simple, regular
features
Even identifying simple patterns like e-mails is impossible with a
Gazetteer
What is JAPE?
JAPE provides pattern matching in GATE
Each JAPE rule consists of:
I LHS which contains patterns to match
I RHS which details the annotations (and optionally features) to be
created
A. Ferrari (ISTI) Requirements Engineering 29 / 88
Example
I want to find all the occurrences of the term “level” followed by a
number (level 1, level 2, etc.)
A. Ferrari (ISTI) Requirements Engineering 30 / 88
Example (continue)
Making it case-insensitive
A. Ferrari (ISTI) Requirements Engineering 31 / 88
Example (continue)
Adding features and values
A. Ferrari (ISTI) Requirements Engineering 32 / 88
GATE for Ambiguity Detection
A. Ferrari (ISTI) Requirements Engineering 33 / 88
Steps - Detecting Vagueness
Open GATE
Language Resources > New > GATE Document
Load Source Document: exercises/requirements.txt
Right-click> New Corpus from this document
Copy file exercises/vagueness.lst to
/GATE_Developer_8.0/plugins/ANNIE/resources/gazetteer
Open file lists.def (here you have all the file pointers to the lists
checked by the Look-up)
Add the line: vagueness.lst:vague_term
A. Ferrari (ISTI) Requirements Engineering 34 / 88
Steps - Detecting Vagueness (continue)
Go back to GATE
Processing Resources>New>ANNIE Gazetteer, and give it a
name
Check: case sensitive = false
Click on the Gazetteer, and check the presence of vagueness.lst
Applications>ANNIE
Move the Gazetteer to the right panel
Run!
A. Ferrari (ISTI) Requirements Engineering 35 / 88
Steps - Annotating Vagueness
Processing Resources>New>Jape Transducer
Load annotate_vagueness.jape
Applications>ANNIE
Move the JAPE rule to the right panel
Run!
A. Ferrari (ISTI) Requirements Engineering 36 / 88
Coordination Ambiguities
Detect all sentences that include a potential coordination
ambiguity
Coordination ambiguities are of two types:
I Example: The system shall produce the plot or the data-log and the
legend.
I Example: New employees and directors shall be provided with a
user-name and password.
A. Ferrari (ISTI) Requirements Engineering 37 / 88
Coordination Ambiguities - Jape Rule 0
Retrieve all occurrences of And / Or
Phase: MatchAndOr
Input: Token
Options: control = appelt
//Operator ==~ returns only whole string matches
//Operator (?i) tells that the matching is case insensitive
//Operator | is the logical "or" operator
Rule: checkAndOr
Priority: 1
(
{Token.string ==~ "(?i)and"} |
{Token.string ==~ "(?i)or"}
):coord
-->
:coord.AndOr = {}
A. Ferrari (ISTI) Requirements Engineering 38 / 88
Coordination Ambiguities - Jape Rule 1
Annotate all the sequences of And/Or in the same sentence
Phase: MatchCoordinationSequences
Input: Split AndOr
Options: control = appelt
//Note that, having Split among the input Annotations allows
//us to identify sequences of And/Or in the same sentence
//The ’+’ operator means "one or more occurrences"
//The ’*’ operator means "zero or more occurrences"
//The ’?’ operator means "zero or one occurrences"
Rule: checkCoordinationSequences
Priority: 1
(
{AndOr}
({AndOr})+
):coordSequence
-->
:coordSequence.AndOrSequence = {}
A. Ferrari (ISTI) Requirements Engineering 39 / 88
Coordination Ambiguities - Jape Rule 2
Annotate all the sentences with And/Or sequences
Phase: MatchCoordinationSentences
Input: Sentence AndOrSequence
Options: control = appelt
//A "contains" B: searches for
//annotations A that contain annotations B
Rule: checkCoordinationSentences
Priority: 1
(
{Sentence contains AndOrSequence}
):coordSentence
-->
:coordSentence.AndOrSentence = {}
A. Ferrari (ISTI) Requirements Engineering 40 / 88
Your Turn! - Detect Coordination Ambiguities of the Second Type
Add the Example: new employees and directors shall be
provided with a user-name and password
Remember that ANNIE POS Tagger gives you the category of the
Token
JJ = Adjective, NN = Noun, NNS = Plural Noun
More information on the tag-set, and many other things, can be
found in: https://gate.ac.uk/sale/tao/tao.pdf
A. Ferrari (ISTI) Requirements Engineering 41 / 88
Python and NLTK
A. Ferrari (ISTI) Requirements Engineering 42 / 88
Python Basic Concepts
Python Online! http://www.learnpython.org/
Tutorial: https://docs.python.org/2/tutorial/
A. Ferrari (ISTI) Requirements Engineering 43 / 88
Python - Main Features
Interpreted language
Dynamically typed (variables do not have a predefined type!)
Rich, built-in collection types
I Lists
I Tuples
I Dictionaries
I Sets
Concise
My Humble Opinion
It is perfect if you want to develop something quickly
It is a mess if you want to develop a structured software
A. Ferrari (ISTI) Requirements Engineering 44 / 88
Which Python?
Python 2.7: that’s the one we’ll use here
I Last stable release before version 3
I Implements parts of the version 3 features
I Fully backwards compatible
Python 3.0: that’s the one you should use in the future
I Released on Dec. 3, 2008
I Many changes (also incompatible ones)
I . . . But: few important third-party libraries are not yet compatible
with Python 3.0
I Version compatible with NLTK 3.0: either 2.7, or 3.2+
Hint: if you want to use NLTK, go to:
http://www.nltk.org/install.html, and check the links
to the additional libraries and python recommended releases,
before downloading Python
A. Ferrari (ISTI) Requirements Engineering 45 / 88
Python - Example
A Code Sample (in IDLE)
x = 34 - 23 # A comment.
y = “Hello” # Another one.
z = 3.45
if z == 3.45 or y == “Hello”:
x = x + 1
y = y + “ World” # String concat.
print x
print y
A. Ferrari (ISTI) Requirements Engineering 46 / 88
Python - Enough to Understand the Code
Indentation matters to the meaning of the code
The first assignment to a variable creates it (x = 3)
I Variable types do not need to be declared
I Python figures out the variable types on its own
Logical operators are words (and, or, not), NOT symbols
Simple printing can be done with print
Use a newline to end a line of code
Use  when must go to next line prematurely
A. Ferrari (ISTI) Requirements Engineering 47 / 88
Python - Assignment
Binding a variable in Python means setting a name to hold a
reference to some object
Hence, assignment creates references, not copies (like in Java)
An object is deleted (by the garbage collector) once it becomes
unreachable
A. Ferrari (ISTI) Requirements Engineering 48 / 88
Python - Language Features
Indentation instead of braces
Several sequence types
I Strings (immutable):
’This is a string’, and “Also this is a string”
I Tuples (immutable):
(’the’, ’system’, ’shall’, ’measure’, ’the’, ’heartbeat’)
I Lists (mutable):
[’the’, ’system’, ’shall’, ’measure’, ’the’, ’heartbeat’]
Lists and Tuples can contain anything:
I ((’REQ’, 1), ’the system shall measure the heartbeat’)
I [((’REQ’, 1), ’the system shall measure the heartbeat’), ((’REQ’, 2),
’the system shall alert the nurse’))]
A. Ferrari (ISTI) Requirements Engineering 49 / 88
Python - Sequence Types
Tuples are defined using parentheses (and commas).
tu = (23, ‘abc’, 4.56, (2,3), ‘def’)
Lists are defined using square brackets (and commas)
li = [“abc”, 34, 4.34, 23]
Strings are defined using quotes (“, ‘, or “““)
st = “Hello World”
st = ‘Hello World’
st = “““Hello World ”””
A. Ferrari (ISTI) Requirements Engineering 50 / 88
Python - Sequence Types
We can access individual members of a tuple, list, or string using
square bracket “array” notation.
We can access individual members of a tuple, lis
using  square  bracket  “array”  notation.  
Note  that  all  are  0  based…  
>>> tu = (23, ‘abc’, 4.56, (2,3), ‘def’)
>>> tu[1] # Second item in the tuple.
‘abc’
>>> li = [“abc”, 34, 4.34, 23]
>>> li[1] # Second item in the list.
34
>>> st = “Hello  World”
>>> st[1] # Second character in string.
‘e’
A. Ferrari (ISTI) Requirements Engineering 51 / 88
Python - Negative IndicesNegative indices
>>> t = (23, ‘abc’, 4.56, (2,3), ‘def’)
Positive index: count from the left, starting with 0.
>>> t[1]
‘abc’
Negative lookup: count from right, starting with –1.
>>> t[-3]
4.56
A. Ferrari (ISTI) Requirements Engineering 52 / 88
Python - Slicing
30
Slicing: Return Copy of a Subset 1
>>> t = (23, ‘abc’, 4.56, (2,3), ‘def’)
Return a copy of the container with a subset of the original
members. Start copying at the first index, and stop copying
before the second index.
>>> t[1:4]
(‘abc’,  4.56,  (2,3))
You can also use negative indices when slicing.
>>> t[1:-1]
(‘abc’,  4.56,  (2,3))
Optional argument allows selection of every nth item.
>>> t[1:-1:2]
(‘abc’,  (2,3))
A. Ferrari (ISTI) Requirements Engineering 53 / 88
Python - ‘in’ OperatorThe  ‘in’  Operator
Boolean test whether a value is inside a collection (often
called a container in Python:
>>> t = [1, 2, 4, 5]
>>> 3 in t
False
>>> 4 in t
True
>>> 4 not in t
False
For strings, tests for substrings
>>> a = 'abcde'
>>> 'c' in a
True
>>> 'cd' in a
True
>>> 'ac' in a
False
Be careful: the in keyword is also used in the syntax of
for loops and list comprehensions.
A. Ferrari (ISTI) Requirements Engineering 54 / 88
Python - ‘+’ Operator
34
The + Operator
The + operator produces a new tuple, list, or string whose
value is the concatenation of its arguments.
Extends concatenation from strings to other types
>>> (1, 2, 3) + (4, 5, 6)
(1, 2, 3, 4, 5, 6)
>>> [1, 2, 3] + [4, 5, 6]
[1, 2, 3, 4, 5, 6]
>>> “Hello”  + “  ”  + “World”
‘Hello  World’
A. Ferrari (ISTI) Requirements Engineering 55 / 88
Lists - Mutable
36
Lists: Mutable
>>> li = [‘abc’, 23, 4.34, 23]
>>> li[1] = 45
>>> li
[‘abc’,  45,  4.34,  23]
We can change lists in place.
Name li still points to the same memory reference when
we’re  done.  
A. Ferrari (ISTI) Requirements Engineering 56 / 88
Tuples - Immutable
37
Tuples: Immutable
>>> t = (23, ‘abc’, 4.56, (2,3), ‘def’)
>>> t[2] = 3.14
Traceback (most recent call last):
File "<pyshell#75>", line 1, in -toplevel-
tu[2] = 3.14
TypeError: object doesn't support item assignment
You  can’t  change  a  tuple.  
You can make a fresh tuple and assign its reference to a previously
used name.
>>> t = (23, ‘abc’, 3.14, (2,3), ‘def’)
The  immutability  of  tuples  means  they’re  faster  than  lists.  
A. Ferrari (ISTI) Requirements Engineering 57 / 88
Operations on Lists
38
Operations on Lists Only 1
>>> li = [1, 11, 3, 4, 5]
>>> li.append(‘a’) # Note the method syntax
>>> li
[1,  11,  3,  4,  5,  ‘a’]
>>> li.insert(2,  ‘i’)
>>>li
[1,  11,  ‘i’,  3,  4,  5,  ‘a’]
A. Ferrari (ISTI) Requirements Engineering 58 / 88
Dictionaries
Dictionaries store a mapping between a set of keys and a set of
values
Keys can be any immutable type
Values can be any type
A. Ferrari (ISTI) Requirements Engineering 59 / 88
Dictionaries - Creating and AccessingCreating and accessing
dictionaries
>>> d = {‘user’:‘bozo’, ‘pswd’:1234}
>>> d[‘user’]
‘bozo’
>>> d[‘pswd’]
1234
>>> d[‘bozo’]
Traceback (innermost last):
File  ‘<interactive  input>’  line  1,  in  ?
KeyError: bozo
A. Ferrari (ISTI) Requirements Engineering 60 / 88
Dictionaries - Updating
Updating Dictionaries
>>> d = {‘user’:‘bozo’, ‘pswd’:1234}
>>> d[‘user’] = ‘clown’
>>> d
{‘user’:‘clown’,  ‘pswd’:1234}
Keys must be unique.
Assigning to an existing key replaces its value.
>>> d[‘id’] = 45
>>> d
{‘user’:‘clown’,  ‘id’:45,  ‘pswd’:1234}
Dictionaries are unordered
New entry might appear anywhere in the output.
(Dictionaries work by hashing)
A. Ferrari (ISTI) Requirements Engineering 61 / 88
Dictionaries - Removing
Removing dictionary entries
>>> d = {‘user’:‘bozo’, ‘p’:1234, ‘i’:34}
>>> del d[‘user’] # Remove one. Note that del is
# a function.
>>> d
{‘p’:1234,  ‘i’:34}
>>> d.clear() # Remove all.
>>> d
{}
>>> a=[1,2]
>>> del a[1] # (del also works on lists)
>>> a
[1]
47
A. Ferrari (ISTI) Requirements Engineering 62 / 88
Dictionaries - Useful Methods
Useful Accessor Methods
>>> d = {‘user’:‘bozo’, ‘p’:1234, ‘i’:34}
>>> d.keys() # List of current keys
[‘user’,  ‘p’,  ‘i’]
>>> d.values() # List of current values.
[‘bozo’,  1234,  34]
>>> d.items() # List of item tuples.
[(‘user’,‘bozo’),  (‘p’,1234),  (‘i’,34)]
A. Ferrari (ISTI) Requirements Engineering 63 / 88
If Statements (as expected)
if Statements (as expected)
if x == 3:
print "X equals 3."
elif x == 2:
print "X equals 2."
else:
print "X equals something else."
print "This  is  outside  the  ‘if’."
Note:
Use of indentation for blocks
Colon (:) after boolean expression
54
A. Ferrari (ISTI) Requirements Engineering 64 / 88
While Loops (as expected)
while Loops (as expected)
>>> x = 3
>>> while x < 5:
print x, "still in the loop"
x = x + 1
3 still in the loop
4 still in the loop
>>> x = 6
>>> while x < 5:
print x, "still in the loop"
>>>
55
A. Ferrari (ISTI) Requirements Engineering 65 / 88
For Loops (very cool)For Loops 1
For-each  is  Python’s  only form of for loop
A for loop steps through each of the items in a collection
type,  or  any  other  type  of  object  which  is  “iterable”
for <item> in <collection>:
<statements>
If <collection> is a list or a tuple, then the loop steps
through each element of the sequence.
If <collection> is a string, then the loop steps through each
character of the string.
for someChar in “Hello  World”:
print someChar
59
A. Ferrari (ISTI) Requirements Engineering 66 / 88
For Loops
Items can be more complex than a simple variable name
<statements>
<item> can be more complex than a single
variable name.
If the elements of <collection> are themselves collections,
then <item> can match the structure of the elements. (We
saw something similar with list comprehensions and with
ordinary assignments.)
for (x, y) in [(a,1), (b,2), (c,3), (d,4)]:
print x
60
A. Ferrari (ISTI) Requirements Engineering 67 / 88
For Loops and range() Function
For loops and the range() function
We often want to write a loop where the variables ranges
over some sequence of numbers. The range() function
returns a list of numbers from 0 up to but not including the
number we pass to it.
range(5) returns [0,1,2,3,4]
So we can say:
for x in range(5):
print x
(There are several other forms of range() that provide
variants  of  this  functionality…)
xrange() returns an iterator that provides the same
functionality more efficiently
61
A. Ferrari (ISTI) Requirements Engineering 68 / 88
For Loops and range() Function
Abuse of the range() function
Don't use range() to iterate over a sequence solely to have
the index and elements available at the same time
Avoid:
for i in range(len(mylist)):
print i, mylist[i]
Instead:
for (i, item) in enumerate(mylist):
print i, item
This is an example of an anti-pattern in Python
For more, see:
http://www.seas.upenn.edu/~lignos/py_antipatterns.html
http://stackoverflow.com/questions/576988/python-specific-antipatterns-and-bad-
practices
A. Ferrari (ISTI) Requirements Engineering 69 / 88
List Comprehensions (remarkably cool)
List Comprehensions 1
A powerful feature of the Python language.
Generate a new list by applying a function to every member
of an original list.
Python programmers use list comprehensions extensively.
You’ll  see  many  of  them  in  real  code.
[ expression for name in list ]
64
A. Ferrari (ISTI) Requirements Engineering 70 / 88
List Comprehensions (remarkably cool)
List Comprehensions 2
>>> li = [3, 6, 2, 7]
>>> [elem*2 for elem in li]
[6, 12, 4, 14]
[ expression for name in list ]
Where expression is some calculation or operation
acting upon the variable name.
For each member of the list, the list comprehension
1. sets name equal to that member, and
2. calculates a new value using expression,
It then collects these new values into a list which is the
return value of the list comprehension.
[ expression for name in list ]
65
A. Ferrari (ISTI) Requirements Engineering 71 / 88
List Comprehensions (remarkably cool)
List Comprehensions 3
If the elements of list are other collections, then
name can be replaced by a collection of names
that  match  the  “shape”  of  the  list members.
>>> li =  [(‘a’,  1),  (‘b’,  2),  (‘c’,  7)]
>>> [ n * 3 for (x, n) in li]
[3, 6, 21]
[ expression for name in list ]
66
A. Ferrari (ISTI) Requirements Engineering 72 / 88
Filtered List Comprehensions (remarkably cool)
Filtered List Comprehension 1
Filter determines whether expression is performed
on each member of the list.
When processing each element of list, first check if
it satisfies the filter condition.
If the filter condition returns False, that element is
omitted from the list before the list comprehension
is evaluated.
[ expression for name in list if filter]
67
A. Ferrari (ISTI) Requirements Engineering 73 / 88
Filtered List Comprehensions (remarkably cool)
>>> li = [3, 6, 2, 7, 1, 9]
>>> [elem * 2 for elem in li if elem > 4]
[12, 14, 18]
Only 6, 7, and 9 satisfy the filter condition.
So, only 12, 14, and 18 are produced.
Filtered List Comprehension 2
[ expression for name in list if filter]
A. Ferrari (ISTI) Requirements Engineering 74 / 88
Functions
First line with less
indentation is considered to be
outside of the function definition.
Defining Functions
No declaration of types of arguments or result
def get_final_answer(filename):
"""Documentation String"""
line1
line2
return total_counter
...
Function definition begins with def Function name and its arguments.
‘return’  indicates  the  
value to be sent back to the caller.
Colon.
72
A. Ferrari (ISTI) Requirements Engineering 75 / 88
Function Call
Calling a Function
>>> def myfun(x, y):
return x * y
>>> myfun(3, 4)
12
73
A. Ferrari (ISTI) Requirements Engineering 76 / 88
Import
87
Import I
import somefile
Everything in somefile.py can be referred to by:
somefile.className.method(“abc”)
somefile.myFunction(34)
from somefile import *
Everything in somefile.py can be referred to by:
className.method(“abc”)
myFunction(34)
Careful! This can overwrite the definition of an existing
function or variable!
A. Ferrari (ISTI) Requirements Engineering 77 / 88
Python NLTK Basic Concepts
http://textminingonline.com/
dive-into-nltk-part-i-getting-started-with-nltk
A. Ferrari (ISTI) Requirements Engineering 78 / 88
Python NLTK Main Features
Sentence and word tokenization
POS tagging
Chunking and Named Entity Recognition (NER)
Text Classification
Many included corpora
A. Ferrari (ISTI) Requirements Engineering 79 / 88
Python NLTK - sentence tokenization (splitting)
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("hello this is nltk")
[’hello this is nltk’]
>>> sent_tokenize("hello. this is nltk")
[’hello.’, ’this is nltk’]
A. Ferrari (ISTI) Requirements Engineering 80 / 88
Python NLTK - word tokenization
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("this is nltk")
[’this’, ’is’, ’nltk’]
A. Ferrari (ISTI) Requirements Engineering 81 / 88
Python NLTK - word tokenization with punctuation
>>> word_tokenize("The woman’s car")
[’The’, ’woman’, "’s", ’car’]
>>> from nltk.tokenize import wordpunct_tokenize
>>> wordpunct_tokenize("The woman’s car")
[’The’, ’woman’, "’", ’s’, ’car’]
A. Ferrari (ISTI) Requirements Engineering 82 / 88
Python NLTK - stemming
>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem(’requirements’)
’requir’
A. Ferrari (ISTI) Requirements Engineering 83 / 88
Python NLTK - lemmatizer
>>> from nltk.stem import WordNetLemmatizer
>>> wordnet_lemmatizer = WordNetLemmatizer()
>>> wordnet_lemmatizer.lemmatize(’requirements’)
’requirement’
>>> wordnet_lemmatizer.lemmatize(’is’)
’is’
>>> wordnet_lemmatizer.lemmatize(’is’, pos=’v’)
’be’
A. Ferrari (ISTI) Requirements Engineering 84 / 88
Python NLTK - POS tagging
>>> t = word_tokenize("The system shall be nice")
>>> from nltk import pos_tag
>>> pos_tag(t)
[(’The’, ’DT’), (’system’, ’NN’),
(’shall’, ’MD’), (’be’, ’VB’), (’nice’, ’JJ’)]
A. Ferrari (ISTI) Requirements Engineering 85 / 88
Python NLTK - POS tagging
How can I know the meaning of the tags?
>>> import nltk
>>> nltk.help.upenn_tagset(’DT’)
DT: determiner
all an another any both del each either every half la many much nary
neither no some such that the them these this those
>>> nltk.help.upenn_tagset(’NN’)
NN: noun, common, singular or mass
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humour falloff slick wind hyena override subhumanity
machinist ...
>>> nltk.help.upenn_tagset(’NNP’)
NNP: noun, proper, singular
Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
Shannon A.K.C. Meltex Liverpool ...
A. Ferrari (ISTI) Requirements Engineering 86 / 88
Python NLTK - stopword removal
>>> from nltk.corpus import stopwords
>>> stoplist = stopwords.words(’english’)
>>> sentence = ’The system shall be usable’
>>> print [i for i in word_tokenize(sentence) if i not in stoplist]
[’The’, ’system’, ’shall’, ’usable’]
>>> print [i for i in word_tokenize(sentence.lower())
if i not in stoplist]
[’system’, ’shall’, ’usable’]
A. Ferrari (ISTI) Requirements Engineering 87 / 88
When to use What
GATE is good to annotate documents, and find patterns
Python NLTK is good for fast prototyping of analysis tools
If you have to extract information from text: use GATE
If you have to do more fine-grained analyses, and you do not need
to build a well-structured program: use Python NLTK
If you have to build a structured program: study the OpenNLP
library (Java)
A. Ferrari (ISTI) Requirements Engineering 88 / 88

Requirements Engineering: focus on Natural Language Processing, Lecture 2

  • 1.
    Requirements Engineering: Focus onNatural Language Requirements Processing Alessio Ferrari1 1ISTI-CNR, Pisa, Italy January 15, 2016 A. Ferrari (ISTI) Requirements Engineering 1 / 88 cf. https://docplayer.net/16051752-Introduction-to-programming-languages-and- techniques-xkcd-com-full-python-tutorial.html
  • 2.
    Objectives of theCourse Objectives: lesson 1 Understand what are requirements, specifications and domain assertions Understand the problems behind requirements elicitation and SRS definition Learn how to write a decent SRS Objectives: lesson 2 Learn the basics of Natual Language Processing (NLP) Learn the relation between NLP and Requirements Learn the basics of GATE to detect NL ambiguities in requirements Learn the basics of Python NLTK (for further manipulation of NL requirements) A. Ferrari (ISTI) Requirements Engineering 2 / 88
  • 3.
    NLP and RequirementsEngineering A. Ferrari (ISTI) Requirements Engineering 3 / 88
  • 4.
    Natural Language Processing Technologiesenabling extraction and manipulation of information from Natural Language (NL) Dan$Jurafsky$ Language(Technology( Coreference$resoluIon$ QuesIon$answering$(QA)$ PartOofOspeech$(POS)$tagging$ Word$sense$disambiguaIon$(WSD)$ Paraphrase$ Named$enIty$recogniIon$(NER)$ Parsing$ SummarizaIon$ InformaIon$extracIon$(IE)$ Machine$translaIon$(MT)$ Dialog$ SenIment$analysis$ $$$ mostly$solved$ making$good$progress$ sIll$really$hard$ Spam$detecIon$ Let’s$go$to$Agra!$ Buy$V1AGRA$…$ ✓ ✗ Colorless$$$green$$$ideas$$$sleep$$$furiously.$ $$$$$ADJ$$$$$$$$$ADJ$$$$NOUN$$VERB$$$$$$ADV$ Einstein$met$with$UN$officials$in$Princeton$ PERSON$$$$$$$$$$$$$$ORG$$$$$$$$$$$$$$$$$$$$$$LOC$ You’re$invited$to$our$dinner$ party,$Friday$May$27$at$8:30$ Party$ May$27$ add$ Best$roast$chicken$in$San$Francisco!$ The$waiter$ignored$us$for$20$minutes.$ Carter$told$Mubarak$he$shouldn’t$run$again.$ I$need$new$baWeries$for$my$mouse.$ The$13th$Shanghai$InternaIonal$Film$FesIval…$ 13 … The$Dow$Jones$is$up$ Housing$prices$rose$ Economy$is$ good$ Q.$How$effecIve$is$ibuprofen$in$reducing$ fever$in$paIents$with$acute$febrile$illness?$ I$can$see$Alcatraz$from$the$window!$ XYZ$acquired$ABC$yesterday$ ABC$has$been$taken$over$by$XYZ$ Where$is$CiIzen$Kane$playing$in$SF?$$ Castro$Theatre$at$7:30.$Do$ you$want$a$Icket?$ The$S&P500$jumped$ Dan$Jurafsky$ Language(Technology( Coreference$resoluIon$ QuesIon$answering$(QA)$ PartOofOspeech$(POS)$tagging$ Word$sense$disambiguaIon$(WSD)$ Paraphrase$ Named$enIty$recogniIon$(NER)$ Parsing$ SummarizaIon$ InformaIon$extracIon$(IE)$ Machine$translaIon$(MT)$ Dialog$ SenIment$analysis$ $$$ mostly$solved$ making$good$progress$ sIll$really$hard$ Spam$detecIon$ Let’s$go$to$Agra!$ Buy$V1AGRA$…$ ✓ ✗ Colorless$$$green$$$ideas$$$sleep$$$furiously.$ $$$$$ADJ$$$$$$$$$ADJ$$$$NOUN$$VERB$$$$$$ADV$ Einstein$met$with$UN$officials$in$Princeton$ PERSON$$$$$$$$$$$$$$ORG$$$$$$$$$$$$$$$$$$$$$$LOC$ You’re$invited$to$our$dinner$ party,$Friday$May$27$at$8:30$ Party$ May$27$ add$ Best$roast$chicken$in$San$Francisco!$ The$waiter$ignored$us$for$20$minutes.$ Carter$told$Mubarak$he$shouldn’t$run$again.$ I$need$new$baWeries$for$my$mouse.$ The$13th$Shanghai$InternaIonal$Film$FesIval…$ 13 … The$Dow$Jones$is$up$ Housing$prices$rose$ Economy$is$ good$ Q.$How$effecIve$is$ibuprofen$in$reducing$ fever$in$paIents$with$acute$febrile$illness?$ I$can$see$Alcatraz$from$the$window!$ XYZ$acquired$ABC$yesterday$ ABC$has$been$taken$over$by$XYZ$ Where$is$CiIzen$Kane$playing$in$SF?$$ Castro$Theatre$at$7:30.$Do$ you$want$a$Icket?$ The$S&P500$jumped$ Dan$Jurafsky$ Language(Technology( Coreference$resoluIon$ QuesIon$answering$(QA)$ PartOofOspeech$(POS)$tagging$ Word$sense$disambiguaIon$(WSD)$ Paraphrase$ Named$enIty$recogniIon$(NER)$ Parsing$ SummarizaIon$ InformaIon$extracIon$(IE)$ Machine$translaIon$(MT)$ Dialog$ SenIment$analysis$ $$$ mostly$solved$ making$good$progress$ sIll$really$hard$ Spam$detecIon$ Let’s$go$to$Agra!$ Buy$V1AGRA$…$ ✓ ✗ Colorless$$$green$$$ideas$$$sleep$$$furiously.$ $$$$$ADJ$$$$$$$$$ADJ$$$$NOUN$$VERB$$$$$$ADV$ Einstein$met$with$UN$officials$in$Princeton$ PERSON$$$$$$$$$$$$$$ORG$$$$$$$$$$$$$$$$$$$$$$LOC$ You’re$invited$to$our$dinner$ party,$Friday$May$27$at$8:30$ Party$ May$27$ add$ Best$roast$chicken$in$San$Francisco!$ The$waiter$ignored$us$for$20$minutes.$ Carter$told$Mubarak$he$shouldn’t$run$again.$ I$need$new$baWeries$for$my$mouse.$ The$13th$Shanghai$InternaIonal$Film$FesIval…$ 13 … The$Dow$Jones$is$up$ Housing$prices$rose$ Economy$is$ good$ Q.$How$effecIve$is$ibuprofen$in$reducing$ fever$in$paIents$with$acute$febrile$illness?$ I$can$see$Alcatraz$from$the$window!$ XYZ$acquired$ABC$yesterday$ ABC$has$been$taken$over$by$XYZ$ Where$is$CiIzen$Kane$playing$in$SF?$$ Castro$Theatre$at$7:30.$Do$ you$want$a$Icket?$ The$S&P500$jumped$ A. Ferrari (ISTI) Requirements Engineering 4 / 88
  • 5.
    Natural Language Processing Languageis inherently ambiguous and hard to be handled I Imagine a well-defined syntax and semantics for an input message: you can write software to process it I With NL you do not have a well defined syntax and semantics: there is always some exception Rule-based approaches are hard to build Luckily, there’s machine-learning! . . . but I :( Large amounts of data are needed for training/test/validation I :( Tagging data requires domain expertise I :( Requirements use domain-specific jargon (features based on terms might work only for a specific company) I :( Requirements structure vary from company to company (features based on syntax might need adjustments) I :( Some tasks (e.g., ambiguity) have too many negative cases (i.e., well-written), and too few positive (i.e., ambiguous) A. Ferrari (ISTI) Requirements Engineering 5 / 88
  • 6.
    4 Dimensions ofNatural Language Requirements Processing Discipline: requirements have to be non-ambiguous but readable, no equivalent requirement Dynamism: requirements evolve, and have to be traced and categorised Domain Knowledge: domain knowledge has to be gathered to understand requirements, proper stakeholders have to be identified Data-sets: tagged data-sets are required, requirements are private A. Ferrari (ISTI) Requirements Engineering 6 / 88
  • 7.
    4-D in theRequirements Process Elicitation Analysis ValidationManagement Traceability Domain Term Identification Stakeholder Identification Domain Scoping Ambiguity Readability ComplianceCategorisation Equivalence D I S C I P L I N E DOMAIN KNOWLEDGE DYNAMISM DATA-SETS Retrieval of Internal Requirements Market Analysis A. Ferrari (ISTI) Requirements Engineering 7 / 88
  • 8.
    Focus on Discipline Anambiguity in a requirement might cause severe consequences along the development False negatives are NOT tolerable (i.e., I have to discover ALL potential ambiguities) It is preferable to have a certain amount of false positive cases, which the engineer can easily discard Rec = | true ambiguities items retrieved | / | true ambiguities | Prec = | true ambiguities items retrieved | / | items retrieved | Hence, a system for ambiguity detection has to provide 100% recall! Rule-based approaches are preferred However, for some tasks, rule-based approaches give really too many false positives. A. Ferrari (ISTI) Requirements Engineering 8 / 88
  • 9.
    What you’ll learn Rule-basedapproaches Basic tools that you can use to implement rule-based approaches These tools can be used as a basis if you want to move to machine learning A. Ferrari (ISTI) Requirements Engineering 9 / 88
  • 10.
    NLP Minimal Conceptsand Terminology A. Ferrari (ISTI) Requirements Engineering 10 / 88
  • 11.
    NLP Minimal Concepts Everytask in NLP needs to perform Text Normalization: Segmenting/Tokenizing words in the text Normalizing word formats (Stemming and Lemmatization) Segmenting/Tokenizing sentences Besides that, many tasks need Part-Of-Speech (POS) Tagging A. Ferrari (ISTI) Requirements Engineering 11 / 88
  • 12.
    NLP Minimal Concepts- Tokenization (or Word Segmentation) A token is an occurrence of a WORD in a stream of text Tokenization split a stream of text into a list of WORDS Dan(Jurafsky( How(many(words?( they(lay(back(on(the(San(Francisco(grass(and(looked(at(the(stars(and(their( •  Type:(an(element(of(the(vocabulary.( •  Token:(an(instance(of(that(type(in(running(text.( •  How(many?( •  15(tokens((or(14)( •  13(types((or(12)((or(11?)( A. Ferrari (ISTI) Requirements Engineering 12 / 88
  • 13.
    NLP Minimal Concepts- Multi-word Terms Multi-word Terms Example: “The wayside equipment shall respond within 5 seconds” wayside equipment is a multi-word term like San Francisco We do not discuss about multi-word terms, but keep them in mind because they are frequent in requirements A. Ferrari (ISTI) Requirements Engineering 13 / 88
  • 14.
    NLP Minimal Concepts- Lemmatization Dan(Jurafsky( Lemma4za4on( •  Reduce(inflecGons(or(variant(forms(to(base(form( •  am,"are,(is"→(be( •  car,"cars,"car's,(cars'(→(car" •  the"boy's"cars"are"different"colors(→(the"boy"car"be"different"color" •  LemmaGzaGon:(have(to(find(correct(dicGonary(headword(form( •  Machine(translaGon( •  Spanish(quiero((‘I(want’),(quieres((‘you(want’)(same(lemma(as(querer(‘want’( A. Ferrari (ISTI) Requirements Engineering 14 / 88
  • 15.
    NLP Minimal Concepts- Stemming Dan(Jurafsky( Stemming( •  Reduce(terms(to(their(stems(in(informaGon(retrieval( •  Stemming(is(crude(chopping(of(affixes( •  language(dependent( •  e.g.,(automate(s),+automaGc,+automaGon(all(reduced(to(automat.( for"example"compressed"" and"compression"are"both"" accepted"as"equivalent"to"" compress.( for(exampl(compress(and( compress(ar(both(accept( as(equival(to(compress( A. Ferrari (ISTI) Requirements Engineering 15 / 88
  • 16.
    NLP Minimal Concepts- Sentence SegmentationDan(Jurafsky( Sentence(Segmenta4on( •  !,(?(are(relaGvely(unambiguous( •  Period(“.”(is(quite(ambiguous( •  Sentence(boundary( •  AbbreviaGons(like(Inc.(or(Dr.( •  Numbers(like(.02%(or(4.3( •  Build(a(binary(classifier( •  Looks(at(a(“.”( •  Decides(EndOfSentence/NotEndOfSentence( •  Classifiers:(handkwri@en(rules,(regular(expressions,(or(machineklearning( A. Ferrari (ISTI) Requirements Engineering 16 / 88
  • 17.
    NLP Minimal Concepts- POS Tagging Open class (lexical) words Closed class (functional) Nouns Verbs Proper Common Modals Main Adjectives Adverbs Prepositions Particles Determiners Conjunctions Pronouns … more … more IBM Italy cat / cats snow see registered can had old older oldest slowly to with off up the some and or he its Numbers 122,312 one Interjections Ow Eh A. Ferrari (ISTI) Requirements Engineering 17 / 88
  • 18.
    NLP Minimal Concepts- POS TaggingChristopher"Manning" POS-Tagging- •  Words"oYen"have"more"than"one"POS:"back% •  The"back"door"="JJ" •  On"my"back"="NN" •  Win"the"voters"back"="RB" •  Promised"to"back"the"bill"="VB" •  The"POS"tagging"problem"is"to"determine"the"POS"tag"for"a" par1cular"instance"of"a"word." https://www.ling.upenn.edu/courses/Fall_2003/ ling001/penn_treebank_pos.html A. Ferrari (ISTI) Requirements Engineering 18 / 88
  • 19.
    GATE Basic Concepts A.Ferrari (ISTI) Requirements Engineering 19 / 88
  • 20.
    IR vs IE InformationRetrieval (IR): pulls documents from large corpora Information Extraction (IE): retrieves facts and structured information from large corpora IR returns documents containing the relevant information somewhere (normally fast) IE returns precise and structured information (can be slow) GATE (General Architecture for Text Engineering) supports IE Potential Usage Entity, Events, Relation extraction Enrich features for retrieval Annotate documents for ML Ambiguity identification with rule-based approaches A. Ferrari (ISTI) Requirements Engineering 20 / 88
  • 21.
    Document, Corpus, Annotation,Feature Document = text + annotations + features Corpus = collection of documents Linguistic information in documents is encoded in the form of annotations (like colored mark-ups) Annotations have features with relative types and values Example Annotation: Sentence Length Feature 1: Length in Characters, Value 1 = 100 Feature 2: Length in Token, Value 2 = 15 A. Ferrari (ISTI) Requirements Engineering 21 / 88
  • 22.
    Document, Corpus, Annotation,Feature A. Ferrari (ISTI) Requirements Engineering 22 / 88
  • 23.
    Processing Resources (PR) PRare algorithms that make NLP easy Document Reset PR - delete Annotations ANNIE English Tokeniser - identify tokens (words, numbers, etc.) ANNIE Sentence Splitter - identify sentence boundaries ANNIE Gazetteer - identify specific tokens in a list ANNIE POS Tagger - identify part-of-speech (POS), like name, adjective, etc. JAPE Transducers - user-defined annotations based on regular expressions over annotations ANNIE collects all the algorithms and run them in PIPELINE A. Ferrari (ISTI) Requirements Engineering 23 / 88
  • 24.
    Using Processing Resourcesin GATE Open GATE Language Resources > New > GATE Document Load Source Document: exercises/requirements.txt Right-click> New Corpus from this document Right-click>Processing Resources (then select the resources reported in the next slides) A. Ferrari (ISTI) Requirements Engineering 24 / 88
  • 25.
    ANNIE English Tokenizer ProducesToken annotations A. Ferrari (ISTI) Requirements Engineering 25 / 88
  • 26.
    ANNIE Sentence Splitter ProducesSentence annotations A. Ferrari (ISTI) Requirements Engineering 26 / 88
  • 27.
    Gazetteer Produces Lookup annotations A.Ferrari (ISTI) Requirements Engineering 27 / 88
  • 28.
    ANNIE POS Tagger Modifiesthe Token annotations with a new feature category = NN, JJ, VB, etc. A. Ferrari (ISTI) Requirements Engineering 28 / 88
  • 29.
    JAPE Transducers Gazetteer listsare designed for annotating simple, regular features Even identifying simple patterns like e-mails is impossible with a Gazetteer What is JAPE? JAPE provides pattern matching in GATE Each JAPE rule consists of: I LHS which contains patterns to match I RHS which details the annotations (and optionally features) to be created A. Ferrari (ISTI) Requirements Engineering 29 / 88
  • 30.
    Example I want tofind all the occurrences of the term “level” followed by a number (level 1, level 2, etc.) A. Ferrari (ISTI) Requirements Engineering 30 / 88
  • 31.
    Example (continue) Making itcase-insensitive A. Ferrari (ISTI) Requirements Engineering 31 / 88
  • 32.
    Example (continue) Adding featuresand values A. Ferrari (ISTI) Requirements Engineering 32 / 88
  • 33.
    GATE for AmbiguityDetection A. Ferrari (ISTI) Requirements Engineering 33 / 88
  • 34.
    Steps - DetectingVagueness Open GATE Language Resources > New > GATE Document Load Source Document: exercises/requirements.txt Right-click> New Corpus from this document Copy file exercises/vagueness.lst to /GATE_Developer_8.0/plugins/ANNIE/resources/gazetteer Open file lists.def (here you have all the file pointers to the lists checked by the Look-up) Add the line: vagueness.lst:vague_term A. Ferrari (ISTI) Requirements Engineering 34 / 88
  • 35.
    Steps - DetectingVagueness (continue) Go back to GATE Processing Resources>New>ANNIE Gazetteer, and give it a name Check: case sensitive = false Click on the Gazetteer, and check the presence of vagueness.lst Applications>ANNIE Move the Gazetteer to the right panel Run! A. Ferrari (ISTI) Requirements Engineering 35 / 88
  • 36.
    Steps - AnnotatingVagueness Processing Resources>New>Jape Transducer Load annotate_vagueness.jape Applications>ANNIE Move the JAPE rule to the right panel Run! A. Ferrari (ISTI) Requirements Engineering 36 / 88
  • 37.
    Coordination Ambiguities Detect allsentences that include a potential coordination ambiguity Coordination ambiguities are of two types: I Example: The system shall produce the plot or the data-log and the legend. I Example: New employees and directors shall be provided with a user-name and password. A. Ferrari (ISTI) Requirements Engineering 37 / 88
  • 38.
    Coordination Ambiguities -Jape Rule 0 Retrieve all occurrences of And / Or Phase: MatchAndOr Input: Token Options: control = appelt //Operator ==~ returns only whole string matches //Operator (?i) tells that the matching is case insensitive //Operator | is the logical "or" operator Rule: checkAndOr Priority: 1 ( {Token.string ==~ "(?i)and"} | {Token.string ==~ "(?i)or"} ):coord --> :coord.AndOr = {} A. Ferrari (ISTI) Requirements Engineering 38 / 88
  • 39.
    Coordination Ambiguities -Jape Rule 1 Annotate all the sequences of And/Or in the same sentence Phase: MatchCoordinationSequences Input: Split AndOr Options: control = appelt //Note that, having Split among the input Annotations allows //us to identify sequences of And/Or in the same sentence //The ’+’ operator means "one or more occurrences" //The ’*’ operator means "zero or more occurrences" //The ’?’ operator means "zero or one occurrences" Rule: checkCoordinationSequences Priority: 1 ( {AndOr} ({AndOr})+ ):coordSequence --> :coordSequence.AndOrSequence = {} A. Ferrari (ISTI) Requirements Engineering 39 / 88
  • 40.
    Coordination Ambiguities -Jape Rule 2 Annotate all the sentences with And/Or sequences Phase: MatchCoordinationSentences Input: Sentence AndOrSequence Options: control = appelt //A "contains" B: searches for //annotations A that contain annotations B Rule: checkCoordinationSentences Priority: 1 ( {Sentence contains AndOrSequence} ):coordSentence --> :coordSentence.AndOrSentence = {} A. Ferrari (ISTI) Requirements Engineering 40 / 88
  • 41.
    Your Turn! -Detect Coordination Ambiguities of the Second Type Add the Example: new employees and directors shall be provided with a user-name and password Remember that ANNIE POS Tagger gives you the category of the Token JJ = Adjective, NN = Noun, NNS = Plural Noun More information on the tag-set, and many other things, can be found in: https://gate.ac.uk/sale/tao/tao.pdf A. Ferrari (ISTI) Requirements Engineering 41 / 88
  • 42.
    Python and NLTK A.Ferrari (ISTI) Requirements Engineering 42 / 88
  • 43.
    Python Basic Concepts PythonOnline! http://www.learnpython.org/ Tutorial: https://docs.python.org/2/tutorial/ A. Ferrari (ISTI) Requirements Engineering 43 / 88
  • 44.
    Python - MainFeatures Interpreted language Dynamically typed (variables do not have a predefined type!) Rich, built-in collection types I Lists I Tuples I Dictionaries I Sets Concise My Humble Opinion It is perfect if you want to develop something quickly It is a mess if you want to develop a structured software A. Ferrari (ISTI) Requirements Engineering 44 / 88
  • 45.
    Which Python? Python 2.7:that’s the one we’ll use here I Last stable release before version 3 I Implements parts of the version 3 features I Fully backwards compatible Python 3.0: that’s the one you should use in the future I Released on Dec. 3, 2008 I Many changes (also incompatible ones) I . . . But: few important third-party libraries are not yet compatible with Python 3.0 I Version compatible with NLTK 3.0: either 2.7, or 3.2+ Hint: if you want to use NLTK, go to: http://www.nltk.org/install.html, and check the links to the additional libraries and python recommended releases, before downloading Python A. Ferrari (ISTI) Requirements Engineering 45 / 88
  • 46.
    Python - Example ACode Sample (in IDLE) x = 34 - 23 # A comment. y = “Hello” # Another one. z = 3.45 if z == 3.45 or y == “Hello”: x = x + 1 y = y + “ World” # String concat. print x print y A. Ferrari (ISTI) Requirements Engineering 46 / 88
  • 47.
    Python - Enoughto Understand the Code Indentation matters to the meaning of the code The first assignment to a variable creates it (x = 3) I Variable types do not need to be declared I Python figures out the variable types on its own Logical operators are words (and, or, not), NOT symbols Simple printing can be done with print Use a newline to end a line of code Use when must go to next line prematurely A. Ferrari (ISTI) Requirements Engineering 47 / 88
  • 48.
    Python - Assignment Bindinga variable in Python means setting a name to hold a reference to some object Hence, assignment creates references, not copies (like in Java) An object is deleted (by the garbage collector) once it becomes unreachable A. Ferrari (ISTI) Requirements Engineering 48 / 88
  • 49.
    Python - LanguageFeatures Indentation instead of braces Several sequence types I Strings (immutable): ’This is a string’, and “Also this is a string” I Tuples (immutable): (’the’, ’system’, ’shall’, ’measure’, ’the’, ’heartbeat’) I Lists (mutable): [’the’, ’system’, ’shall’, ’measure’, ’the’, ’heartbeat’] Lists and Tuples can contain anything: I ((’REQ’, 1), ’the system shall measure the heartbeat’) I [((’REQ’, 1), ’the system shall measure the heartbeat’), ((’REQ’, 2), ’the system shall alert the nurse’))] A. Ferrari (ISTI) Requirements Engineering 49 / 88
  • 50.
    Python - SequenceTypes Tuples are defined using parentheses (and commas). tu = (23, ‘abc’, 4.56, (2,3), ‘def’) Lists are defined using square brackets (and commas) li = [“abc”, 34, 4.34, 23] Strings are defined using quotes (“, ‘, or “““) st = “Hello World” st = ‘Hello World’ st = “““Hello World ””” A. Ferrari (ISTI) Requirements Engineering 50 / 88
  • 51.
    Python - SequenceTypes We can access individual members of a tuple, list, or string using square bracket “array” notation. We can access individual members of a tuple, lis using  square  bracket  “array”  notation.   Note  that  all  are  0  based…   >>> tu = (23, ‘abc’, 4.56, (2,3), ‘def’) >>> tu[1] # Second item in the tuple. ‘abc’ >>> li = [“abc”, 34, 4.34, 23] >>> li[1] # Second item in the list. 34 >>> st = “Hello  World” >>> st[1] # Second character in string. ‘e’ A. Ferrari (ISTI) Requirements Engineering 51 / 88
  • 52.
    Python - NegativeIndicesNegative indices >>> t = (23, ‘abc’, 4.56, (2,3), ‘def’) Positive index: count from the left, starting with 0. >>> t[1] ‘abc’ Negative lookup: count from right, starting with –1. >>> t[-3] 4.56 A. Ferrari (ISTI) Requirements Engineering 52 / 88
  • 53.
    Python - Slicing 30 Slicing:Return Copy of a Subset 1 >>> t = (23, ‘abc’, 4.56, (2,3), ‘def’) Return a copy of the container with a subset of the original members. Start copying at the first index, and stop copying before the second index. >>> t[1:4] (‘abc’,  4.56,  (2,3)) You can also use negative indices when slicing. >>> t[1:-1] (‘abc’,  4.56,  (2,3)) Optional argument allows selection of every nth item. >>> t[1:-1:2] (‘abc’,  (2,3)) A. Ferrari (ISTI) Requirements Engineering 53 / 88
  • 54.
    Python - ‘in’OperatorThe  ‘in’  Operator Boolean test whether a value is inside a collection (often called a container in Python: >>> t = [1, 2, 4, 5] >>> 3 in t False >>> 4 in t True >>> 4 not in t False For strings, tests for substrings >>> a = 'abcde' >>> 'c' in a True >>> 'cd' in a True >>> 'ac' in a False Be careful: the in keyword is also used in the syntax of for loops and list comprehensions. A. Ferrari (ISTI) Requirements Engineering 54 / 88
  • 55.
    Python - ‘+’Operator 34 The + Operator The + operator produces a new tuple, list, or string whose value is the concatenation of its arguments. Extends concatenation from strings to other types >>> (1, 2, 3) + (4, 5, 6) (1, 2, 3, 4, 5, 6) >>> [1, 2, 3] + [4, 5, 6] [1, 2, 3, 4, 5, 6] >>> “Hello”  + “  ”  + “World” ‘Hello  World’ A. Ferrari (ISTI) Requirements Engineering 55 / 88
  • 56.
    Lists - Mutable 36 Lists:Mutable >>> li = [‘abc’, 23, 4.34, 23] >>> li[1] = 45 >>> li [‘abc’,  45,  4.34,  23] We can change lists in place. Name li still points to the same memory reference when we’re  done.   A. Ferrari (ISTI) Requirements Engineering 56 / 88
  • 57.
    Tuples - Immutable 37 Tuples:Immutable >>> t = (23, ‘abc’, 4.56, (2,3), ‘def’) >>> t[2] = 3.14 Traceback (most recent call last): File "<pyshell#75>", line 1, in -toplevel- tu[2] = 3.14 TypeError: object doesn't support item assignment You  can’t  change  a  tuple.   You can make a fresh tuple and assign its reference to a previously used name. >>> t = (23, ‘abc’, 3.14, (2,3), ‘def’) The  immutability  of  tuples  means  they’re  faster  than  lists.   A. Ferrari (ISTI) Requirements Engineering 57 / 88
  • 58.
    Operations on Lists 38 Operationson Lists Only 1 >>> li = [1, 11, 3, 4, 5] >>> li.append(‘a’) # Note the method syntax >>> li [1,  11,  3,  4,  5,  ‘a’] >>> li.insert(2,  ‘i’) >>>li [1,  11,  ‘i’,  3,  4,  5,  ‘a’] A. Ferrari (ISTI) Requirements Engineering 58 / 88
  • 59.
    Dictionaries Dictionaries store amapping between a set of keys and a set of values Keys can be any immutable type Values can be any type A. Ferrari (ISTI) Requirements Engineering 59 / 88
  • 60.
    Dictionaries - Creatingand AccessingCreating and accessing dictionaries >>> d = {‘user’:‘bozo’, ‘pswd’:1234} >>> d[‘user’] ‘bozo’ >>> d[‘pswd’] 1234 >>> d[‘bozo’] Traceback (innermost last): File  ‘<interactive  input>’  line  1,  in  ? KeyError: bozo A. Ferrari (ISTI) Requirements Engineering 60 / 88
  • 61.
    Dictionaries - Updating UpdatingDictionaries >>> d = {‘user’:‘bozo’, ‘pswd’:1234} >>> d[‘user’] = ‘clown’ >>> d {‘user’:‘clown’,  ‘pswd’:1234} Keys must be unique. Assigning to an existing key replaces its value. >>> d[‘id’] = 45 >>> d {‘user’:‘clown’,  ‘id’:45,  ‘pswd’:1234} Dictionaries are unordered New entry might appear anywhere in the output. (Dictionaries work by hashing) A. Ferrari (ISTI) Requirements Engineering 61 / 88
  • 62.
    Dictionaries - Removing Removingdictionary entries >>> d = {‘user’:‘bozo’, ‘p’:1234, ‘i’:34} >>> del d[‘user’] # Remove one. Note that del is # a function. >>> d {‘p’:1234,  ‘i’:34} >>> d.clear() # Remove all. >>> d {} >>> a=[1,2] >>> del a[1] # (del also works on lists) >>> a [1] 47 A. Ferrari (ISTI) Requirements Engineering 62 / 88
  • 63.
    Dictionaries - UsefulMethods Useful Accessor Methods >>> d = {‘user’:‘bozo’, ‘p’:1234, ‘i’:34} >>> d.keys() # List of current keys [‘user’,  ‘p’,  ‘i’] >>> d.values() # List of current values. [‘bozo’,  1234,  34] >>> d.items() # List of item tuples. [(‘user’,‘bozo’),  (‘p’,1234),  (‘i’,34)] A. Ferrari (ISTI) Requirements Engineering 63 / 88
  • 64.
    If Statements (asexpected) if Statements (as expected) if x == 3: print "X equals 3." elif x == 2: print "X equals 2." else: print "X equals something else." print "This  is  outside  the  ‘if’." Note: Use of indentation for blocks Colon (:) after boolean expression 54 A. Ferrari (ISTI) Requirements Engineering 64 / 88
  • 65.
    While Loops (asexpected) while Loops (as expected) >>> x = 3 >>> while x < 5: print x, "still in the loop" x = x + 1 3 still in the loop 4 still in the loop >>> x = 6 >>> while x < 5: print x, "still in the loop" >>> 55 A. Ferrari (ISTI) Requirements Engineering 65 / 88
  • 66.
    For Loops (verycool)For Loops 1 For-each  is  Python’s  only form of for loop A for loop steps through each of the items in a collection type,  or  any  other  type  of  object  which  is  “iterable” for <item> in <collection>: <statements> If <collection> is a list or a tuple, then the loop steps through each element of the sequence. If <collection> is a string, then the loop steps through each character of the string. for someChar in “Hello  World”: print someChar 59 A. Ferrari (ISTI) Requirements Engineering 66 / 88
  • 67.
    For Loops Items canbe more complex than a simple variable name <statements> <item> can be more complex than a single variable name. If the elements of <collection> are themselves collections, then <item> can match the structure of the elements. (We saw something similar with list comprehensions and with ordinary assignments.) for (x, y) in [(a,1), (b,2), (c,3), (d,4)]: print x 60 A. Ferrari (ISTI) Requirements Engineering 67 / 88
  • 68.
    For Loops andrange() Function For loops and the range() function We often want to write a loop where the variables ranges over some sequence of numbers. The range() function returns a list of numbers from 0 up to but not including the number we pass to it. range(5) returns [0,1,2,3,4] So we can say: for x in range(5): print x (There are several other forms of range() that provide variants  of  this  functionality…) xrange() returns an iterator that provides the same functionality more efficiently 61 A. Ferrari (ISTI) Requirements Engineering 68 / 88
  • 69.
    For Loops andrange() Function Abuse of the range() function Don't use range() to iterate over a sequence solely to have the index and elements available at the same time Avoid: for i in range(len(mylist)): print i, mylist[i] Instead: for (i, item) in enumerate(mylist): print i, item This is an example of an anti-pattern in Python For more, see: http://www.seas.upenn.edu/~lignos/py_antipatterns.html http://stackoverflow.com/questions/576988/python-specific-antipatterns-and-bad- practices A. Ferrari (ISTI) Requirements Engineering 69 / 88
  • 70.
    List Comprehensions (remarkablycool) List Comprehensions 1 A powerful feature of the Python language. Generate a new list by applying a function to every member of an original list. Python programmers use list comprehensions extensively. You’ll  see  many  of  them  in  real  code. [ expression for name in list ] 64 A. Ferrari (ISTI) Requirements Engineering 70 / 88
  • 71.
    List Comprehensions (remarkablycool) List Comprehensions 2 >>> li = [3, 6, 2, 7] >>> [elem*2 for elem in li] [6, 12, 4, 14] [ expression for name in list ] Where expression is some calculation or operation acting upon the variable name. For each member of the list, the list comprehension 1. sets name equal to that member, and 2. calculates a new value using expression, It then collects these new values into a list which is the return value of the list comprehension. [ expression for name in list ] 65 A. Ferrari (ISTI) Requirements Engineering 71 / 88
  • 72.
    List Comprehensions (remarkablycool) List Comprehensions 3 If the elements of list are other collections, then name can be replaced by a collection of names that  match  the  “shape”  of  the  list members. >>> li =  [(‘a’,  1),  (‘b’,  2),  (‘c’,  7)] >>> [ n * 3 for (x, n) in li] [3, 6, 21] [ expression for name in list ] 66 A. Ferrari (ISTI) Requirements Engineering 72 / 88
  • 73.
    Filtered List Comprehensions(remarkably cool) Filtered List Comprehension 1 Filter determines whether expression is performed on each member of the list. When processing each element of list, first check if it satisfies the filter condition. If the filter condition returns False, that element is omitted from the list before the list comprehension is evaluated. [ expression for name in list if filter] 67 A. Ferrari (ISTI) Requirements Engineering 73 / 88
  • 74.
    Filtered List Comprehensions(remarkably cool) >>> li = [3, 6, 2, 7, 1, 9] >>> [elem * 2 for elem in li if elem > 4] [12, 14, 18] Only 6, 7, and 9 satisfy the filter condition. So, only 12, 14, and 18 are produced. Filtered List Comprehension 2 [ expression for name in list if filter] A. Ferrari (ISTI) Requirements Engineering 74 / 88
  • 75.
    Functions First line withless indentation is considered to be outside of the function definition. Defining Functions No declaration of types of arguments or result def get_final_answer(filename): """Documentation String""" line1 line2 return total_counter ... Function definition begins with def Function name and its arguments. ‘return’  indicates  the   value to be sent back to the caller. Colon. 72 A. Ferrari (ISTI) Requirements Engineering 75 / 88
  • 76.
    Function Call Calling aFunction >>> def myfun(x, y): return x * y >>> myfun(3, 4) 12 73 A. Ferrari (ISTI) Requirements Engineering 76 / 88
  • 77.
    Import 87 Import I import somefile Everythingin somefile.py can be referred to by: somefile.className.method(“abc”) somefile.myFunction(34) from somefile import * Everything in somefile.py can be referred to by: className.method(“abc”) myFunction(34) Careful! This can overwrite the definition of an existing function or variable! A. Ferrari (ISTI) Requirements Engineering 77 / 88
  • 78.
    Python NLTK BasicConcepts http://textminingonline.com/ dive-into-nltk-part-i-getting-started-with-nltk A. Ferrari (ISTI) Requirements Engineering 78 / 88
  • 79.
    Python NLTK MainFeatures Sentence and word tokenization POS tagging Chunking and Named Entity Recognition (NER) Text Classification Many included corpora A. Ferrari (ISTI) Requirements Engineering 79 / 88
  • 80.
    Python NLTK -sentence tokenization (splitting) >>> from nltk.tokenize import sent_tokenize >>> sent_tokenize("hello this is nltk") [’hello this is nltk’] >>> sent_tokenize("hello. this is nltk") [’hello.’, ’this is nltk’] A. Ferrari (ISTI) Requirements Engineering 80 / 88
  • 81.
    Python NLTK -word tokenization >>> from nltk.tokenize import word_tokenize >>> word_tokenize("this is nltk") [’this’, ’is’, ’nltk’] A. Ferrari (ISTI) Requirements Engineering 81 / 88
  • 82.
    Python NLTK -word tokenization with punctuation >>> word_tokenize("The woman’s car") [’The’, ’woman’, "’s", ’car’] >>> from nltk.tokenize import wordpunct_tokenize >>> wordpunct_tokenize("The woman’s car") [’The’, ’woman’, "’", ’s’, ’car’] A. Ferrari (ISTI) Requirements Engineering 82 / 88
  • 83.
    Python NLTK -stemming >>> from nltk.stem.porter import PorterStemmer >>> porter_stemmer = PorterStemmer() >>> porter_stemmer.stem(’requirements’) ’requir’ A. Ferrari (ISTI) Requirements Engineering 83 / 88
  • 84.
    Python NLTK -lemmatizer >>> from nltk.stem import WordNetLemmatizer >>> wordnet_lemmatizer = WordNetLemmatizer() >>> wordnet_lemmatizer.lemmatize(’requirements’) ’requirement’ >>> wordnet_lemmatizer.lemmatize(’is’) ’is’ >>> wordnet_lemmatizer.lemmatize(’is’, pos=’v’) ’be’ A. Ferrari (ISTI) Requirements Engineering 84 / 88
  • 85.
    Python NLTK -POS tagging >>> t = word_tokenize("The system shall be nice") >>> from nltk import pos_tag >>> pos_tag(t) [(’The’, ’DT’), (’system’, ’NN’), (’shall’, ’MD’), (’be’, ’VB’), (’nice’, ’JJ’)] A. Ferrari (ISTI) Requirements Engineering 85 / 88
  • 86.
    Python NLTK -POS tagging How can I know the meaning of the tags? >>> import nltk >>> nltk.help.upenn_tagset(’DT’) DT: determiner all an another any both del each either every half la many much nary neither no some such that the them these this those >>> nltk.help.upenn_tagset(’NN’) NN: noun, common, singular or mass common-carrier cabbage knuckle-duster Casino afghan shed thermostat investment slide humour falloff slick wind hyena override subhumanity machinist ... >>> nltk.help.upenn_tagset(’NNP’) NNP: noun, proper, singular Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA Shannon A.K.C. Meltex Liverpool ... A. Ferrari (ISTI) Requirements Engineering 86 / 88
  • 87.
    Python NLTK -stopword removal >>> from nltk.corpus import stopwords >>> stoplist = stopwords.words(’english’) >>> sentence = ’The system shall be usable’ >>> print [i for i in word_tokenize(sentence) if i not in stoplist] [’The’, ’system’, ’shall’, ’usable’] >>> print [i for i in word_tokenize(sentence.lower()) if i not in stoplist] [’system’, ’shall’, ’usable’] A. Ferrari (ISTI) Requirements Engineering 87 / 88
  • 88.
    When to useWhat GATE is good to annotate documents, and find patterns Python NLTK is good for fast prototyping of analysis tools If you have to extract information from text: use GATE If you have to do more fine-grained analyses, and you do not need to build a well-structured program: use Python NLTK If you have to build a structured program: study the OpenNLP library (Java) A. Ferrari (ISTI) Requirements Engineering 88 / 88