Regular Expressions
& Regular Languages
slideshare: http://www.slideshare.net/marinasantini1/regular-expressions-and-regular-languages
Mathematics for Language Technology
http://stp.lingfil.uu.se/~matsd/uv/uv15/mfst/
Last Updated 6 March 2015
Marina Santini
santinim@stp.lingfil.uu.se
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Spring 2015
1
Acknowledgements
 Several slides borrowed from Jurafsky and Martin (2009).
 Practical activities by Mats Dahllöf and Jurafsky and Martin (2009).
2
Reading
 Required Reading:
  E&G (2013): Ch. 9 (pp. 252-256)
  Compendium (3): 7.2, 7.3, 8.2.3
  Mats Dahllöf: Reguljära uttryck
•  http://stp.lingfil.uu.se/~matsd/uv/uv14/mfst/dok/oh6.pdf
 Further Reading:
  Chapter 2 in Jurafsky D. & Martin J. (2009) Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition. Online draft version: http://stp.lingfil.uu.se/~santinim/ml/2014/JurafskyMartinSpeechAndLanguageProcessing2ed_draft%202007.pdf
3
Outline
 Regular Expressions
 Regular Languages
 Practical Activities
 (Pumping Lemma)
4
5
Regular Expressions
Definitions
Equivalence to Finite Automata
6
Regular Expressions and Text Searching
 Everybody does it
  Emacs, vi, perl, grep, etc.
 Regular expressions are a compact
textual representation of a set of strings
representing a language.
7
Example
 Find all the instances of the word “the”
in a text.
  /the/
  /[tT]he/
  /\b[tT]he\b/
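The three patterns can be tried out with Python's standard `re` module. This is only a sketch: the sample sentence is made up for illustration, and `\b` is Python's word-boundary anchor.

```python
import re

# Made-up sample sentence (an assumption for illustration only).
text = "The other day the cat sat there, then left the house."

patterns = [
    r"the",           # misses "The"; also hits "other", "there", "then"
    r"[tT]he",        # finds "The", but still hits the look-alikes
    r"\b[tT]he\b",    # whole word only
]

# Number of matches each pattern finds in the sample sentence.
counts = {p: len(re.findall(p, text)) for p in patterns}

for p in patterns:
    print(f"/{p}/ -> {counts[p]} matches")
```

On this sentence the counts are 5, 6 and 3: only the last pattern finds exactly the three occurrences of the word.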
8
Errors
 The process we just went through was
based on fixing two kinds of errors
  Matching strings that we should not have
matched (there, then, other)
•  False positives (Type I)
  Not matching things that we should have
matched (The)
•  False negatives (Type II)
9
Errors
 Reducing the error rate for an application
often involves two antagonistic efforts:
  Increasing accuracy, or precision, (minimizing
false positives)
  Increasing coverage, or recall, (minimizing
false negatives).
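Precision and recall can be computed for each pattern against a small gold standard. The sketch below assumes a made-up sentence and a hypothetical helper `precision_recall`; the gold standard is obtained by tokenising the sentence and keeping the tokens equal to "the".

```python
import re

# Made-up sample sentence (an assumption for illustration only).
text = "The other day the cat sat there, then left the house."

# Gold standard: start offsets of tokens that really are the word "the".
gold = {m.start() for m in re.finditer(r"[A-Za-z]+", text)
        if m.group().lower() == "the"}

def precision_recall(pattern: str) -> tuple[float, float]:
    """Score a pattern's matches against the gold-standard offsets."""
    found = {m.start() for m in re.finditer(pattern, text)}
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall

for pattern in (r"the", r"[tT]he", r"\b[tT]he\b"):
    p, r = precision_recall(pattern)
    print(f"/{pattern}/  precision={p:.2f}  recall={r:.2f}")
```

Moving from /the/ to /[tT]he/ raises recall (no more missed "The"), and adding word boundaries raises precision (no more "other", "there", "then").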
10
REs: What are they?
 Regular expressions describe
languages by an algebra.
Link: https://www.youtube.com/watch?v=eOfMcdeyrMU
11
DFA
12
Converting the regular expression
(a|b)* to a DFA
13
Converting the regular expression
(a*|b*)* to a DFA
14
Converting the regular expression
ab(a|b)* to a DFA
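One possible DFA for ab(a|b)* can be written down as a transition table. This is a sketch, not necessarily the automaton drawn on the slide: the state numbering and the names `AB_TRANSITIONS` and `ab_accepts` are assumptions made for illustration.

```python
# States: 0 = start, 1 = read "a", 2 = read "ab" (accepting), 3 = dead state.
AB_TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 3,
    (1, "a"): 3, (1, "b"): 2,
    (2, "a"): 2, (2, "b"): 2,
    (3, "a"): 3, (3, "b"): 3,
}
AB_ACCEPTING = {2}

def ab_accepts(s: str) -> bool:
    """Run the DFA on s; symbols outside {a, b} cause rejection."""
    state = 0
    for ch in s:
        if (state, ch) not in AB_TRANSITIONS:
            return False
        state = AB_TRANSITIONS[(state, ch)]
    return state in AB_ACCEPTING

for s in ["ab", "abba", "", "a", "ba"]:
    print(repr(s), ab_accepts(s))
```

Every state has exactly one outgoing transition per symbol, which is what makes the automaton deterministic.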
15
Remember Jeff Ullman video?
16
17
Operations on Languages
 REs use three operations:
  union
  concatenation
  Kleene star (*) [cleany star]
Union ∪ (aka: disjunction, OR, |, +)
 The union of languages is the usual
thing, since languages are sets.
 Example: {01,111,10}∪{00, 01} =
{01,111,10,00}.
18
01 happens to be in both
sets, so it will be once in the
union
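Since languages are sets of strings, the union example maps directly onto Python's built-in set type (a minimal sketch; the variable names are arbitrary):

```python
L = {"01", "111", "10"}
M = {"00", "01"}

# Set union: "01" is in both languages but appears only once in the result.
union = L | M

print(sorted(union))
```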
19
Concatenation: represented by juxtaposition (no punctuation)
or middle dot ( · )
 The concatenation of languages
L and M is denoted LM.
 It contains every string wx such
that w is in L and x is in M.
 Example: {01,111,10}{00, 01}
= {0100, 0101, 11100, 11101,
1000, 1001}. In the example, we take 01 from the first language,
and we concatenate it with 00 in the second language.
That gives us 0100.
We then take 01 from the first language again, and we
concatenate it with 01 in the second language, and that
gives us 0101.
Then we take 111 from the first language and we
concatenate it with 00 in the second language, and
this gives us 11100
…. and so on.
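The same pairing of every w in L with every x in M can be sketched with a set comprehension (the helper name `concat` is an assumption, not notation from the slides):

```python
def concat(L: set[str], M: set[str]) -> set[str]:
    """Concatenation LM: every string wx with w in L and x in M."""
    return {w + x for w in L for x in M}

L = {"01", "111", "10"}
M = {"00", "01"}
LM = concat(L, M)

print(sorted(LM))
```

Here all six combinations are distinct; in general LM can have fewer than |L|·|M| strings, because different pairs may produce the same string.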
20
Kleene Star: represented by an asterisk
aka star (*)
 If L is a language, then L*, the Kleene
star or just “star,” is the set of strings
formed by concatenating zero or more
strings from L, in any order.
 L* = {ε} ∪ L ∪ LL ∪ LLL ∪ …
 Example: {0,10}* = {ε, 0, 10, 00, 010,
100, 1010,…}
If you take no strings from L, that would give you the empty string.
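L* is infinite whenever L contains a non-empty string, so a program can only list it up to some length bound. The sketch below does that; the function name `star_up_to` and the bound are assumptions made for illustration.

```python
def star_up_to(L: set[str], max_len: int) -> set[str]:
    """All strings of L* whose length is at most max_len."""
    result = {""}      # taking zero strings from L gives the empty string
    frontier = {""}
    while frontier:
        # Extend every newly found string by one more string from L.
        frontier = {w + x for w in frontier for x in L
                    if len(w + x) <= max_len} - result
        result |= frontier
    return result

print(sorted(star_up_to({"0", "10"}, 3), key=lambda s: (len(s), s)))
```

With the bound 3 this yields ε, 0, 00, 10, 000, 010 and 100; raising the bound to 4 adds strings such as 1010.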
IMPORTANT!
 FROM NOW ON, LET’S STICK TO THE
FOLLOWING CONVENTIONS (OTHERWISE WE
WILL BE CONFUSED):
  Union ∪ (aka: disjunction, OR) represented by: | or +
  Concatenation: represented by juxtaposition (= no
punctuation) or middle dot ( · )
  Kleene Star: represented by *
21
22
Precedence of Operators
 Parentheses may be used wherever
needed to influence the grouping of
operators.
 Order of precedence is * (highest), then
concatenation, then + (lowest).
Remember: + = union/disjunction
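The precedence rules can be checked in Python's `re` module, which follows the same order (star, then concatenation, then union). Note that Python writes union as `|`, not `+`; the helper `language` and the candidate list are assumptions made for illustration.

```python
import re

def language(pattern: str, candidates: list[str]) -> set[str]:
    """The candidates that the pattern matches in full."""
    return {s for s in candidates if re.fullmatch(pattern, s)}

candidates = ["0", "1", "00", "01", "10", "010", "0101", "0111"]

print(language(r"01|0", candidates))    # union binds loosest: (01)|(0)
print(language(r"0(1|0)", candidates))  # parentheses force the union first
print(language(r"01*", candidates))     # star binds tightest: 0(1*)
print(language(r"(01)*", candidates))   # parentheses extend the star over 01
```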
23
Examples: REs
1.  L(01) = {01}.
2.  L(01+0) = {01, 0}.
3.  L(0(1+0)) = {01, 00}.
  Note order of precedence of
operators.
4.  L(0*) = {ε, 0, 00, 000,… }.
5.  L((0+10)*(ε+1)) = all strings
of 0s and 1s without two
consecutive 1s.
1) The regular expression 01 represents the
concatenation of the language consisting of one
string, 0 and the language consisting of one string, 1.
The result is the language containing the one string
01.
2) The language of 01+0 is the union of the language
containing only string 01 and the language containing
only string 0.
3) The language of 0 concatenated with 1+0 is the
two strings 01 and 00. Notice that we need
parentheses to force the + to group first. Without
them, since concatenation takes precedence over +,
we get the interpretation in the second example.
4) The language of 0* is the star of the language
containing only the string 0. This is all strings of 0’s,
including the empty string.
5) This example denotes the language with all strings
of 0s and 1s without two consecutive 1s. To see why
this works, in every such string, each 1 is either
followed immediately by a 0, or it comes at the end of
the string. (0+10)* denotes all strings in which every
1 is followed by a 0. These strings are surely in the
language we want. But we also want these strings
followed by a final 1. Thus, we concatenate the
language of (0+10)* with epsilon+1. This
concatenation gives us all the strings where 1s are
followed by 0s, plus all those strings with an
additional 1 at the end.
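The claim in example 5 can be tested by brute force: enumerate all short binary strings and compare the regular expression with the direct condition "does not contain 11". This is a sketch; in Python syntax the union is `|` and ε+1 is written `1?`, and the length bound 10 is an arbitrary choice.

```python
import re
from itertools import product

# (0+10)*(ε+1) in Python syntax.
pattern = re.compile(r"(0|10)*1?")

def all_binary_strings(max_len: int):
    """Yield every string over {0, 1} of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

# Strings on which the RE and the direct condition disagree.
mismatches = [s for s in all_binary_strings(10)
              if bool(pattern.fullmatch(s)) != ("11" not in s)]

print("mismatches:", mismatches)
```

An empty list of mismatches is evidence for the claim up to the chosen length, not a proof; the argument in the notes above covers all lengths.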
24
Equivalence of REs and Finite
Automata
 For every RE, there is a finite automaton
that accepts the same language.
 And we need to show that for every finite
automaton, there is a RE defining its
language.
25
Summary
Automata and regular expressions define
exactly the same set of languages: the
regular languages.
REGULAR LANGUAGES
26
27
The Chomsky Hierarchy
Regular (DFA) ⊂ Context-free (PDA) ⊂ Context-sensitive (LBA) ⊂ Recursively enumerable (TM)
•  Hierarchy of classes of formal languages
One grammar is of greater generative power or complexity than another if
it can define a language that the other cannot define. Context-free grammars
are more powerful than regular grammars.
28
Regular Languages
 A language L is regular if it is the
language accepted by some DFA.
  Note: the DFA must accept only the strings
in L, no others.
 Some languages are not regular.
Only languages that meet the following criteria
are regular languages:
29
  Regular languages derive their name from the fact that the
strings they contain are (in a formal computer science sense)
“regular.”
  This implies that there are certain kinds of strings that it will be
very hard, if not impossible, to recognize with regular
expressions, especially nested syntactic structures in natural
language.
30
Formal languages vs regular
languages
 A formal language is a set of strings,
each string composed of symbols from
a finite set called an alphabet.
  Ex: {a, b, !}
 Formal languages are not the same as
regular languages….
31
32
But Many Languages are Regular
 They appear in many contexts and have
many useful properties.
How to tell if a language is not regular
 The most common way to prove that a
language is regular is to build a regular
expression for the language.
 To prove that a language is not regular, the usual tool is the pumping lemma.
33
Pumping Lemma
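For reference, the standard textbook statement of the pumping lemma for regular languages is given below. The symbols p, w, x, y, z are the conventional ones and are not taken from the slides.

```latex
\text{If } L \text{ is regular, then there is a } p \ge 1 \text{ such that every } w \in L \text{ with } |w| \ge p
\text{ can be written as } w = xyz \text{ with}
\quad |y| \ge 1, \qquad |xy| \le p, \qquad x y^{i} z \in L \ \text{ for all } i \ge 0 .
```

A typical use is to show that {aⁿbⁿ | n ≥ 0} is not regular: for w = aᵖbᵖ the part y lies entirely within the leading a's, so pumping it changes the number of a's but not the number of b's. This is the formal counterpart of the remark that nested structures are hard for regular expressions.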
34
Practical Activity 1
 The language L contains all strings over the alphabet {a,b} that begin with a and end with b, ie:
 Write a regular expression that defines the language L.
35
Practical Activity 1:
Possible Solution
36
Your Solutions
37
In between the concatenation of a
and b there must be 0 or more
unions (disjunctions) of a and b.
Reference: slides 17-22
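Read as a regular expression, the description above gives a(a|b)*b. The sketch below checks this candidate exhaustively against the direct definition of L on all strings up to length 8. The names `candidate` and `in_L` and the length bound are assumptions, and other equivalent solutions exist.

```python
import re
from itertools import product

# Candidate solution: a, then any mix of a and b, then b.
candidate = re.compile(r"a(a|b)*b")

def in_L(s: str) -> bool:
    """Direct definition of L: begins with a and ends with b."""
    return s.startswith("a") and s.endswith("b")

# Compare both definitions on all strings over {a, b} up to length 8.
disagreements = [
    "".join(chars)
    for n in range(9)
    for chars in product("ab", repeat=n)
    if bool(candidate.fullmatch("".join(chars))) != in_L("".join(chars))
]

print("disagreements:", disagreements)
```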
Practical Activity 2
 Draw a deterministic finite-state automaton
that accepts the following regular expression:
38
( (ab) | c)*
Alternative notation style:
ie: 0 or more occurrences of
the disjunction ab | c
Test the automaton with these strings
(not all of them are in the language):
0
abc
a
ab
cccabc
cbacccabababccc
….
Practical Activity 2:
Possible Correct Solution
39
Having the initial state as a final state gives us the empty string as an element in the language.
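One possible automaton for ((ab)|c)* can be simulated as follows. This is a sketch that need not match the diagram on the slide: the state numbering and names are assumptions, missing transitions stand for an implicit dead state, and the test string "0" is read here as the empty string.

```python
# State 0: start and accepting. State 1: just read "a", now expecting "b".
ABC_TRANSITIONS = {
    (0, "a"): 1,
    (0, "c"): 0,
    (1, "b"): 0,
}
ABC_START = 0
ABC_ACCEPTING = {0}

def abc_accepts(s: str) -> bool:
    """Run the automaton; an undefined transition means rejection."""
    state = ABC_START
    for ch in s:
        state = ABC_TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ABC_ACCEPTING

test_strings = ["", "abc", "a", "ab", "cccabc", "cbacccabababccc"]
results = {s: abc_accepts(s) for s in test_strings}

for s, ok in results.items():
    print(repr(s), "accepted" if ok else "rejected")
```

With this automaton the empty string, abc, ab and cccabc are accepted, while a (no b after the a) and cbacccabababccc (a b directly after c) are rejected.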
Your solutions (1): when we interpret ”+” as
disjunction, these solutions are wrong because
”c” happens only after ”a” and ”b”…
40
Test these automata with the string on slide 35
Your solutions (2): same as
previous slide. In addition, here no
final states are shown…
41
Test these automata with the string on slide 35
Practical Activity 3
  Construct a grep regular expression that
matches patterns containing at least one
“ab” followed by any number of bs.
  Construct a grep regular expression that
matches any number between 1000 and
9999.
42
Practical Activity 3:
Possible Solutions
  grep -E '(ab)+b*'
  [1-9][0-9]{3}
43
Exercises: E&G (2013)
 Exercise (Övning) 9.40
 Optional: as many as you can
 After having completed the exercises, check out the solutions at the end of the book.
44
The End
45


  • 1. Regular Expressions & Regular Languages. Slideshare: http://www.slideshare.net/marinasantini1/regular-expressions-and-regular-languages ; Mathematics for Language Technology: http://stp.lingfil.uu.se/~matsd/uv/uv15/mfst/ ; Last updated 6 March 2015. Marina Santini, santinim@stp.lingfil.uu.se, Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden. Spring 2015.
  • 2. Acknowledgements: Several slides borrowed from Jurafsky and Martin (2009). Practical activities by Mats Dahllöf and Jurafsky and Martin (2009).
  • 3. Reading
      • Required reading: E&G (2013): Ch. 9 (pp. 252-256); Compendium (3): 7.2, 7.3, 8.2.3; Mats Dahllöf: Reguljära uttryck (Regular expressions), http://stp.lingfil.uu.se/~matsd/uv/uv14/mfst/dok/oh6.pdf
      • Further reading: Chapter 2 in Jurafsky D. & Martin J. (2009) Speech and Language Processing: An introduction to natural language processing, computational linguistics, and speech recognition. Online draft version: http://stp.lingfil.uu.se/~santinim/ml/2014/JurafskyMartinSpeechAndLanguageProcessing2ed_draft%202007.pdf
  • 6. Regular Expressions and Text Searching: Everybody does it (Emacs, vi, perl, grep, etc.). Regular expressions are a compact textual representation of a set of strings representing a language.
  • 7. Example: Find all the instances of the word “the” in a text.
      • /the/
      • /[tT]he/
      • /\b[tT]he\b/
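
The three patterns can be compared side by side. The following is a minimal sketch using Python's `re` module; the sample sentence is a made-up illustration, not taken from the slides. It shows how each refinement changes what is matched.

```python
import re

# Hypothetical sample sentence containing "The", "the", and look-alike words.
text = "The other cat saw the dog there, then left."

def matches(pattern, s):
    """Return every non-overlapping match of pattern in s."""
    return re.findall(pattern, s)

for pattern in (r"the", r"[tT]he", r"\b[tT]he\b"):
    # /the/ misses "The" and also hits "other", "there", "then";
    # /[tT]he/ finds "The" but still hits the look-alikes;
    # /\b[tT]he\b/ keeps only the standalone word.
    print(f"/{pattern}/ ->", matches(pattern, text))
```

Only the last pattern, with word boundaries on both sides, returns just the two standalone occurrences of the word.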
  • 8. Errors: The process we just went through was based on fixing two kinds of errors:
      • Matching strings that we should not have matched (there, then, other): false positives (Type I)
      • Not matching things that we should have matched (The): false negatives (Type II)
  • 9. Errors: Reducing the error rate for an application often involves two antagonistic efforts:
      • Increasing accuracy, or precision (minimizing false positives)
      • Increasing coverage, or recall (minimizing false negatives)
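
Precision and recall can be computed directly for each candidate pattern. This is a small sketch under stated assumptions: the token list and the gold set of "correct" positions are invented for illustration, and `precision_recall` is a hypothetical helper name.

```python
import re

def precision_recall(pattern, tokens, gold):
    """Score a pattern against gold, the set of token indices that should match."""
    predicted = {i for i, tok in enumerate(tokens) if re.search(pattern, tok)}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall

# Hypothetical data: only tokens 0 and 4 are real instances of the word "the".
tokens = ["The", "other", "cat", "saw", "the", "dog", "there"]
gold = {0, 4}

for pattern in (r"the", r"[tT]he", r"\b[tT]he\b"):
    print(pattern, precision_recall(pattern, tokens, gold))
```

On this toy data, widening /the/ to /[tT]he/ raises recall, and adding the word boundaries then raises precision.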
  • 10. REs: What are they? Regular expressions describe languages by an algebra.
  • 13. Converting the regular expression (a|b)* to a DFA
  • 14. Converting the regular expression (a*|b*)* to a DFA
  • 15. Converting the regular expression ab(a|b)* to a DFA
  • 16. Remember the Jeff Ullman video?
  • 17. Operations on Languages: REs use three operations:
      • union
      • concatenation
      • Kleene star (*), pronounced “cleany star”
  • 18. Union ∪ (aka disjunction, OR, |, +): The union of languages is the usual set union, since languages are sets. Example: {01, 111, 10} ∪ {00, 01} = {01, 111, 10, 00}. The string 01 happens to be in both sets, so it appears only once in the union.
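
Since languages are sets of strings, the union example can be reproduced with ordinary Python sets. This is a minimal sketch; the variable names `L` and `M` are chosen here for illustration.

```python
# The two finite languages from the slide, as Python sets of strings.
L = {"01", "111", "10"}
M = {"00", "01"}

# Set union: the shared string "01" appears only once in the result.
union = L | M
print(sorted(union))
```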
  • 19. Concatenation, represented by juxtaposition (no punctuation) or a middle dot ( · ): The concatenation of languages L and M is denoted LM. It contains every string wx such that w is in L and x is in M. Example: {01, 111, 10}{00, 01} = {0100, 0101, 11100, 11101, 1000, 1001}. In the example, we take 01 from the first language and concatenate it with 00 from the second language, which gives 0100. We then take 01 from the first language again and concatenate it with 01 from the second language, which gives 0101. Then we take 111 from the first language and concatenate it with 00 from the second language, which gives 11100, and so on.
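
The definition of concatenation translates into a one-line set comprehension. The sketch below uses a hypothetical helper `concat` and the two languages from the example.

```python
def concat(L, M):
    """Concatenation LM: every string wx with w in L and x in M."""
    return {w + x for w in L for x in M}

L = {"01", "111", "10"}
M = {"00", "01"}

product_language = concat(L, M)
print(sorted(product_language))
```

Note that concatenation is not commutative: `concat(M, L)` generally yields a different language from `concat(L, M)`.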
  • 20. Kleene Star, represented by an asterisk, aka star (*): If L is a language, then L*, the Kleene star or just “star,” is the set of strings formed by concatenating zero or more strings from L, in any order. L* = {ε} ∪ L ∪ LL ∪ LLL ∪ … Example: {0, 10}* = {ε, 0, 10, 00, 010, 100, 1010, …}. If you take no strings from L, that gives you the empty string ε.
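
L* is infinite whenever L contains a non-empty string, so code can only enumerate a finite slice of it. The sketch below lists all strings of L* up to a chosen length; `star_up_to` is a hypothetical helper, and the length bound is an assumption added to make the computation terminate.

```python
def star_up_to(L, max_len):
    """Return all strings of L* whose length is at most max_len."""
    result = {""}          # zero strings from L gives the empty string
    frontier = {""}
    while frontier:
        # Extend every newly found string by one more string from L.
        new = {w + x for w in frontier for x in L if len(w + x) <= max_len}
        new -= result
        result |= new
        frontier = new
    return result

print(sorted(star_up_to({"0", "10"}, 3), key=lambda s: (len(s), s)))
```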
  • 21. IMPORTANT! FROM NOW ON, LET’S STICK TO THE FOLLOWING CONVENTIONS (OTHERWISE WE WILL BE CONFUSED):
      • Union ∪ (aka disjunction, OR): represented by | or +
      • Concatenation: represented by juxtaposition (= no punctuation) or a middle dot ( · )
      • Kleene star: represented by *
  • 22. Precedence of Operators: Parentheses may be used wherever needed to influence the grouping of operators. The order of precedence is * (highest), then concatenation, then + (lowest). Remember: + here means union/disjunction.
  • 23. Examples: REs
      1. L(01) = {01}.
      2. L(01+0) = {01, 0}.
      3. L(0(1+0)) = {01, 00}. Note the order of precedence of operators.
      4. L(0*) = {ε, 0, 00, 000, …}.
      5. L((0+10)*(ε+1)) = all strings of 0s and 1s without two consecutive 1s.
    Notes on the examples:
      1) The regular expression 01 represents the concatenation of the language consisting of the one string 0 and the language consisting of the one string 1. The result is the language containing the one string 01.
      2) The language of 01+0 is the union of the language containing only the string 01 and the language containing only the string 0.
      3) The language of 0 concatenated with 1+0 is the two strings 01 and 00. Notice that we need parentheses to force the + to group first. Without them, since concatenation takes precedence over +, we get the interpretation in the second example.
      4) The language of 0* is the star of the language containing only the string 0. This is all strings of 0s, including the empty string.
      5) This example denotes the language of all strings of 0s and 1s without two consecutive 1s. To see why this works, note that in every such string, each 1 is either followed immediately by a 0 or comes at the end of the string. (0+10)* denotes all strings in which every 1 is followed by a 0. These strings are surely in the language we want. But we also want these strings followed by a final 1. Thus, we concatenate the language of (0+10)* with ε+1. This concatenation gives us all the strings where 1s are followed by 0s, plus all those strings with an additional 1 at the end.
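
Examples 2, 3, and 5 can be checked by brute force. The sketch below translates the textbook notation into Python's `re` syntax, which is an assumption of this illustration: textbook `+` becomes `|`, and `ε+1` becomes the alternation `(?:|1)` with an empty first branch. The helpers `binary_strings` and `language` are hypothetical names, and the check only covers strings up to a fixed length.

```python
import re
from itertools import product

def binary_strings(max_len):
    """Yield every string over {0, 1} of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def language(pattern, max_len):
    """Strings up to max_len that the pattern matches in full."""
    regex = re.compile(pattern)
    return {s for s in binary_strings(max_len) if regex.fullmatch(s)}

# Example 2 versus example 3: the parentheses change the grouping.
print(language(r"01|0", 3))        # textbook 01+0
print(language(r"0(?:1|0)", 3))    # textbook 0(1+0)

# Example 5: (0+10)*(ε+1) should match exactly the strings with no "11".
RE5 = re.compile(r"(?:0|10)*(?:|1)")
mismatches = [s for s in binary_strings(8)
              if bool(RE5.fullmatch(s)) != ("11" not in s)]
print(mismatches)
```

An empty `mismatches` list means the expression and the description "no two consecutive 1s" agree on every string up to length 8; it is a sanity check, not a proof.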
  • 24. Equivalence of REs and Finite Automata: For every RE, there is a finite automaton that accepts the same language. Conversely, we need to show that for every finite automaton, there is an RE defining its language.
  • 25. Summary: Automata and regular expressions define exactly the same class of languages, the regular languages.
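
The equivalence can be illustrated on one of the expressions from the conversion slides, ab(a|b)*. The DFA below is a hand-built sketch with illustrative state names; it is an assumption of this example and not necessarily the automaton drawn on the slides. Missing transitions lead to an implicit dead state.

```python
import re
from itertools import product

# Hypothetical DFA for ab(a|b)*: q0 --a--> q1 --b--> q2, and q2 loops on a and b.
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",
    ("q2", "a"): "q2",
    ("q2", "b"): "q2",
}
START, ACCEPTING = "q0", {"q2"}

def dfa_accepts(s):
    """Run the DFA on s; any undefined transition goes to the dead state."""
    state = START
    for ch in s:
        state = TRANSITIONS.get((state, ch), "dead")
    return state in ACCEPTING

REGEX = re.compile(r"ab(?:a|b)*")

# Compare the DFA and the regular expression on all strings up to length 7.
disagreements = [
    "".join(chars)
    for n in range(8)
    for chars in product("ab", repeat=n)
    if dfa_accepts("".join(chars)) != bool(REGEX.fullmatch("".join(chars)))
]
print(disagreements)
```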
  • 27. The Chomsky Hierarchy, a hierarchy of classes of formal languages: Regular (DFA), Context-free (PDA), Context-sensitive (LBA), Recursively enumerable (TM). One grammar is of greater generative power or complexity than another if it can define a language that the other cannot define. Context-free grammars are more powerful than regular grammars.
  • 28. Regular Languages: A language L is regular if it is the language accepted by some DFA. Note: the DFA must accept only the strings in L, no others. Some languages are not regular.
  • 29. Only languages that meet the following criteria are regular languages:
  • 30. Regular languages derive their name from the fact that the strings they recognize are (in a formal computer science sense) “regular.” This implies that there are certain kinds of strings that will be very hard, if not impossible, to recognize with regular expressions, especially nested syntactic structures in natural language.
  • 31. Formal languages vs regular languages: A formal language is a set of strings, each string composed of symbols from a finite set called an alphabet. Ex: {a, b, !}. Formal languages are not the same as regular languages…
  • 32. But Many Languages are Regular: They appear in many contexts and have many useful properties.
  • 33. How to tell if a language is not regular: The most common way to prove that a language is regular is to build a regular expression for the language.
  • 35. Practical Activity 1: The language L contains all strings over the alphabet {a,b} that begin with a and end with b. Write a regular expression that defines the language L.
  • 37. Your Solutions: Between the initial a and the final b there must be zero or more occurrences of the union (disjunction) of a and b. Reference: slides 17-22.
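
The description above corresponds to the expression a(a|b)*b. The students' solutions themselves are not reproduced in this transcript, so the following is only a sketch of one candidate answer, checked by brute force against the definition of L for all strings up to a fixed length.

```python
import re
from itertools import product

# Candidate answer: an initial a, zero or more a's or b's, then a final b.
CANDIDATE = re.compile(r"a(?:a|b)*b")

def in_L(s):
    """Definition of L: begins with a and ends with b."""
    return s.startswith("a") and s.endswith("b")

counterexamples = [
    "".join(chars)
    for n in range(8)
    for chars in product("ab", repeat=n)
    if bool(CANDIDATE.fullmatch("".join(chars))) != in_L("".join(chars))
]
print(counterexamples)
```

The shortest string in L is "ab": a single character cannot both begin with a and end with b.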
  • 38. Practical Activity 2: Draw a deterministic finite-state automaton that accepts the language of the following regular expression: ((ab) | c)*, ie: 0 or more occurrences of the disjunction ab | c. Test the automaton with these legal strings in the language: 0, abc, a, ab, cccabc, cbacccabababccc, …
  • 39. Practical Activity 2: Possible Correct Solution. Having the initial state as a final state gives us the empty string as an element of the language.
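
One possible automaton for ((ab)|c)* can be written as a transition table and run on the test strings from the activity. This is a sketch under stated assumptions: the two-state DFA and its state names are illustrative and may differ from the diagram on the slide, undefined transitions are treated as rejection (an implicit dead state), and the "0" in the list of test strings is assumed to stand for the empty string.

```python
import re

# Hypothetical DFA: s0 is both the start state and the only accepting state.
TRANSITIONS = {
    ("s0", "c"): "s0",   # a lone c returns to the start
    ("s0", "a"): "s1",   # an a must be completed by a b
    ("s1", "b"): "s0",
}
START = "s0"
ACCEPTING = {"s0"}

def accepts(s):
    """Run the DFA on s; an undefined transition means the string is rejected."""
    state = START
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING

TEST_STRINGS = ["", "abc", "a", "ab", "cccabc", "cbacccabababccc"]
results = {s: accepts(s) for s in TEST_STRINGS}
for s, ok in results.items():
    print(repr(s), ok)
```

Under this reading of the expression, the strings a and cbacccabababccc are rejected: a lacks its closing b, and in cbacccabababccc a b follows a c directly. The other four strings are accepted, and the empty string is accepted precisely because the start state is final.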
  • 40. Your solutions (1): when we interpret “+” as disjunction, these solutions are wrong because “c” happens only after “a” and “b”… Test these automata with the string on slide 35.
  • 41. Your solutions (2): same as previous slide. In addition, here no final states are shown… Test these automata with the string on slide 35.
  • 42. Practical Activity 3
      • Construct a grep regular expression that matches patterns containing at least one “ab” followed by any number of bs.
      • Construct a grep regular expression that matches any number between 1000 and 9999.
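  • 43. Practical Activity 3: Possible Solutions
      • grep -E '(ab)+b*' (the -E flag selects extended syntax, which the unescaped parentheses, + and {3} rely on)
      • grep -E '[1-9][0-9]{3}'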
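
The two patterns can be tried out in Python, whose `re` syntax agrees with extended grep syntax for these particular constructs. The helper names below are hypothetical. The sketch also illustrates one caveat: grep selects a line if the pattern matches anywhere in it, so the unanchored number pattern also fires inside longer digit runs.

```python
import re

AB_PATTERN = re.compile(r"(ab)+b*")
NUMBER_PATTERN = re.compile(r"[1-9][0-9]{3}")

def line_matches(pattern, line):
    """Mimic grep: a line is selected if the pattern matches anywhere in it."""
    return pattern.search(line) is not None

def is_four_digit_number(token):
    """Stricter check: the whole token must be a number from 1000 to 9999."""
    return NUMBER_PATTERN.fullmatch(token) is not None

print(line_matches(AB_PATTERN, "xxabbb"))      # contains "ab" followed by bs
print(line_matches(NUMBER_PATTERN, "12345"))   # matches the substring "1234"
print(is_four_digit_number("12345"))           # whole-token check rejects it
```

If only standalone numbers should match, the pattern needs anchoring, for example with word boundaries as in the earlier “the” example.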
  • 44. Exercises: E&G (2013): Exercise (Övning) 9.40. Optional: as many as you can. After having completed the exercises, check out the solutions at the end of the book.