SlideShare a Scribd company logo
Theory and Practice
of Data Cleaning
Regular Expressions: From Theory to Practice
Theory of Regular Expressions
• Base elements:
• ∅ empty set, ε empty string, and Σ alphabet of characters
• For regular expressions R, S, the following are regular expressions:
• R | S alternation
• R S concatenation
• R* Kleene star
• (R) parentheses (can be omitted with precedence rules)
• Regular languages …
• generated by regular (Type-3) grammars
• recognized (accepted) by a finite automaton
• expressed by regular expressions
Regular
Grammars
• Not very handy in practice …
• Regular expressions to the
rescue!
[-+]?[0-9]*.?[0-9]+([eE][-+]?[0-9]+)?
Example: floating point numbers such as -0.314159265e+1
… can be generated by a right regular grammar G with
N = {S, A,B,C,D,E,F}, Σ = {0,1,2,3,4,5,6,7,8,9,+,-,.,e},
Production rules P =
Introduction to Regular Expressions (Regex)
Theory & Practice
• Theory of regular expressions:
• Brief introduction where regular expressions come from …
• Practice of regular expressions:
• What you need to know to get started with regex in practice!
• Demonstration of regular expressions
Practice of Regular Expressions
• Use case: Extract (then transform) data from text
• pi = -0.314159265e+1
• e = 0.2718281828E+1
• This regex will do the trick: [-+]?[0-9]*.?[0-9]+([eE][-+]?[0-9]+)?
• Character set [ … ] matches any single character
• Optional element ... ? matches 0 or 1 occurrence
• Range [0-9] matches any single character in this range
• (Kleene) Star ... * matches 0 or more occurrences
• Dot . matches any character (execept line breaks)
• Escape character  ... take next character literally (no special meaning)
• Capturing group (...) group multiple tokens; capture group for backreference
Beware of False Negatives and False Positives
• False Negative
• your pattern doe not match … although it should!
• you will notice this problem first (missing match results)
• Remedy: you need to “relax” the regex, so it matches the desired strings
• False Positive
• your pattern does match … although it shouldn’t!
• you might not notice this at first (false matches may occur sporadically)
• Remedy: you need to “tighten” the regex, so it matches fewer strings
(avoiding the false matches)
RegEx Matching as a Sport: RegEx Golf
https://xkcd.com/1313/
Division of Labor:
RegEx for Syntax; Code for Semantics
• Getting “the right” regex can be quite a balancing act
• … making RegEx Golf a real sport
• Even if there is a (near) exact regex solution, it might be
really difficult to get right, debug, maintain, etc.
Division of Labor:
RegEx for Syntax; Code for Semantics
• Better: allow some false positives, then use code to check the semantics
è keep regex for what they’re best: syntactic patterns
è use some code to check the semantics of the match
• Usually much better in practice
• and sometimes the only option, even in theory
• Example: 02/29/2000 . Is that a valid (even if non-standard) date?
• if (year is not divisible by 4) then (it is a common year)
else if (year is not divisible by 100) then (it is a leap year)
else if (year is not divisible by 400) then (it is a common year)
else (it is a leap year)
Character Classes
• . match any character except newline
• w d s match a word, digit, whitespace character, respectively
• W D S match a non-word, non-digit, non-whitespace character
• [abc] any of a, b, or c
• [^abc] match a character other than a, b, or c
• [a-g] match a character between a, b, …, g
Anchors
• ^abc match abc at the start of the string
• abc$ match abc at the end of the string
• xyzb match xyz at a word boundary
• xyzB match xyz if not at a word boundary
Escaped Characters
• . *  escaped special characters
• t n r match a tab, linefeed, carriage return
• u00A9 unicode escaped ©
Groups
• ([0-9]+)s*([a-z]+) two capture group s
• 1 backreference to group #1
• 2 1 first group #2, then #1 (simple palindrome)
Using Groups for Transformations
• Groups and backreferences are often used in transformations
• (d{2})/(d{2})/(d{4}) three capture groups for MM/DD/YYYY
• $3-$1-$2 insert captured results as: YYYY-MM-DD
•
• Use for example in Python, OpenRefine, …
Summary Regular Expressions
• Powerful language for pattern matching, extraction, transformation
• Roots in computer science theory (formal languages)
• Widely used in practice and may “save the day”
• Data extraction, Data transformation è Data quality assessment & cleaning
• … acquired taste... addictive ... special powers
https://xkcd.com/208/

More Related Content

Similar to Week-2: Theory & Practice of Data Cleaning: Regular Expressions in Practice

Regular expressions
Regular expressionsRegular expressions
Regular expressions
Thomas Langston
 
Regular Expressions 101
Regular Expressions 101Regular Expressions 101
Regular Expressions 101
Raj Rajandran
 
Introduction to Regular Expressions RootsTech 2013
Introduction to Regular Expressions RootsTech 2013Introduction to Regular Expressions RootsTech 2013
Introduction to Regular Expressions RootsTech 2013
Ben Brumfield
 
Python - Lecture 7
Python - Lecture 7Python - Lecture 7
Python - Lecture 7
Ravi Kiran Khareedi
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
Ignaz Wanders
 
Python advanced 2. regular expression in python
Python advanced 2. regular expression in pythonPython advanced 2. regular expression in python
Python advanced 2. regular expression in python
John(Qiang) Zhang
 
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdfFUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
Bryan Alejos
 
Regular Expressions grep and egrep
Regular Expressions grep and egrepRegular Expressions grep and egrep
Regular Expressions grep and egrep
Tri Truong
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
Ravinder Singla
 
Introduction to Regular Expressions
Introduction to Regular ExpressionsIntroduction to Regular Expressions
Introduction to Regular Expressions
Jesse Anderson
 
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekingeBioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
Prof. Wim Van Criekinge
 
P3 2017 python_regexes
P3 2017 python_regexesP3 2017 python_regexes
P3 2017 python_regexes
Prof. Wim Van Criekinge
 
Regex 101
Regex 101Regex 101
Regular Expressions
Regular ExpressionsRegular Expressions
Regular Expressions
Akhil Kaushik
 
An Introduction to Regular expressions
An Introduction to Regular expressionsAn Introduction to Regular expressions
An Introduction to Regular expressions
Yamagata Europe
 
Introduction To Regex in Lasso 8.5
Introduction To Regex in Lasso 8.5Introduction To Regex in Lasso 8.5
Introduction To Regex in Lasso 8.5
bilcorry
 
Tries data structures
Tries data structuresTries data structures
Tries data structures
Jothi Ramasamy
 
Basta mastering regex power
Basta mastering regex powerBasta mastering regex power
Basta mastering regex power
Max Kleiner
 
Don't Fear the Regex - CapitalCamp/GovDays 2014
Don't Fear the Regex - CapitalCamp/GovDays 2014Don't Fear the Regex - CapitalCamp/GovDays 2014
Don't Fear the Regex - CapitalCamp/GovDays 2014
Sandy Smith
 
Regular Expressions and You
Regular Expressions and YouRegular Expressions and You
Regular Expressions and You
James Armes
 

Similar to Week-2: Theory & Practice of Data Cleaning: Regular Expressions in Practice (20)

Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Regular Expressions 101
Regular Expressions 101Regular Expressions 101
Regular Expressions 101
 
Introduction to Regular Expressions RootsTech 2013
Introduction to Regular Expressions RootsTech 2013Introduction to Regular Expressions RootsTech 2013
Introduction to Regular Expressions RootsTech 2013
 
Python - Lecture 7
Python - Lecture 7Python - Lecture 7
Python - Lecture 7
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Python advanced 2. regular expression in python
Python advanced 2. regular expression in pythonPython advanced 2. regular expression in python
Python advanced 2. regular expression in python
 
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdfFUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
FUNDAMENTALS OF REGULAR EXPRESSION (RegEX).pdf
 
Regular Expressions grep and egrep
Regular Expressions grep and egrepRegular Expressions grep and egrep
Regular Expressions grep and egrep
 
Regular expressions
Regular expressionsRegular expressions
Regular expressions
 
Introduction to Regular Expressions
Introduction to Regular ExpressionsIntroduction to Regular Expressions
Introduction to Regular Expressions
 
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekingeBioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
Bioinformatics p2-p3-perl-regexes v2013-wim_vancriekinge
 
P3 2017 python_regexes
P3 2017 python_regexesP3 2017 python_regexes
P3 2017 python_regexes
 
Regex 101
Regex 101Regex 101
Regex 101
 
Regular Expressions
Regular ExpressionsRegular Expressions
Regular Expressions
 
An Introduction to Regular expressions
An Introduction to Regular expressionsAn Introduction to Regular expressions
An Introduction to Regular expressions
 
Introduction To Regex in Lasso 8.5
Introduction To Regex in Lasso 8.5Introduction To Regex in Lasso 8.5
Introduction To Regex in Lasso 8.5
 
Tries data structures
Tries data structuresTries data structures
Tries data structures
 
Basta mastering regex power
Basta mastering regex powerBasta mastering regex power
Basta mastering regex power
 
Don't Fear the Regex - CapitalCamp/GovDays 2014
Don't Fear the Regex - CapitalCamp/GovDays 2014Don't Fear the Regex - CapitalCamp/GovDays 2014
Don't Fear the Regex - CapitalCamp/GovDays 2014
 
Regular Expressions and You
Regular Expressions and YouRegular Expressions and You
Regular Expressions and You
 

More from Bertram Ludäscher

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Bertram Ludäscher
 
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
Bertram Ludäscher
 
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Bertram Ludäscher
 
[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules
Bertram Ludäscher
 
[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules
Bertram Ludäscher
 
Answering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsAnswering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query Patterns
Bertram Ludäscher
 
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Bertram Ludäscher
 
Which Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueWhich Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A Dialogue
Bertram Ludäscher
 
From Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science TalesFrom Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science Tales
Bertram Ludäscher
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science Tales
Bertram Ludäscher
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Bertram Ludäscher
 
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseDeduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Bertram Ludäscher
 
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
Bertram Ludäscher
 
Dissecting Reproducibility: A case study with ecological niche models in th...
Dissecting Reproducibility:  A case study with ecological niche models  in th...Dissecting Reproducibility:  A case study with ecological niche models  in th...
Dissecting Reproducibility: A case study with ecological niche models in th...
Bertram Ludäscher
 
Incremental Recomputation: Those who cannot remember the past are condemned ...
Incremental Recomputation:  Those who cannot remember the past are condemned ...Incremental Recomputation:  Those who cannot remember the past are condemned ...
Incremental Recomputation: Those who cannot remember the past are condemned ...
Bertram Ludäscher
 
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency AnnotationsValidation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
Bertram Ludäscher
 
An ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflowsAn ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflows
Bertram Ludäscher
 
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses ApproachKnowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Bertram Ludäscher
 
Whole-Tale: The Experience of Research
Whole-Tale: The Experience of ResearchWhole-Tale: The Experience of Research
Whole-Tale: The Experience of Research
Bertram Ludäscher
 
ETC & Authors in the Driver's Seat
ETC & Authors in the Driver's SeatETC & Authors in the Driver's Seat
ETC & Authors in the Driver's Seat
Bertram Ludäscher
 

More from Bertram Ludäscher (20)

Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family ReunionGames, Queries, and Argumentation Frameworks: Time for a Family Reunion
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion
 
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
Games, Queries, and Argumentation Frameworks: Time for a Family Reunion!
 
[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules[Flashback] Integration of Active and Deductive Database Rules
[Flashback] Integration of Active and Deductive Database Rules
 
[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules[Flashback] Statelog: Integration of Active & Deductive Database Rules
[Flashback] Statelog: Integration of Active & Deductive Database Rules
 
Answering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query PatternsAnswering More Questions with Provenance and Query Patterns
Answering More Questions with Provenance and Query Patterns
 
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?Computational Reproducibility vs. Transparency: Is It FAIR Enough?
Computational Reproducibility vs. Transparency: Is It FAIR Enough?
 
Which Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A DialogueWhich Model Does Not Belong: A Dialogue
Which Model Does Not Belong: A Dialogue
 
From Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science TalesFrom Workflows to Transparent Research Objects and Reproducible Science Tales
From Workflows to Transparent Research Objects and Reproducible Science Tales
 
From Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science TalesFrom Research Objects to Reproducible Science Tales
From Research Objects to Reproducible Science Tales
 
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of UsPossible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
Possible Worlds Explorer: Datalog & Answer Set Programming for the Rest of Us
 
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine ZeitreiseDeduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
Deduktive Datenbanken & Logische Programme: Eine kleine Zeitreise
 
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
[Flashback 2005] Managing Scientific Data: From Data Integration to Scientifi...
 
Dissecting Reproducibility: A case study with ecological niche models in th...
Dissecting Reproducibility:  A case study with ecological niche models  in th...Dissecting Reproducibility:  A case study with ecological niche models  in th...
Dissecting Reproducibility: A case study with ecological niche models in th...
 
Incremental Recomputation: Those who cannot remember the past are condemned ...
Incremental Recomputation:  Those who cannot remember the past are condemned ...Incremental Recomputation:  Those who cannot remember the past are condemned ...
Incremental Recomputation: Those who cannot remember the past are condemned ...
 
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency AnnotationsValidation and Inference of Schema-Level Workflow Data-Dependency Annotations
Validation and Inference of Schema-Level Workflow Data-Dependency Annotations
 
An ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflowsAn ontology-driven framework for data transformation in scientific workflows
An ontology-driven framework for data transformation in scientific workflows
 
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses ApproachKnowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
Knowledge Representation & Reasoning and the Hierarchy-of-Hypotheses Approach
 
Whole-Tale: The Experience of Research
Whole-Tale: The Experience of ResearchWhole-Tale: The Experience of Research
Whole-Tale: The Experience of Research
 
ETC & Authors in the Driver's Seat
ETC & Authors in the Driver's SeatETC & Authors in the Driver's Seat
ETC & Authors in the Driver's Seat
 

Recently uploaded

Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
roli9797
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
fkyes25
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
aqzctr7x
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
g4dpvqap0
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
AndrzejJarynowski
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
GetInData
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
vikram sood
 

Recently uploaded (20)

Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Natural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptxNatural Language Processing (NLP), RAG and its applications .pptx
Natural Language Processing (NLP), RAG and its applications .pptx
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理一比一原版(UO毕业证)渥太华大学毕业证如何办理
一比一原版(UO毕业证)渥太华大学毕业证如何办理
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
一比一原版(Glasgow毕业证书)格拉斯哥大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
Intelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicineIntelligence supported media monitoring in veterinary medicine
Intelligence supported media monitoring in veterinary medicine
 
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfEnhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdf
 
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
 

Week-2: Theory & Practice of Data Cleaning: Regular Expressions in Practice

  • 1. Theory and Practice of Data Cleaning Regular Expressions: From Theory to Practice
  • 2. Theory of Regular Expressions • Base elements: • ∅ empty set, ε empty string, and Σ alphabet of characters • For regular expressions R, S, the following are regular expressions: • R | S alternation • R S concatenation • R* Kleene star • (R) parentheses (can be omitted with precedence rules) • Regular languages … • generated by regular (Type-3) grammars • recognized (accepted) by a finite automaton • expressed by regular expressions
  • 3. Regular Grammars • Not very handy in practice … • Regular expressions to the rescue! [-+]?[0-9]*.?[0-9]+([eE][-+]?[0-9]+)? Example: floating point numbers such as -0.314159265e+1 … can be generated by a right regular grammar G with N = {S, A,B,C,D,E,F}, Σ = {0,1,2,3,4,5,6,7,8,9,+,-,.,e}, Production rules P =
  • 4. Introduction to Regular Expressions (Regex) Theory & Practice • Theory of regular expressions: • Brief introduction where regular expressions come from … • Practice of regular expressions: • What you need to know to get started with regex in practice! • Demonstration of regular expressions
  • 5. Practice of Regular Expressions • Use case: Extract (then transform) data from text • pi = -0.314159265e+1 • e = 0.2718281828E+1 • This regex will do the trick: [-+]?[0-9]*.?[0-9]+([eE][-+]?[0-9]+)? • Character set [ … ] matches any single character • Optional element ... ? matches 0 or 1 occurrence • Range [0-9] matches any single character in this range • (Kleene) Star ... * matches 0 or more occurrences • Dot . matches any character (execept line breaks) • Escape character ... take next character literally (no special meaning) • Capturing group (...) group multiple tokens; capture group for backreference
  • 6. Beware of False Negatives and False Positives • False Negative • your pattern doe not match … although it should! • you will notice this problem first (missing match results) • Remedy: you need to “relax” the regex, so it matches the desired strings • False Positive • your pattern does match … although it shouldn’t! • you might not notice this at first (false matches may occur sporadically) • Remedy: you need to “tighten” the regex, so it matches fewer strings (avoiding the false matches)
  • 7. RegEx Matching as a Sport: RegEx Golf https://xkcd.com/1313/
  • 8. Division of Labor: RegEx for Syntax; Code for Semantics • Getting “the right” regex can be quite a balancing act • … making RegEx Golf a real sport • Even if there is a (near) exact regex solution, it might be really difficult to get right, debug, maintain, etc.
  • 9. Division of Labor: RegEx for Syntax; Code for Semantics • Better: allow some false positives, then use code to check the semantics è keep regex for what they’re best: syntactic patterns è use some code to check the semantics of the match • Usually much better in practice • and sometimes the only option, even in theory • Example: 02/29/2000 . Is that a valid (even if non-standard) date? • if (year is not divisible by 4) then (it is a common year) else if (year is not divisible by 100) then (it is a leap year) else if (year is not divisible by 400) then (it is a common year) else (it is a leap year)
  • 10. Character Classes • . match any character except newline • w d s match a word, digit, whitespace character, respectively • W D S match a non-word, non-digit, non-whitespace character • [abc] any of a, b, or c • [^abc] match a character other than a, b, or c • [a-g] match a character between a, b, …, g
  • 11. Anchors • ^abc match abc at the start of the string • abc$ match abc at the end of the string • xyzb match xyz at a word boundary • xyzB match xyz if not at a word boundary
  • 12. Escaped Characters • . * escaped special characters • t n r match a tab, linefeed, carriage return • u00A9 unicode escaped ©
  • 13. Groups • ([0-9]+)s*([a-z]+) two capture group s • 1 backreference to group #1 • 2 1 first group #2, then #1 (simple palindrome)
  • 14. Using Groups for Transformations • Groups and backreferences are often used in transformations • (d{2})/(d{2})/(d{4}) three capture groups for MM/DD/YYYY • $3-$1-$2 insert captured results as: YYYY-MM-DD • • Use for example in Python, OpenRefine, …
  • 15. Summary Regular Expressions • Powerful language for pattern matching, extraction, transformation • Roots in computer science theory (formal languages) • Widely used in practice and may “save the day” • Data extraction, Data transformation è Data quality assessment & cleaning • … acquired taste... addictive ... special powers https://xkcd.com/208/