Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems!
Laura Chiticariu, Yunyao Li, Frederick Reiss
IBM Research - Almaden
THE DISCONNECT: ACADEMIC vs. INDUSTRY
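[Figure: Implementations of Entity Extraction. Stacked bars (0% to 100%) by approach.]
• NLP Papers (2003-2012): Rule-Based 3.5%, Hybrid 21%, Machine Learning Based 75%
• Commercial Products (2013), All Vendors: Rule-Based 45%, Hybrid 22%, Machine Learning Based 33%
• Commercial Products (2013), Large Vendors: Rule-Based 67%, Hybrid 17%, Machine Learning Based 17%
[Figure: Entity Extraction Papers by Year. x-axis: Year of Publication; y-axis: Fraction of NLP Papers; series: Rule-Based, Hybrid, Machine Learning Based.]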
THE EXPLANATIONS
Rule-based IE
PROs
• Declarative
• Easy to comprehend
• Easy to maintain
• Easy to incorporate domain knowledge
• Easy to debug
CONs
• Heuristic
• Requires tedious manual labor
ML-based IE
PROs
• Trainable
• Adaptable
• Reduces manual effort
CONs
• Requires labeled data
• Requires retraining for domain adaptation
• Requires ML expertise to use or maintain
• Opaque
Academia
Evaluating Benefits of IE
• Evaluating IE on its own
• Precision and Recall
Evaluating Costs of IE
• Labor cost of writing rules
Industry
Evaluating Benefits of IE
• Evaluating IE as part of a larger process
• Using ill-defined metrics that are subject to change
Evaluating Costs of IE
• Labor cost
• Hardware cost
• Business risk
• Others
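Precision and recall, the measures named above for evaluating IE on its own, can be illustrated with a minimal sketch. It assumes exact-match scoring over (begin, end, type) spans; the function name and the sample spans are hypothetical and do not come from the poster.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over sets of (begin, end, type) spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: fraction of extracted spans that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of gold spans that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold annotations and extractor output for one document.
gold = {(0, 3, "Org"), (20, 27, "Location"), (40, 52, "Person")}
predicted = {(0, 3, "Org"), (20, 27, "Location"), (30, 35, "Person"), (60, 64, "Org")}

precision, recall = precision_recall(predicted, gold)
```

Here two of the four extracted spans are correct and two of the three gold spans are found. Scores of this kind measure the extractor in isolation, which is the academic view listed above; they say nothing about hardware cost or business risk.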
What’s the research in
Rule-based IE?
BRIDGING THE GAP
Where is the research in rule-based IE? Making it more principled, effective, and efficient
Define a standard IE rule language and data model.
• What is the right data model to capture text, annotations over text, and their properties?
• Can we establish a standard declarative extensible rule language to solve most IE tasks encountered so far?
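To make the two questions above concrete, here is a minimal sketch of one possible data model (character-offset annotations with properties) and two declarative rule primitives (a regular expression and a dictionary). It is an illustration under assumptions, not the standard language the poster calls for: `Annotation`, `regex_rule`, `dictionary_rule`, and the sample rules are all hypothetical names.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A labeled span over the document text, with free-form properties."""
    begin: int  # character offset, inclusive
    end: int    # character offset, exclusive
    label: str
    properties: dict = field(default_factory=dict)

    def text(self, document: str) -> str:
        return document[self.begin:self.end]


def regex_rule(label, pattern):
    """Build a rule that annotates every match of a regular expression."""
    compiled = re.compile(pattern)

    def apply(document):
        return [Annotation(m.start(), m.end(), label, {"source": "regex"})
                for m in compiled.finditer(document)]
    return apply


def dictionary_rule(label, entries):
    """Build a rule that annotates every occurrence of a dictionary entry."""
    # Longest entries first, so "IBM Research" wins over "IBM".
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(entry) for entry in ordered)
    compiled = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)

    def apply(document):
        return [Annotation(m.start(), m.end(), label, {"source": "dictionary"})
                for m in compiled.finditer(document)]
    return apply


# The rule set is plain data: it states what to extract, not how to scan.
RULES = [
    regex_rule("PhoneNumber", r"\b\d{3}-\d{3}-\d{4}\b"),
    dictionary_rule("Organization", ["IBM", "IBM Research"]),
]


def extract(document, rules=RULES):
    annotations = [a for rule in rules for a in rule(document)]
    return sorted(annotations, key=lambda a: (a.begin, a.end))


document = "Call IBM Research at 408-555-0100."
results = extract(document)
```

Keeping rules as data separate from the `extract` loop is the property that would let a system optimize execution without changing what the rules mean.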
Systems research based on a standard IE rule language.
• Data representation
• Automatic performance optimization
• Exploring modern hardware …
ML research based on a standard IE rule language.
• How to learn basic primitives such as regular expressions and dictionaries?
• How to automatically generate rules that are understandable and maintainable?
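As one hedged illustration of learning a basic primitive, the sketch below induces a dictionary from labeled examples by keeping tokens that occur inside labeled mentions often enough and reliably enough. This is a simple frequency heuristic chosen for illustration, not a method proposed on the poster; `learn_dictionary`, its thresholds, and the sample data are hypothetical.

```python
from collections import Counter


def learn_dictionary(examples, min_precision=0.8, min_support=2):
    """Induce dictionary entries from (tokens, labeled_positions) examples.

    A token is kept if it appears inside labeled mentions at least
    min_support times and at least min_precision of its occurrences
    fall inside labeled mentions.
    """
    inside = Counter()  # occurrences inside a labeled mention
    total = Counter()   # all occurrences
    for tokens, labeled_positions in examples:
        for position, token in enumerate(tokens):
            key = token.lower()
            total[key] += 1
            if position in labeled_positions:
                inside[key] += 1
    return sorted(
        key for key, count in inside.items()
        if count >= min_support and count / total[key] >= min_precision
    )


# Hypothetical training data: token lists with positions labeled as Location.
examples = [
    (["Visit", "Almaden", "today"], {1}),
    (["Almaden", "is", "in", "San", "Jose"], {0, 3, 4}),
    (["San", "Jose", "is", "sunny", "today"], {0, 1}),
    (["Jose", "called"], set()),
]

learned = learn_dictionary(examples)
```

The learned result is a plain word list that a rule developer can read, edit, and maintain. In this sample, "jose" is dropped because one of its three occurrences falls outside a labeled mention.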