SEER:
Auto-Generating Information
Extraction Rules from User-Specified
Examples
Maeda F. Hanafi, Azza
Abouzied, Laura Chiticariu,
Yunyao Li
Outline
• Motivation
• SEER
• Demo
• Technical Overview
• Experimental Results
• Conclusion/Future Work
What is Information Extraction (IE)?
FBI Press Release Dataset
Document | Crime Changes
January2016.html | offenses increase of 5.3 percent
January2016.html | violent crimes in the US decreased 0.6 percent
January2016.html | arson crimes decreased 1.1 percent
July2016.html | property crimes rose 4.5 percent
July2016.html | 5 percent rise in race-based hate crimes
July2016.html | offenses decreased 1.6 percent
January2015.html | arson crimes increased 0.3 percent
January2015.html | offenses rose 1.6 percent
January2015.html | property crimes rose 7.0 percent
Existing Works in Information Extraction
Hand-written extraction rules
or scripts
• Cons:
• Labor-intensive
• Skill-dependent
• Pros:
• Debuggable results
• Customizable rules
• Often used by industry
Supervised machine learning
models
• Cons:
• Large labeled datasets
• Requires retraining for
domain adaptation
• Non-debuggable results
• Pros:
• Quick to generate results
• Popularly researched
Why We Built SEER
SEER
• Pros:
• Quick to generate results/rules
• Debuggable results
• Customizable rules
• Additional pros:
• No programming knowledge required
• Easily encode domain knowledge;
mimics how a human builds rules
[Spectrum diagram: SEER sits between ML and PL, built on heuristics designed by observing rule developers]
Goal and Scope
• SEER
• Helps end users build rules that specify and extract
fixed patterns of text
• Automates the process of manually building rules
• Reduces human error in constructing rules
Demo
Applications of IE
• Social Media Analytics
Extract user ratings from Yelp reviews.
• Financial Analytics
Extract key people and their relationships with major US
companies from SEC data.
• Healthcare Analytics
Extract data from health forms to understand the side effects
of certain drugs.
Technical Overview
Visual Annotation Query Language (VAQL) Rules
A rule = sequence of primitives; primitives extract 1+ tokens
1) Pre-built: P: Percentage extracts “5.4 percent”
2) Literal: L: ‘percent’ extracts the exact string “percent”
3) Dictionary: D: {percent, %} extracts “percent” or “%”
4) Regex: R: [A-Za-z]+ extracts “percent”
5) Token gap: T: 0-1 matches any 0-1 tokens
Rule: P: Integer, T: 0-2, L: ‘percent’ extracts “5.4 percent”
“5.4 percent” has 4 tokens: 5 | . | 4 | percent
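A minimal sketch of how a VAQL-style rule could be matched over a pre-tokenized document. The names, the single-token pre-built library, and the matcher structure are illustrative assumptions, not IBM's actual VAQL implementation:

```python
# Illustrative sketch only: a toy matcher for VAQL-style rules over tokens.
import re

PREBUILTS = {"Integer": r"\d+"}  # stand-in for SEER's pre-built entity library

def match_prebuilt(name, tokens, i):
    # Real pre-builts can span multiple tokens; one token kept here for brevity.
    return i + 1 if re.fullmatch(PREBUILTS[name], tokens[i]) else None

def match_literal(lit, tokens, i):
    return i + 1 if tokens[i] == lit else None

def match_dictionary(entries, tokens, i):
    return i + 1 if tokens[i] in entries else None

def match_regex(pattern, tokens, i):
    return i + 1 if re.fullmatch(pattern, tokens[i]) else None

MATCHERS = {"P": match_prebuilt, "L": match_literal,
            "D": match_dictionary, "R": match_regex}

def match_rule(rule, tokens, i=0):
    """A rule is a sequence of primitives; ('T', (lo, hi)) skips lo..hi tokens."""
    if not rule:
        return True
    kind, arg = rule[0]
    if kind == "T":  # token gap: try every allowed number of skipped tokens
        return any(match_rule(rule[1:], tokens, i + k)
                   for k in range(arg[0], arg[1] + 1) if i + k <= len(tokens))
    if i >= len(tokens):
        return False
    j = MATCHERS[kind](arg, tokens, i)
    return j is not None and match_rule(rule[1:], tokens, j)

# The rule from the slide: P: Integer, T: 0-2, L: 'percent'
rule = [("P", "Integer"), ("T", (0, 2)), ("L", "percent")]
print(match_rule(rule, ["5", ".", "4", "percent"]))  # True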
How do we learn rules?
1. Enumerate possible primitives per example token
2. Assign scores to primitives
3. Generate rules in a tree for each example
4. Intersect trees to find the set of rules that extract all
the given examples
• Maintain disjoint trees for non-intersectable
examples.
5. Prune, rank, and suggest rules to user
1) Enumerate possible primitives per token
Example: 5 percent up
Tokens: 5 | percent | up
Primitives per token:
• 5: P: Number, P: Integer, L: ‘5’, R: [0-9]+
• percent: L: ‘percent’, R: [A-Za-z]+, T: 0-1
• 5 percent: P: Percentage (a pre-built can cover multiple tokens)
• up: L: ‘up’, R: [A-Za-z]+
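A runnable sketch of this enumeration step, assuming a toy pre-built and regex library (SEER's actual libraries are far richer, and multi-token pre-builts like P: Percentage are omitted for brevity):

```python
import re

PREBUILTS = {"Number": r"\d+(\.\d+)?", "Integer": r"\d+"}  # toy library
REGEX_LIB = [r"[0-9]+", r"[A-Za-z]+"]                      # toy library

def enumerate_primitives(tokens):
    """For each token, list every primitive that could have produced it."""
    options = []
    for tok in tokens:
        cands = [("L", tok), ("T", (0, 1))]  # literal and token gap always apply
        cands += [("P", name) for name, pat in PREBUILTS.items()
                  if re.fullmatch(pat, tok)]
        cands += [("R", pat) for pat in REGEX_LIB if re.fullmatch(pat, tok)]
        options.append((tok, cands))
    return options

for tok, cands in enumerate_primitives(["5", "percent", "up"]):
    print(tok, "->", cands)
# 5 -> L: '5', T: 0-1, P: Number, P: Integer, R: [0-9]+
# percent -> L: 'percent', T: 0-1, R: [A-Za-z]+   (similarly for "up")
```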
2) Assign scores to primitives
Scores range from 0.0 (least preferred) to 1.0 (most preferred).
For semantic tokens:
Token gap, Regex ≺ Dictionary, Literal ≺ Pre-builts
e.g., for “Dubai”, prefer P: City over L: ‘Dubai’ over T: 0-1
For syntactic tokens:
Dictionary, Literal ≺ Token gap, Regex
e.g., for “-”, prefer T: 0-1 over L: ‘-’
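A sketch of the two scoring functions. The semantic-side values (pre-built 1.0, literal/dictionary 0.4, regex 0.2) follow the scores shown on the next slides; the syntactic-side values and the is_semantic test are illustrative assumptions:

```python
# Semantic tokens: prefer pre-builts over literals/dictionaries over
# token gaps/regexes. Syntactic tokens: the preference flips.
SEMANTIC_SCORES  = {"P": 1.0, "L": 0.4, "D": 0.4, "R": 0.2, "T": 0.1}
SYNTACTIC_SCORES = {"T": 0.4, "R": 0.4, "L": 0.2, "D": 0.2, "P": 0.1}  # assumed

def is_semantic(token):
    # Toy test: words and numbers carry meaning; lone punctuation is syntax.
    return any(ch.isalnum() for ch in token)

def score(primitive, token):
    kind = primitive[0]  # "P", "L", "D", "R", or "T"
    table = SEMANTIC_SCORES if is_semantic(token) else SYNTACTIC_SCORES
    return table[kind]

print(score(("P", "City"), "Dubai"))   # 1.0: pre-built wins for a semantic token
print(score(("L", "Dubai"), "Dubai"))  # 0.4
print(score(("T", (0, 1)), "-"))       # 0.4: token gap wins for a syntactic token
print(score(("L", "-"), "-"))          # 0.2
```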
3) Generate rules in a tree
Example: 5 percent
Tokens: 5 | percent
Tree (one level per token; every root-to-leaf path is a rule):
• Level 1 (5): P: Percentage = 1.0 (spans both tokens), L: ‘5’ = 0.4, R: [0-9]+ = 0.2
• Level 2 (percent): L: ‘percent’ = 0.4, R: [A-Za-z]+ = 0.2
Rule from one path: R: [0-9]+ = 0.2 followed by L: ‘percent’ = 0.4
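Since every root-to-leaf path picks one primitive per token, the tree can be sketched implicitly as the cartesian product of the per-token options; scoring a rule by the mean of its primitive scores follows the speaker notes. A runnable sketch with the slide's scores:

```python
from itertools import product

# (primitive, score) options per token of "5 percent", scores as on the slide
token_options = [
    [(("L", "5"), 0.4), (("R", "[0-9]+"), 0.2)],           # token "5"
    [(("L", "percent"), 0.4), (("R", "[A-Za-z]+"), 0.2)],  # token "percent"
]
# P: Percentage spans both tokens, so it forms a one-primitive rule by itself
rules = [([("P", "Percentage")], 1.0)]

for combo in product(*token_options):        # one primitive per tree level
    prims = [p for p, _ in combo]
    rule_score = sum(s for _, s in combo) / len(combo)  # mean primitive score
    rules.append((prims, rule_score))

for prims, s in sorted(rules, key=lambda r: -r[1]):
    print(round(s, 2), prims)
# 1.0 P: Percentage; 0.4 L-L; 0.3 L-R and R-L; 0.2 R-R
```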
4) Intersect trees
Example: 5 percent
• 5: P: Percentage = 1.0, L: ‘5’ = 0.4, R: [0-9]+ = 0.2
• percent: L: ‘percent’ = 0.4, D: {percent, %} = 0.4, R: [A-Za-z]+ = 0.2
Example: 6%
• 6: P: Percentage = 1.0, L: ‘6’ = 0.4, R: [0-9]+ = 0.2
• %: L: ‘%’ = 0.4, D: {percent, %} = 0.4, R: symbols = 0.2
Intersected tree (rules extracting both “5 percent” and “6%”):
• P: Percentage = 1.0
• D: {5, 6} = 0.4 or R: [0-9]+ = 0.2, followed by D: {percent, %} = 0.4
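A sketch of level-wise intersection for two same-length examples: keep primitives that both trees share, and generalize differing literals into a dictionary (as ‘5’ and ‘6’ become D: {5, 6} above). Treating the multi-token pre-built as a level-one option and skipping token gaps are simplifications for exposition:

```python
def intersect_level(a, b):
    """Intersect the primitive options of two trees at one level."""
    merged = []
    for pa in a:
        for pb in b:
            if pa == pb:
                merged.append(pa)                # shared primitive survives
            elif pa[0] == pb[0] == "L":
                merged.append(("D", frozenset({pa[1], pb[1]})))  # generalize
    return list(dict.fromkeys(merged))           # dedupe, keep order

tree_a = [[("L", "5"), ("R", "[0-9]+"), ("P", "Percentage")],
          [("L", "percent"), ("D", frozenset({"percent", "%"}))]]
tree_b = [[("L", "6"), ("R", "[0-9]+"), ("P", "Percentage")],
          [("L", "%"), ("D", frozenset({"percent", "%"}))]]

for level_a, level_b in zip(tree_a, tree_b):
    print(intersect_level(level_a, level_b))
# level 1: D: {5, 6}, R: [0-9]+, P: Percentage
# level 2: D: {percent, %}
```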
5) Prune, rank, and suggest rules to user
Examples: 5 percent, 6%
Suggested rules, ranked by rule score (the mean of the primitive scores):
1. P: Percentage = 1.0 (rule score = 1.0)
2. D: {5, 6} = 0.4, D: {percent, %} = 0.4 (rule score = 0.4)
3. R: [0-9]+ = 0.2, D: {percent, %} = 0.4 (rule score = 0.3)
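A sketch of the ranking that drives pruning: score each surviving rule by the mean of its primitive scores and keep the top-k suggestions (the cutoff and tie-breaking here are assumptions; the paper describes the actual pruning):

```python
def rank(rules, k=3):
    """Rank rules by mean primitive score, highest first, and keep the top k."""
    scored = [(sum(s for _, s in r) / len(r), [p for p, _ in r]) for r in rules]
    return sorted(scored, key=lambda t: -t[0])[:k]

rules = [
    [(("P", "Percentage"), 1.0)],
    [(("D", "{5, 6}"), 0.4), (("D", "{percent, %}"), 0.4)],
    [(("R", "[0-9]+"), 0.2), (("D", "{percent, %}"), 0.4)],
]
for rule_score, prims in rank(rules):
    print(round(rule_score, 2), prims)
# 1.0 P: Percentage / 0.4 D, D / 0.3 R, D (matches the slide's ordering)
```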
Experiments
Do users build better rules in a shorter amount
of time with SEER than without it?
Experiment Setup
• 13 participants
• All participants used both VINERy and SEER in random
order
• VINERy: IBM’s drag-and-drop interface for building
rules in VAQL manually.
• Users were trained on each tool
• Participants built rules to complete information extraction
tasks on different datasets
• The order of tools and datasets was randomized to minimize learning effects
• Measured: duration, precision, recall
Experiment Tasks
• IE tasks were designed for novice users to complete in 10
minutes
Task Type | Dataset | Task | Example
Easy: one primitive | IBM | Extract all currency amounts | $4.5 million
Easy: one primitive | FBI | Extract all percentages | 4.5 percent
Medium: multiple primitives | IBM | Extract cash flow and Original Equipment Management revenues | cash flow of $4 million
Medium: multiple primitives | FBI | Extract all percentage increases or decreases in offenses | offenses rose 4.4 percent
Hard: disjunction of two rules | IBM | Extract all yearly quarter phrases | fourth-quarter 2004
Hard: disjunction of two rules | FBI | Extract all population ranges | 10 to 10,000 in population
Users build better rules faster with SEER than
without it
• With SEER, users created rules faster and the rules extracted more correct results
• Black lines: mean duration of completing a task
• Gray boxes: 95% confidence intervals around the mean
• Higher F1 scores indicate that users’ rules have higher precision and recall
Summary
• SEER combines the advantages of ML and PL
• SEER: an end-to-end IE workflow
• Synthesizes rules from user examples
• Uses heuristics to encode domain knowledge: mimics
how a human builds rules
• Experiments
• SEER helps users build more accurate and precise rules quickly
Future Work
• Explore ways to effectively present rules to the user
• Make SEER more interactive
• On-the-fly suggestions as users highlight an example
• Reduce the delay between giving examples and receiving suggested rules
Thank you
Questions?
VINERy
Appendix: Related Works
• Previous work in rule synthesis
• Wrapper induction, e.g., RAPIER, STALKER
• Rule language depends on learning the text surrounding the target extraction
• Limited to document structure
• Program synthesis
• FlashExtract
• Its DSL learns from the text surrounding the target extraction
• Does not take account of text semantics
• SEER does not depend on the structure of the document; it learns the actual content of the extraction
Appendix: Refinements – Conflict Resolution
• Conflicts
• Gray out an extraction if:
• Its covering rules are all rejected, or
• Its covering rules have all been accepted
Refinements | Covering Rules
50.5 percent | R1, R2
40 dollars | R2
25% | R1, R2, R3
R1 and R2 are rejected: gray out “40 dollars”
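A sketch of this gray-out check; the scenario below (only R2 rejected) is illustrative, not the slide's exact walkthrough:

```python
def grayed_out(covering, rejected, accepted):
    """Gray out an extraction whose covering rules are all rejected
    or all accepted: it can no longer help differentiate the rules."""
    return (all(r in rejected for r in covering) or
            all(r in accepted for r in covering))

covering_rules = {"50.5 percent": {"R1", "R2"},
                  "40 dollars": {"R2"},
                  "25%": {"R1", "R2", "R3"}}
rejected, accepted = {"R2"}, set()
for text, covering in covering_rules.items():
    if grayed_out(covering, rejected, accepted):
        print("gray out:", text)   # only "40 dollars": R2 is its sole rule
```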
Appendix: Details on Experimental Results
• Repeated Measures ANOVA
• User programming experience had no impact on performance
• Duration: a significant main effect of tool (F(1,41) = 19.42, p < 0.04), especially for easy and hard tasks
• F1: a significant main effect of tool (F(1,13) = 5.195, p = 0.04), especially for hard tasks
• Precision: True Positives / (True Positives + False Positives)
• Recall (coverage): True Positives / (True Positives + False Negatives)
• F1 score: 2 x (Precision x Recall)/(Precision + Recall)
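A toy worked example of these metrics (counts invented for illustration, not numbers from the study):

```python
tp, fp, fn = 8, 2, 4                      # true positives, false pos., false neg.
precision = tp / (tp + fp)                # 8 / 10 = 0.8
recall = tp / (tp + fn)                   # 8 / 12 ~= 0.667
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.8 0.667 0.727
```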
Editor's Notes
  1. Slide 1 Maeda: Hello, my name is Maeda Hanafi. I am a PhD student at New York University Abu Dhabi (NYUAD). I will be presenting work that I have been doing with my advisor Azza Abouzied at NYUAD and with Laura Chiticariu and Yunyao Li from IBM Almaden. SEER is an information extraction tool that I built to help users extract information from data.
  2. Slide 2 - Outline Maeda: I will motivate the idea behind SEER and show how SEER extracts data. I will show that our user studies prove that SEER helps users build quality rules in a short amount of time.
  3. Slide 3 - What is IE? Maeda: What is information extraction? Information extraction or IE helps automatically extract data from large collections of documents. Suppose a journalist, Diane Sawyer, wishes to analyze changes in crime rates over the years. She has a dataset of FBI press releases. She has to extract all changes in crime rates from each document … Click … to produce a table similar to this that can be easily further analyzed. She may also wish to automate this process such that if her dataset of crime press releases grows, she can easily extract crime rate data.
  4. Slide 4 - Existing Works in IE Diane has several options to automatically extract this data. We can map the existing IE techniques on a spectrum. Click. Maeda: On one end, the popular method in IE research is machine learning or ML. ML models are trainable and, given data, will adapt automatically, reducing the need for manual effort. The problem with ML, specifically supervised ML models, is that you need a large labeled dataset. Diane might have to provide a large dataset of labeled document annotations to train a classifier to extract the data she wants. A sufficiently large dataset is sometimes hard to obtain. Moreover, in the case of errors, it is hard to debug why a particular record is extracted, because many ML methods produce complex statistical models that are hard for humans to interpret. But once you have a trained model, it is quick and easy to obtain results. Click. Maeda: On the other end of the spectrum, Diane can learn a data extraction scripting language and construct her own extraction rules, or hire a developer to do so. However, this is labor-intensive because there are many scripts or rules involved. And a non-programmer, like our journalist, needs to spend time learning to program, which involves a steep learning curve. Overall, this is a time-consuming process. But the pros of using hand-written rules are that they are easily customizable and the outputs are debuggable. Because of their debuggability and maintainability, this is the preferred approach in industry. Click. Maeda: SEER, however, tries to achieve the best of both worlds.
  5. Slide 5 – Why We Built SEER Maeda: SEER maintains the benefits of the ML and PL methods. SEER quickly generates rules and results. SEER learns programs from a small number of examples and doesn’t require large datasets. The programs are customizable. SEER doesn’t require users to have a programming background. Click. Maeda: There are additional benefits. SEER gets the benefits of rule-based methods by using heuristics to encode the thinking process of how a human user would build rules. SEER uses heuristics to encode domain knowledge that would otherwise be impossible to capture with ML methods. The design behind SEER is driven and motivated by how users build rules.
  6. Slide 6 – Goal Maeda: SEER helps end users build rules that specify and extract fixed patterns of text. SEER speeds up the process of building rules from scratch and reduces the human errors involved in rule construction by suggesting rules based on user-provided examples of text extractions. SEER allows an end user who doesn’t know how to program, is not familiar with ML, or does not have a sufficiently large labeled dataset to highlight examples of text he or she wishes to extract and quickly get extraction results and extraction scripts for further data extraction.
  7. Slide 7 - Demo Maeda: I will show how our journalist, Diane, can extract data using SEER. Play video. User highlights positive examples. Maeda: Right now Diane is highlighting examples of text she wishes to extract. User clicks on the “Suggest Rules!” button. Maeda: Once the user highlights some examples, she tells SEER to learn some rules. Rules in SEER specify patterns of text to extract. On the right-hand side, the suggested rules appear. As she selects rules, she can see what those rules extract in the document itself. User highlights more positive examples to update the rules. Maeda: Diane can highlight more positive examples to update the rules to include missing extractions. She is highlighting violent crimes in addition to offenses. SEER updates the rule set to include the missing extractions. User highlights a green-highlighted text from the document as a negative example. Maeda: She notices that there is a wrong extraction. She can highlight it as a negative example. SEER will update the suggested rules so that they do not extract the wrong text. User accepts or rejects refinements. Maeda: Right now Diane has two rules left. She can analyze each rule or filter the rules. She is filtering rules by accepting or rejecting refinements. When she answers the refinements, SEER immediately figures out which of the two rules is the right rule. These refinements are computed from the first few extractions that differentiate the rules. User selects the right rule. User hovers over the export and edit buttons. Maeda: Once she finds a satisfactory rule, she can export the rules or edit them directly in the interface.
  8. Slide 3 - Applications of IE Maeda: IE is not limited to FBI datasets. For social media analytics, businesses wish to extract ratings and reviews from social media. In the financial industry, analysts want to extract the relationships between key people and major US corporations. In healthcare analytics, doctors want to extract data from health forms to understand the side effects of certain drugs. SEER can help all users who work with such text datasets extract relevant data from them.
  9. Slide 8 - Technical Overview Maeda: So how does SEER work?
  10. Slide 9 - Visual Annotation Query Language SEER suggests rules in the Visual Annotation Query Language (VAQL). VAQL is IBM’s visual programming language for IE. VAQL rules extract text from documents pre-tokenized by VAQL. For example, “5.4 percent” is tokenized into its individual tokens: 5, point, 4, and percent. Maeda: Rules in VAQL are built from sequences of primitives. Primitives extract one or more tokens. There are five primitives in VAQL. Pre-builts extract entities, such as organizations, integers, percentages, phone numbers, and much more. Literals extract the exact string, so the literal percent extracts only the word percent. Dictionaries extract texts that appear in the dictionary: a dictionary containing the word percent and the percent symbol can extract either one. For regular expressions, we have a library of regular expressions; in this example, the regex captures tokens containing letters. A token gap can extract any token: token gap 0 to 1 skips over 0 to 1 tokens. Token gaps only appear in the middle of a sequence, as in this example. This sequence of primitives, or rule, can capture percentages: it begins with an integer and skips over zero to two tokens until the next primitive match occurs, which in this case is the literal percent.
  11. Slide 10 - How do we learn rules? Maeda: How does SEER learn rules? For each example given by the user, SEER breaks the example down into its individual tokens. Then, for each token, SEER enumerates the possible primitives. Next, SEER generates candidate rules that extract the example and represents the rules as compact tree structures. Then the trees are intersected. Intersection finds the set of rules that extracts all the positive examples. Non-intersectable examples result in disjoint trees. After intersection, the rules are pruned, and the rules that aren’t pruned are suggested to the user. I will go through each step. The details of the learning algorithm are in the paper, but I will walk through an example.
  12. Slide 11 - 1) Enumerate possible primitives per token Maeda: Step 1 is to enumerate the possible primitives per token. The positive example in bold, 5 percent up, is tokenized into its individual tokens: 5, percent, and up. Then, the possible primitives are enumerated per token. 5 can be captured by two pre-builts, number and integer, as well as a literal and a regex. Percent can be captured by token gaps, along with the other types of primitives shown on the slide. 5 percent can be captured by the percentage pre-built. Note that pre-builts can capture multiple tokens.
  13. Slide 12 - 2) Assign scores to primitives Maeda: Scores are assigned to primitives in order to guide the search. Depending on the nature of the token, we apply one of two scoring functions. In the first scoring function, if the token represents a semantic entity, i.e., it has a natural meaning, we assign scores such that pre-builts are preferred to literals and dictionaries, which are preferred to token gaps and regexes. In the second scoring function, if the token represents syntax, for example a dash, a colon, or filler articles like ‘a’ or ‘the’, we assign scores such that token gaps or regexes are preferred to literals and dictionaries. Our observations of how rule developers choose primitives when constructing rules guide our scoring choices. We conducted a post-hoc study that shows a strong agreement between our scoring heuristics and the preferences of rule developers for different primitives. Based on these preferences, we map primitives to numerical scores.
  14. Slide 13 - 3) Generate rules in a tree Maeda: Finally, the primitives are combined into rules. We use a tree structure to compactly store the possible rules. A tree is generated for each example. Trees are an efficient data structure used to hold candidate rules. This is a tree for the example 5 percent. Each level in the tree belongs to a token. Traversing from root to leaf results in a rule. Here we traverse from the regex and then to the literal percent. The rule is a regex followed by the literal percent.
  15. Slide 14 - 4) Intersect trees Maeda: After a tree is generated per example, the trees are intersected. Intersection finds the set of rules that extracts all the positive examples. In this example, there are two positive examples, 5 percent and 6%, along with their individual trees. The tree on the bottom is the intersected tree. The intersected tree contains rules capturing both 5 percent and 6%. For example, the dictionary of 5 and 6 followed by the dictionary of percent and the percent symbol is a rule in the intersected tree.
  16. Slide 15 - 5) Prune, rank, and suggest rules to the user Maeda: The last step is to prune the rules. For longer examples, there may be hundreds of generated rules; we can't show all of them to the user. Pruning is based on ranking the rules by their rule score and their composing primitives. The rule score is the average of the primitive scores. The rules are presented to the user in this order. In the paper, we describe how we deal with negative examples and how we construct multiple rules if a single rule cannot capture all positive examples.
  17. Slide 16 - Experiments Maeda: I will now present the experimental results. We wanted to know: do users build more accurate and precise rules in a shorter duration with SEER than without rule suggestions? We wanted to figure out whether SEER actually helps users.
  18. Slide 17 - Experiment Setup Maeda: In our experiment to test whether rule suggestions are effective, we compared rules that were built manually versus rules that were picked from rule suggestions. VINERy is IBM’s commercial tool for manually building rules in VAQL, the same language SEER uses. All participants used both VINERy and SEER in random order. Users were trained on each tool. There were two datasets. The order of the tools and the datasets was picked randomly to minimize learning effects. We measured the duration the user took to complete each task and the precision and recall of the rules they built.
  19. Slide 18 - Experiment tasks Maeda: Participants built rules to complete information extraction tasks within a time limit of 10 minutes. Tasks were presented from easy to hard. The task complexity depended on the minimum number of rules needed to complete the task. Easy required the user to build a rule containing at least one pre-built. For example, the currency amount task, which requires extracting currency amounts, needed the currency amount pre-built, which captures all money phrases. Medium required at least one rule. Hard required at least two rules to capture the variations in the target extractions.
  20. Slide 19 - Users build better rules faster with SEER than without it Maeda: This is a chart of our results. We measured duration, precision, and recall of the rules. Users using SEER are the orange dots and users using VINERy are the blue dots. The horizontal axis is the duration. The vertical axis refers to different task complexities. The black lines in the middle indicate the mean duration. Notice that SEER users completed the tasks faster on average than users without SEER. The F1 scores are noted on the right-hand side. The F1 formula is the standard one for IE: essentially the harmonic mean of precision and recall. For all task types, the F1 score is higher with SEER than without it. The differences in task completion times are statistically significant for hard and easy tasks. The difference in F1 scores is statistically significant for hard tasks.
  21. Slide 20 - Summary Maeda: To conclude, SEER combines the advantages of machine learning and rule-based approaches. SEER is an iterative learning model for extraction rules based on a small number of examples from users. SEER is designed based on how a human user would actually build rules. Finally, our experimental results show that SEER helps users build more accurate and precise rules quickly.
  22. Slide 21 - Future Work Maeda: In our future work, we plan to study effective ways of presenting rules to the user. We plan to make SEER more interactive. Right now users highlight examples, click on the learn button, and then must wait until rules are suggested. It would be better if, as the user highlights an example, SEER builds the rules on the fly, with primitives automatically generated and suggested immediately. With this interaction, users don't have to highlight examples and wait for rules from the backend.
  23. Slide 22 - Questions? Maeda: Thank you.
  24. Whisk: handles semantics and syntax, but its language is still dependent on the text surrounding the target extraction, and the training set must contain all possible permutations. Rapier: also handles semantics; dependent on learning the delimiters/surrounding text. Stalker: loads data into a tree-like structure.