SEER: Auto-Generating Information Extraction Rules from User-Specified Examples

Maeda F. Hanafi, Azza Abouzied, Laura Chiticariu, Yunyao Li

Outline
• Motivation
• SEER
• Demo
• Technical Overview
• Experimental Results
• Conclusion/Future Works

What is Information Extraction (IE)?
FBI Press Release Dataset
Document | Crime Changes
January2016.html | offenses increase of 5.3 percent
January2016.html | violent crimes in the US decreased 0.6 percent
January2016.html | arson crimes decreased 1.1 percent
July2016.html | property crimes rose 4.5 percent
July2016.html | 5 percent rise in race-based hate crimes
July2016.html | offenses decreased 1.6 percent
January2015.html | arson crimes increased 0.3 percent
January2015.html | offenses rose 1.6 percent
January2015.html | property crimes rose 7.0 percent

Existing Works in Information Extraction
Hand-written extraction rules or scripts
• Cons:
  • Labor-intensive
  • Skill-dependent
• Pros:
  • Debuggable results
  • Customizable rules
  • Often used by industry
Supervised machine learning models
• Cons:
  • Require large labeled datasets
  • Require retraining for domain adaptation
  • Non-debuggable results
• Pros:
  • Quick to generate results
  • Widely researched

Why did we build SEER?
SEER
• Pros:
  • Quick to generate results/rules
  • Debuggable results
  • Customizable rules
• Additional pros:
  • No programming knowledge required
  • Easily encodes domain knowledge; mimics how a human builds rules
[Figure: SEER sits between ML and PL, using heuristics designed by observing developers]

Goal and Scope
• SEER
  • Helps end users build rules that specify and extract fixed patterns of text
  • Automates the process of manually building rules
  • Reduces human errors in constructing rules

Applications of IE
• Social Media Analytics
  Extract user ratings from Yelp reviews.
• Financial Analytics
  Extract key people and their relationships with major US companies from SEC data.
• Healthcare Analytics
  Extract data from health forms to understand the side effects of certain drugs.

Visual Annotation Query Language (VAQL) Rules
A rule = sequence of primitives; primitives extract 1+ tokens
1) Pre-built: P: Percentage (matches "5.4 percent")
2) Literal: L: ‘percent’ (matches "percent")
3) Dictionary: D: {percent, %} (matches "%")
4) Regex: R: [A-Za-z]+ (matches "percent")
5) Token gap: T: 0-1 (matches any token)
Rule: P: Integer T: 0-2 L: ‘percent’ (matches "5.4 percent")
"5.4 percent" has 4 tokens: 5 . 4 percent
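To make the example rule concrete, here is a minimal sketch of how a sequence of primitives could be matched against tokens. It is not VAQL's actual engine: the tokenizer, the tuple encoding of primitives, and the function names (`tokenize`, `match_at`, `extract`) are all assumptions made for illustration, and only the three primitive kinds used by the rule above are handled.

```python
import re

def tokenize(text):
    # Assumed tokenizer: runs of letters, runs of digits, or single symbols.
    return re.findall(r"[A-Za-z]+|[0-9]+|[^\sA-Za-z0-9]", text)

# The slide's rule, encoded as (kind, value) pairs (hypothetical encoding).
RULE = [("P", "Integer"), ("T", (0, 2)), ("L", "percent")]

def matches(kind, value, token):
    # Single-token primitives only; other kinds are out of scope for this sketch.
    if kind == "P" and value == "Integer":
        return token.isdigit()
    if kind == "L":
        return token == value
    raise ValueError(f"unsupported primitive kind: {kind}")

def match_at(rule, tokens, i):
    """Return the end index of a match of `rule` starting at tokens[i], else None."""
    if not rule:
        return i
    (kind, value), rest = rule[0], rule[1:]
    if kind == "T":
        # A token gap skips between lo and hi arbitrary tokens.
        lo, hi = value
        for skip in range(lo, hi + 1):
            if i + skip <= len(tokens):
                end = match_at(rest, tokens, i + skip)
                if end is not None:
                    return end
        return None
    if i < len(tokens) and matches(kind, value, tokens[i]):
        return match_at(rest, tokens, i + 1)
    return None

def extract(rule, text):
    """Collect all non-overlapping token spans that the rule matches."""
    tokens = tokenize(text)
    spans, i = [], 0
    while i < len(tokens):
        end = match_at(rule, tokens, i)
        if end is not None and end > i:
            spans.append(tokens[i:end])
            i = end
        else:
            i += 1
    return spans
```

With this tokenizer, `extract(RULE, "offenses rose 5.4 percent")` yields one span made of the four tokens 5, ., 4, percent: the integer primitive takes "5", the token gap skips ". 4", and the literal takes "percent".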

How do we learn rules?
1. Enumerate possible primitives per example token
2. Assign scores to primitives
3. Generate rules in a tree for each example
4. Intersect trees to find the set of rules that extract all the given examples
   • Maintain disjoint trees for non-intersectable examples.
5. Prune, rank, and suggest rules to user

1) Enumerate possible primitives per token
Example: 5 percent up
Tokens and their primitives:
• 5: P: Integer, P: Number, L: ‘5’, R: [0-9]+
• 5 percent: P: Percentage
• percent: L: ‘percent’, R: [A-Za-z]+
• up: L: ‘up’, R: [A-Za-z]+, T: 0-1
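The enumeration step can be sketched as a lookup of every primitive that accepts a given token. The catalogs below (`REGEX_CLASSES`, `PREBUILTS`) and the function name are assumptions for illustration, not SEER's actual catalogs; token gaps and multi-token pre-builts such as P: Percentage are left out to keep the sketch short.

```python
import re

# Hypothetical catalogs; the real system's regex classes and pre-builts may differ.
REGEX_CLASSES = [r"[0-9]+", r"[A-Za-z]+"]
PREBUILTS = {"Integer": r"[0-9]+", r"Number": r"[0-9]+(\.[0-9]+)?"}

def enumerate_primitives(token):
    """List every single-token primitive that would extract `token`."""
    prims = [f"L: '{token}'"]  # a literal always matches its own token
    for pattern in REGEX_CLASSES:
        if re.fullmatch(pattern, token):
            prims.append(f"R: {pattern}")
    for name, pattern in PREBUILTS.items():
        if re.fullmatch(pattern, token):
            prims.append(f"P: {name}")
    return prims

for token in ["5", "percent", "up"]:
    print(token, "->", enumerate_primitives(token))
```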

2) Assign scores to primitives
[Figure: primitive types ordered by ≺ along a score scale from 1.0 to 0.0]
For semantic tokens, e.g. Dubai with P: City, L: ‘Dubai’, T: 0-1:
Pre-builts ≺ Dictionary, Literal, Token gap, Regex
For syntactic tokens, e.g. - with L: ‘-’, T: 0-1:
Dictionary, Literal, Token gap, Regex

3) Generate rules in a tree
Example: 5 percent
Tokens: 5, percent
Tree:
• P: Percentage = 1.0
• L: ‘5’ = 0.4
  • L: ‘percent’ = 0.4
  • R: [A-Za-z]+ = 0.2
• R: [0-9]+ = 0.2
  • L: ‘percent’ = 0.4
  • R: [A-Za-z]+ = 0.2
Rule (one path through the tree): R: [0-9]+ = 0.2, L: ‘percent’ = 0.4
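Each root-to-leaf path in the tree is one candidate rule. As a rough sketch, the paths for "5 percent" can be listed by combining one primitive per token, plus any primitive that covers the whole example by itself. The flat list-of-lists representation and the name `candidate_rules` are assumptions; the slides describe a tree, which this sketch only enumerates rather than builds.

```python
from itertools import product

# Scored primitives per token for "5 percent" (scores taken from the slide).
per_token = [
    [("L: '5'", 0.4), ("R: [0-9]+", 0.2)],
    [("L: 'percent'", 0.4), ("R: [A-Za-z]+", 0.2)],
]
# Primitives that cover the whole example on their own.
whole_span = [("P: Percentage", 1.0)]

def candidate_rules(per_token, whole_span):
    """Return every rule: one primitive per token, or a single whole-span primitive."""
    rules = [[prim] for prim in whole_span]
    rules += [list(path) for path in product(*per_token)]
    return rules

rules = candidate_rules(per_token, whole_span)
for rule in rules:
    print(" ".join(f"{name} = {score}" for name, score in rule))
```

This prints five candidate rules, including the one highlighted on the slide (R: [0-9]+ followed by L: 'percent').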

4) Intersect trees
Example: 5 percent
• P: Percentage = 1.0
• L: ‘5’ = 0.4
  • L: ‘percent’ = 0.4
  • R: [A-Za-z]+ = 0.2
• R: [0-9]+ = 0.2
  • L: ‘percent’ = 0.4
  • R: [A-Za-z]+ = 0.2
Example: 6%
• P: Percentage = 1.0
• L: ‘6’ = 0.4
  • L: ‘%’ = 0.4
  • R: symbols = 0.2
• R: [0-9]+ = 0.2
  • L: ‘%’ = 0.4
  • R: symbols = 0.2
Intersect: 5 percent, 6%
• P: Percentage = 1.0
• D: {5, 6} = 0.4
  • D: {percent, %} = 0.4
• R: [0-9]+ = 0.2
  • D: {percent, %} = 0.4
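The intersection on the slide suggests two behaviours: a primitive shared by both examples survives unchanged, and two different literals at the same position merge into a dictionary. The sketch below implements just that, position by position, on flat primitive lists. It is an assumption-laden simplification (string-encoded primitives, hypothetical helper names, no tree structure, no scores), not the algorithm as implemented in SEER.

```python
def parse_literal(prim):
    # "L: '5'" -> "5"; returns None for non-literal primitives.
    if prim.startswith("L: '") and prim.endswith("'"):
        return prim[4:-1]
    return None

def intersect_position(prims_a, prims_b):
    """Intersect the primitives enumerated for one token position of two examples."""
    # Primitives present in both examples are kept as they are.
    common = [p for p in prims_a if p in prims_b]
    lits_a = [parse_literal(p) for p in prims_a if parse_literal(p) is not None]
    lits_b = [parse_literal(p) for p in prims_b if parse_literal(p) is not None]
    # Differing literals are merged into a dictionary primitive.
    for a in lits_a:
        for b in lits_b:
            if a != b:
                common.append("D: {" + ", ".join([a, b]) + "}")
    return common

ex1 = [["L: '5'", "R: [0-9]+"], ["L: 'percent'", "R: [A-Za-z]+"]]  # 5 percent
ex2 = [["L: '6'", "R: [0-9]+"], ["L: '%'", "R: symbols"]]          # 6%
merged = [intersect_position(a, b) for a, b in zip(ex1, ex2)]
print(merged)
```

The first position keeps R: [0-9]+ and gains D: {5, 6}; the second has no shared primitive, so only D: {percent, %} remains. P: Percentage covers both examples as a whole and would carry over unchanged.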

5) Prune, rank, and suggest rules to user
Examples: 5 percent, 6%
Suggested Rules:
• P: Percentage = 1.0 (rule score = 1.0)
• D: {5, 6} = 0.4, D: {percent, %} = 0.4 (rule score = 0.4)
• R: [0-9]+ = 0.2, D: {percent, %} = 0.4 (rule score = 0.3)
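The rule scores on the slide (1.0, 0.4, 0.3) are consistent with taking the mean of each rule's primitive scores. The slides do not state the formula, so the averaging below is an assumption; the sketch ranks the three suggested rules under it.

```python
from statistics import mean

# Suggested rules with per-primitive scores, as listed on the slide.
suggested = [
    [("R: [0-9]+", 0.2), ("D: {percent, %}", 0.4)],
    [("P: Percentage", 1.0)],
    [("D: {5, 6}", 0.4), ("D: {percent, %}", 0.4)],
]

def rule_score(rule):
    # Assumed scoring: the mean of the primitive scores in the rule.
    return mean(score for _, score in rule)

ranked = sorted(suggested, key=rule_score, reverse=True)
scores = [round(rule_score(rule), 2) for rule in ranked]
for rule, score in zip(ranked, scores):
    print(score, [name for name, _ in rule])
```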

Experiments
Do users build better rules in a shorter amount of time with SEER than without it?

Experiment Setup
• 13 participants
• All participants used both VINERy and SEER in random order
  • VINERy: IBM’s drag-and-drop interface for building rules in VAQL manually.
  • Users were trained for each tool
• Participants built rules to complete information extraction tasks on different datasets
• The order of tools and datasets was randomized to minimize learning effects
• Measured: duration, precision, recall

Experiment Tasks
• IE tasks were designed for novice users to complete in 10 minutes
Task Type | Dataset | Task | Example
Easy: one primitive | IBM | Extract all currency amounts | $4.5 million
Easy: one primitive | FBI | Extract all percentages | 4.5 percent
Medium: multiple primitives | IBM | Extract cash flow and Original Equipment Management revenues | cash flow of $4 million
Medium: multiple primitives | FBI | Extract all percentage increases or decreases in offenses | offenses rose 4.4 percent
Hard: disjunction of two rules | IBM | Extract all yearly quarter phrases | fourth-quarter 2004
Hard: disjunction of two rules | FBI | Extract all population ranges | 10 to 10,000 in population

Users build better rules faster with SEER than without it
• Results with SEER were better in:
  • Duration of rule creation
  • Correctness of rule extractions
• Black lines: mean duration of completing a task
• Gray boxes: 95% confidence intervals of the mean
• Higher F1 scores indicate that a user’s rules have higher precision and recall

Summary
• SEER combines the advantages of ML and PL
• SEER: an end-to-end IE workflow
  • Synthesizes rules from user examples
  • Uses heuristics to encode domain knowledge: mimics how a human builds rules
• Experiments
  • SEER helps users build more accurate and precise rules quickly

Future Works
• Explore ways to effectively present rules to the user
• Make SEER more interactive
  • On-the-fly suggestions as users highlight an example
  • Reduce the delay between giving examples and the system suggesting rules

Appendix: Related Works
• Previous works in rule synthesis
  • Wrapper induction, e.g. RAPIER, STALKER
    • Rule language depends on learning the text surrounding the target extraction
    • Limited to structure
  • Synthesis
    • FlashExtract
      • DSL learns from the text surrounding the target extraction.
      • Doesn’t take account of text semantics
• SEER doesn’t depend on the structure of the document; it learns the actual content of the extraction

Appendix: Refinements – Conflict Resolution
• Conflicts
  • Gray out extraction if:
    • Its rules are all rejected.
    • Its rules have all been accepted.
Refinements | Covering Rules
50.5 percent | R1, R2
40 dollars | R2
25% | R1, R2, R3
R1 and R2 are rejected: Gray out “40 dollars”

Appendix: Details on Experimental Results
• Repeated Measures ANOVA
  • User programming experience had no impact on performance
  • Duration: a significant main effect of tool (F(1,41) = 19.42, p < 0.04), especially for easy and hard tasks.
  • F1: a significant main effect of tool (F(1,13) = 5.195, p = 0.04), especially for hard tasks.
• Precision: True Positives / (True Positives + False Positives)
• Recall (coverage): True Positives / (True Positives + False Negatives)
• F1 score: 2 × (Precision × Recall) / (Precision + Recall)
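The three formulas above can be computed directly from a rule's extractions and the expected extractions. This is a small sketch, assuming extractions are compared as exact strings; the function name and the sample strings are made up for illustration and are not data from the study.

```python
def precision_recall_f1(extracted, expected):
    """Compute precision, recall, and F1 from two collections of extractions."""
    extracted, expected = set(extracted), set(expected)
    tp = len(extracted & expected)   # true positives
    fp = len(extracted - expected)   # false positives
    fn = len(expected - extracted)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Illustrative strings only: 2 correct extractions, 1 spurious, 2 missed.
extracted = ["4.5 percent", "1.6 percent", "2016"]
expected = ["4.5 percent", "1.6 percent", "7.0 percent", "0.3 percent"]
precision, recall, f1 = precision_recall_f1(extracted, expected)
print(precision, recall, f1)
```

Here precision is 2/3, recall is 1/2, and F1 is 4/7.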
