Maeda: Hello, my name is Maeda Hanafi. I am a PhD student at New York University Abu Dhabi (NYUAD). I will be presenting work that I have been doing with my advisor Azza Abouzied at NYUAD and with Laura Chiticariu and Yunyao Li from IBM Almaden. SEER is an information extraction tool that I built to help users extract information from data.
Slide 2 - Outline
Maeda: I will motivate the idea behind SEER and show how SEER extracts data. I will show that our user studies demonstrate that SEER helps users build quality rules in a short amount of time.
Slide 3 - What is IE?
Maeda: What is information extraction? Information extraction, or IE, helps automatically extract data from large collections of documents. Suppose a journalist, Diane Sawyer, wishes to analyze changes in crime rates over the years. She has a dataset of FBI press releases. She has to extract all changes in crime rates from each document …
… to produce a table similar to this that can be easily further analyzed.
She may also wish to automate this process such that if her dataset of crime press releases grows, she can easily extract crime rate data.
Slide 4 - Existing Works in IE
Diane has several options to automatically extract this data. We can map the existing IE techniques on a spectrum.
Maeda: On one end, the popular method in IE research is machine learning, or ML. ML models are trainable and, given data, will adapt automatically, reducing the need for manual effort. The problem with ML, specifically supervised ML models, is that you need a large labeled dataset. Diane might have to provide a large dataset of labeled document annotations to train a classifier to extract the data she wants. A sufficiently large dataset is sometimes hard to obtain. Moreover, when errors occur, it is hard to debug why a particular record was extracted, because many ML methods produce complex statistical models that are hard for humans to interpret. But once a model is trained, it is quick and easy to obtain results.
Maeda: On the other end of the spectrum, Diane can learn a data extraction scripting language and construct her own extraction rules, or hire a developer to do so. The problem is that this is labor intensive, because there are many scripts or rules involved. And a non-programmer, like our journalist, needs to spend time learning to program, which involves a steep learning curve. Overall, this is a time-consuming process. The pros of hand-written rules, though, are that they are easily customizable and their outputs are debuggable. Because of their debuggability and maintainability, this is the approach preferred by industry. Click.
Maeda: SEER, however, tries to achieve the best of both worlds.
Slide 5 – Why we built SEER?
Maeda: SEER retains the benefits of both the ML and PL methods. SEER quickly generates rules and results. SEER learns programs from a small number of examples and doesn't require large datasets. The programs are customizable, and SEER doesn't require users to have a programming background.
Maeda: There are additional benefits. SEER gains the benefits of rule-based methods by using heuristics that encode the thinking process of how a human user would build rules, domain knowledge that would otherwise be impossible to encode with ML methods. The design behind SEER is driven and motivated by how users build rules.
Slide 6 – Goal
Maeda: SEER helps end users build rules that specify and extract fixed patterns of text. SEER speeds up the process of building rules from scratch and reduces the human errors involved in rule construction by suggesting rules based on user-provided examples of text extractions. SEER allows an end user who doesn't know how to program, is not familiar with ML, or does not have a sufficiently large labeled dataset to highlight examples of the text he or she wishes to extract and quickly get extraction results, along with extraction scripts for further data extraction.
Slide 7 - Demo
Maeda: I will show how our journalist, Diane, can extract data using SEER.
User highlights positive examples.
Maeda: Right now Diane is highlighting examples of text she wishes to extract.
User clicking on “Suggest Rules!” button.
Maeda: Once the user highlights some examples, she tells SEER to learn some rules. Rules in SEER specify patterns of text to extract. On the right-hand side, the suggested rules appear. As she selects rules, she can see what those rules extract in the document itself.
User highlights more positive examples to update the rules.
Maeda: Diane can highlight more positive examples to update the rules to include missing extractions. She is highlighting violent crimes in addition to offenses. SEER updates the rule set to include the missing extractions.
User highlights a green-highlighted text from the document as a negative example.
Maeda: She notices that there is a wrong extraction. She can highlight it as a negative example, and SEER will update the suggested rules so that they no longer extract the wrong text.
User accepts or rejects refinements.
Maeda: Right now Diane has two rules left. She can analyze each rule or filter the rules. She is filtering rules by accepting or rejecting refinements. When she answers the refinements, SEER immediately figures out which of the two rules is the right one. These refinements are computed from the first few extractions that differentiate the rules.
User selects the right rule. User hovers over the export and edit buttons.
Maeda: Once she finds a satisfactory rule, she can export the rules or edit the rules directly in the interface.
Slide 3 - Applications of IE
Maeda: IE is not limited to FBI datasets. For social media analytics, businesses wish to extract ratings and reviews from social media. In the financial industry, analysts want to extract the relationships between key people and major US corporations. In healthcare analytics, doctors want to extract data from health forms to understand the side effects of certain drugs.
SEER can help all users who work with such text datasets extract relevant data from them.
Slide 8 - Technical Overview
Maeda: So how does SEER work?
Slide 9 - Visual Annotation Query Language
SEER suggests rules in the Visual Annotation Query Language (VAQL). VAQL is IBM's visual programming language for IE. VAQL rules extract text from documents pre-tokenized by VAQL. For example, 5.4 percent is tokenized into its individual tokens: 5, point, 4, and percent.
Maeda: Rules in VAQL are built from sequences of primitives. Primitives extract one or more tokens. There are five primitives in VAQL.
Pre-builts extract entities, such as organizations, integers, percentages, phone numbers, and more.
Literals extract an exact string. So the literal percent extracts only the word percent.
Dictionaries extract text that appears in a given dictionary. A dictionary containing the word percent and the percent symbol can extract both the word percent and the percent symbol.
Regular expressions come from a library of regular expressions. In this example, the regex captures tokens containing letters.
Token gaps can match any token. Token gap 0 to 1 skips over zero to one tokens. Token gaps only appear in the middle of a sequence.
For example, a token gap appears in this sequence. This sequence of primitives, or rule, captures percentages: it begins with an integer, then skips over zero to two tokens until the next primitive matches, which in this case is the literal percent.
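To make the sequence semantics concrete, here is a minimal Python sketch (not IBM's actual VAQL implementation; the function and primitive names are hypothetical) of matching a rule, i.e. a sequence of primitives with a token gap, against pre-tokenized text:

```python
# Hypothetical stand-ins for VAQL primitives.
def prebuilt_integer(tok):
    return tok.isdigit()

def literal(word):
    return lambda tok: tok == word

def match_rule(rule, tokens, i=0):
    """Match a sequence of primitives starting at token index i.
    A step is either a predicate over one token or a token gap
    ('gap', lo, hi) that skips lo..hi tokens until the next match."""
    if not rule:
        return True
    step, rest = rule[0], rule[1:]
    if isinstance(step, tuple) and step[0] == "gap":
        _, lo, hi = step
        return any(match_rule(rest, tokens, i + skip)
                   for skip in range(lo, hi + 1))
    return i < len(tokens) and step(tokens[i]) and match_rule(rest, tokens, i + 1)

# The rule from the slide: integer, token gap 0-2, literal 'percent'.
# "5.4 percent" is pre-tokenized into ['5', '.', '4', 'percent'].
rule = [prebuilt_integer, ("gap", 0, 2), literal("percent")]
print(match_rule(rule, ["5", ".", "4", "percent"]))  # True
```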
Slide 10 - How do we learn rules?
Maeda: How does SEER learn rules? For each example the user provides, SEER breaks the example down into its individual tokens. Then, for each token, SEER enumerates the possible primitives. Next, SEER generates candidate rules that extract the example and represents the rules as compact tree structures. The trees are then intersected. Intersection finds the set of rules that extracts all the positive examples; non-intersectable examples result in disjoint trees. After intersection, the rules are pruned, and the rules that survive pruning are suggested to the user. I will walk through each step with an example; the full details of the learning algorithm are in the paper.
Slide 11 - 1) Enumerate possible primitives per token
Maeda: Step 1 is to enumerate the possible primitives per token. The positive example in bold, 5 percent up, is tokenized into its individual tokens: 5, percent, and up. Then, the possible primitives are enumerated per token. 5 can be captured by two pre-builts, number and integer, as well as a literal and a regex. Percent can be captured by token gaps, along with the other types of primitives shown on the slide. 5 percent can be captured by the percentage pre-built. Note that pre-builts can capture multiple tokens.
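As an illustration of this step, a small sketch with a hypothetical enumerate_primitives over an assumed, simplified primitive catalog:

```python
import re

def enumerate_primitives(token):
    """List every primitive that could capture this single token."""
    candidates = [f"Literal('{token}')", f"Dictionary({{'{token}'}})"]
    if token.isdigit():
        candidates += ["Prebuilt(Integer)", "Prebuilt(Number)",
                       r"Regex([0-9]+)"]
    if re.fullmatch(r"[A-Za-z]+", token):
        candidates.append(r"Regex([A-Za-z]+)")
    candidates.append("TokenGap(0-1)")  # a gap can stand in for any token
    return candidates

for tok in ["5", "percent", "up"]:
    print(tok, "->", enumerate_primitives(tok))
```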
Slide 12 - 2) Assign scores to primitives
Maeda: Scores are assigned to primitives in order to guide the search. Depending on the nature of the token, we apply one of two scoring functions. In the first scoring function, if the token represents a semantic entity, i.e., it has a natural meaning, we assign scores such that pre-builts are preferred to literals and dictionaries, which in turn are preferred to token gaps and regexes. In the second scoring function, if the token represents syntax, for example a dash, a colon, or filler articles like 'a' or 'the', we assign scores such that token gaps and regexes are preferred to literals and dictionaries.
Our scoring choices are guided by our observations of how rule developers choose primitives when constructing rules. We conducted a post-hoc study that shows strong agreement between our scoring heuristics and rule developers' preferences for different primitives. We map these preferences to numerical values that serve as primitive scores.
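A sketch of the two scoring functions; the relative ordering follows the heuristics just described, but the numeric values are illustrative assumptions, not the paper's exact scores:

```python
# Illustrative scores only; the exact values SEER assigns may differ.
SEMANTIC_SCORES  = {"prebuilt": 1.0, "literal": 0.8, "dictionary": 0.8,
                    "regex": 0.4, "token_gap": 0.4}
SYNTACTIC_SCORES = {"token_gap": 1.0, "regex": 1.0, "literal": 0.6,
                    "dictionary": 0.6, "prebuilt": 0.2}

def primitive_score(kind, token_is_semantic):
    """Score a primitive according to the kind of token it captures."""
    table = SEMANTIC_SCORES if token_is_semantic else SYNTACTIC_SCORES
    return table[kind]

print(primitive_score("prebuilt", True))    # 1.0: prefer pre-builts for '5'
print(primitive_score("token_gap", False))  # 1.0: prefer gaps for '-'
```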
Slide 13 - 3) Generate rules in a tree
Maeda: Next, the primitives are combined into rules. We use a tree structure, generated per example, to compactly store the candidate rules. This is the tree for the example 5 percent. Each level in the tree belongs to a token, and traversing from root to leaf yields a rule. Here we traverse from the regex to the literal percent, giving the rule: a regex followed by the literal percent.
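A minimal sketch of such a tree, encoded as nested dicts (an assumed encoding), where each level corresponds to a token and every root-to-leaf path spells out one candidate rule:

```python
def build_tree(primitives_per_token):
    """primitives_per_token: one list of candidate primitives per token."""
    if not primitives_per_token:
        return {}
    head, tail = primitives_per_token[0], primitives_per_token[1:]
    return {prim: build_tree(tail) for prim in head}

def rules(tree, prefix=()):
    """Enumerate root-to-leaf paths; each path is one candidate rule."""
    if not tree:
        yield prefix
        return
    for prim, subtree in tree.items():
        yield from rules(subtree, prefix + (prim,))

tree = build_tree([["Prebuilt(Integer)", "Regex([0-9]+)"],    # token '5'
                   ["Literal('percent')", "TokenGap(0-1)"]])  # token 'percent'
for r in rules(tree):
    print(" -> ".join(r))  # e.g. Regex([0-9]+) -> Literal('percent')
```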
Slide 14 - 4) Intersect trees
Maeda: After a tree is generated for each example, the trees are intersected. Intersection finds the set of rules that extracts all the positive examples. In this example, there are two positive examples, 5 percent and 6%, along with their individual trees. The tree on the bottom is the intersected tree, which contains rules capturing both 5 percent and 6%. For example, the dictionary of 5 and 6 followed by the dictionary of percent and the percent symbol is a rule in the intersected tree.
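A simplified sketch of intersecting the candidate primitives at one token position, under the assumption that two distinct literals can merge into a dictionary; with the two examples above, this reproduces the 5/6 and percent/% dictionaries:

```python
def intersect_position(prims_a, prims_b):
    merged = set(prims_a) & set(prims_b)  # primitives shared by both trees
    lits_a = {p for p in prims_a if p.startswith("Literal")}
    lits_b = {p for p in prims_b if p.startswith("Literal")}
    if lits_a and lits_b and lits_a != lits_b:
        # e.g. Literal('percent') and Literal('%') merge into a dictionary
        merged.add("Dictionary(" + ", ".join(sorted(lits_a | lits_b)) + ")")
    return merged

tree_a = [["Prebuilt(Integer)", "Literal('5')"], ["Literal('percent')"]]
tree_b = [["Prebuilt(Integer)", "Literal('6')"], ["Literal('%')"]]
for pos_a, pos_b in zip(tree_a, tree_b):
    print(intersect_position(pos_a, pos_b))
# position 1 keeps Prebuilt(Integer) and merges the literals 5 and 6
# position 2 merges Literal('percent') and Literal('%') into a dictionary
```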
Slide 15 - 5) Prune, rank, and suggest rules to the user
Maeda: The last step is to prune the rules. For longer examples, there may be hundreds of candidate rules, and we can't show all of them to the user. Pruning is based on ranking the rules by their rule score, which is the average of the scores of their composing primitives. The rules are presented to the user in this order.
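A sketch of this ranking step, reusing the illustrative scores from the scoring step; rule_score and the candidate rules shown are assumptions:

```python
def rule_score(primitive_scores):
    """The rule score is the average of its primitives' scores."""
    return sum(primitive_scores) / len(primitive_scores)

candidates = [
    (("Prebuilt(Integer)", "Dictionary('percent', '%')"), [1.0, 0.8]),
    (("Regex([0-9]+)", "TokenGap(0-1)"), [0.4, 0.4]),
]
ranked = sorted(candidates, key=lambda c: rule_score(c[1]), reverse=True)
for rule, scores in ranked:
    print(round(rule_score(scores), 2), " -> ".join(rule))
# 0.9 Prebuilt(Integer) -> Dictionary('percent', '%')
# 0.4 Regex([0-9]+) -> TokenGap(0-1)
```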
In the paper, we describe how we deal with negative examples and how we construct multiple rules if a single rule cannot capture all positive examples.
Slide 16 - Experiments
Maeda: I will now present the experimental results. We wanted to know: do users build more accurate and precise rules in a shorter duration with SEER than without rule suggestions? In other words, does SEER actually help users?
Slide 17 - Experiment Setup
Maeda: In our experiment to test whether rule suggestions are effective, we compared rules that were built manually against rules that were picked from rule suggestions. VINERy is IBM's commercial tool for manually building rules in VAQL, the same language SEER uses. All participants used both VINERy and SEER in random order, and users were trained on each tool. There were two datasets. The order of the tools and the datasets was randomized to minimize learning effects. We measured the duration users took to complete each task and the precision and recall of the rules they built.
Slide 18 - Experiment tasks
Maeda: Participants built rules to complete information extraction tasks within a time limit of 10 minutes. Tasks were presented from easy to hard, where task complexity depended on the minimum number of rules needed to complete the task. Easy tasks required the user to build a rule containing at least one pre-built; for example, the currency amount task, which asked them to extract currency amounts, needed the currency amount pre-built, which captures all money phrases. Medium tasks required at least one rule. Hard tasks required at least two rules to capture the variations in the target extractions.
Slide 19 - Users build better rules faster with SEER than without it
Maeda: This chart shows our results. We measured the duration, precision, and recall of the rules. Users using SEER are the orange dots and users using VINERy are the blue dots. The horizontal axis is duration; the vertical axis refers to the different task complexities. The black lines in the middle indicate the mean duration. Notice that SEER users completed the tasks faster on average than users without SEER. The F1 scores are noted on the right-hand side; F1 is the standard measure for IE, essentially the harmonic mean of precision and recall. For all task types, the F1 score is higher with SEER than without it. The differences in task completion times are statistically significant for easy and hard tasks, and the difference in F1 scores is statistically significant for hard tasks.
Slide 20 - Summary
Maeda: To conclude, SEER combines the advantages of machine learning and rule-based approaches. SEER iteratively learns extraction rules from a small number of user examples. SEER is designed based on how a human user would actually build rules. Finally, our experimental results show that SEER helps users build more accurate and precise rules quickly.
Slide 21 - Future Works
Maeda: For future work, we plan to study effective ways of presenting rules to the user, and to make SEER more interactive. Right now a user highlights examples, clicks the learn button, and then waits until rules are suggested. It would be better if, as the user highlights an example, SEER builds rules on the go and primitives are automatically generated and suggested immediately. With this interaction, users wouldn't have to highlight examples and then wait for rules from the backend.
Slide 22 - Questions?
Maeda: Thank you.
Speaker notes on related work: WHISK handles semantics and syntax but is still dependent on the language surrounding the target extraction, and its training set must contain all possible permutations. RAPIER handles semantics too, but depends on learning the delimiters/surrounding text. STALKER loads data into a tree-like structure.
SEER: Auto-Generating Information Extraction Rules from User-Specified Examples
Maeda F. Hanafi, Azza Abouzied, Laura Chiticariu, Yunyao Li
What is Information Extraction (IE)?
FBI Press Release Dataset

| Document | Crime Changes |
| --- | --- |
| January2016.html | offenses increase of 5.3 percent |
| January2016.html | violent crimes in the US decreased 0.6 percent |
| January2016.html | arson crimes decreased 1.1 percent |
| July2016.html | property crimes rose 4.5 percent |
| July2016.html | 5 percent rise in race-based hate crimes |
| July2016.html | offenses decreased 1.6 percent |
| January2015.html | arson crimes increased 0.3 percent |
| January2015.html | offenses rose 1.6 percent |
| January2015.html | property crimes rose 7.0 percent |
Existing Works in Information Extraction
Hand-written extraction rules:
• Debuggable results
• Customizable rules
• Often used by industry
Supervised machine learning:
• Large labeled datasets
• Requires retraining for new data
• Non-debuggable results
• Quick to generate results
• Popularly researched
Why we built SEER?
• Quick to generate results/rules
• Debuggable results
• Customizable rules
• Additional pros:
  • No programming knowledge required
  • Easily encode domain knowledge; mimics how a human builds rules
Goal and Scope
• Helps end users build rules that specify and extract fixed patterns of text
• Automates the process of manually building rules
• Reduces human errors in constructing rules
Applications of IE
• Social Media Analytics: extract user ratings from Yelp reviews.
• Financial Analytics: extract key people and their relationships with major US companies from SEC data.
• Healthcare Analytics: extract data from health forms to understand the side effects of certain drugs.
How do we learn rules?
1. Enumerate possible primitives per example token
2. Assign scores to primitives
3. Generate rules in a tree for each example
4. Intersect trees to find the set of rules that extract all the given examples
  • Maintain disjoint trees for non-intersectable examples
5. Prune, rank, and suggest rules to user
1) Enumerate possible primitives per token
[Figure: the example "5 percent up" broken into its tokens, with the candidate primitives listed under each token]
2) Assign scores to primitives
[Figure: for the semantic token "Dubai", the literal L: 'Dubai' is preferred over the token gap T: 0-1; for the syntactic token "-", the token gap T: 0-1 is preferred]
Do users build better rules in a shorter amount of time with SEER than without it?
• 13 participants
• All participants used both VINERy and SEER in random order
• VINERy: IBM's drag-and-drop interface for building rules in VAQL manually
• Users were trained for each tool
• Participants built rules to complete information extraction tasks on different datasets
• Order of tools and dataset is randomized to minimize learning effects
• Measured: duration, precision, recall
• IE tasks were designed for novice users to complete in 10 minutes
| Task Type | Dataset | Task | Example |
| --- | --- | --- | --- |
| Easy | IBM | Extract all currency amounts | $4.5 million |
| Easy | FBI | Extract all percentages | 4.5 percent |
| Medium | IBM | Extract cash flow and Original Equipment Management revenues | cash flow of $4… |
| Medium | FBI | Extract all percentage increases or decreases in offenses | offenses rose 4.4… |
| Hard | IBM | Extract all yearly quarter phrases | fourth-quarter 2004 |
| Hard | FBI | Extract all population ranges | 10 to 10,000 in… |
Users build better rules faster with SEER than without it
• Rules in SEER were better in:
  • Duration in creating rules
  • Correctness of rule extractions
• Black lines: mean duration of completing a task
• Gray boxes: 95% mean confidence intervals
• Higher F1 scores indicate that users' rules have higher precision and recall
• SEER combines the advantages of ML and PL
• SEER: an end-to-end IE workflow
  • Synthesizes rules from user examples
  • Uses heuristics to encode domain knowledge: mimics how a human builds rules
• SEER helps users build more accurate and precise rules quickly
• Explore ways to effectively present rules to the user
• Make SEER more interactive
  • On-the-fly suggestions as users highlight an example
  • Reduce the delay between giving examples and system suggestions
Appendix: Related Works
• Previous works in rule synthesis
  • Wrapper induction, e.g. RAPIER, STALKER, etc.
    • Rule language depends on learning the text surrounding the target extraction
    • Limited to structure
  • DSL learns from the surrounding text of the target extraction
    • Doesn't take account of text semantics
• SEER doesn't depend on the structure of the document; it learns the actual content of the extraction
Appendix: Refinements – Conflict Resolution
• Gray out an extraction if:
  • Its rules are all rejected, or
  • Its rules have all been accepted.

| Refinement | Covering Rules |
| --- | --- |
| 50.5 percent | R1, R2 |
| 40 dollars | R2 |
| 25% | R1, R2, R3 |

R1 and R2 are rejected: gray out "40 dollars"
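A minimal sketch of this gray-out rule applied to the table above (the data shapes are assumed); note that under the stated rule, "50.5 percent" is also grayed out alongside "40 dollars", since all of its covering rules are rejected:

```python
def grayed_out(covering_rules, rejected, accepted):
    """Gray out if all covering rules are rejected, or all accepted."""
    rules = set(covering_rules)
    return rules <= rejected or rules <= accepted

covering = {"50.5 percent": {"R1", "R2"},
            "40 dollars":   {"R2"},
            "25%":          {"R1", "R2", "R3"}}
rejected, accepted = {"R1", "R2"}, set()
for text, rules in covering.items():
    print(text, "->", "gray out" if grayed_out(rules, rejected, accepted) else "keep")
# 50.5 percent and 40 dollars gray out; 25% stays (R3 is still undecided)
```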
Appendix: Details on Experimental Results
• Repeated Measures ANOVA
• User programming experience had no impact on performance
• Duration: a significant main effect of tool (F(1,41) = 19.42, p < 0.04), especially for easy and hard tasks
• F1: a significant main effect of tool (F(1,13) = 5.195, p = 0.04), especially for hard tasks
• Precision: True Positives / (True Positives + False Positives)
• Recall (coverage): True Positives / (True Positives + False Negatives)
• F1 score: 2 × (Precision × Recall) / (Precision + Recall)
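A quick worked check of these formulas in Python, with made-up counts:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    # harmonic mean of precision and recall
    return 2 * p * r / (p + r)

p, r = precision(8, 2), recall(8, 4)
print(round(p, 2), round(r, 2), round(f1(p, r), 2))  # 0.8 0.67 0.73
```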