SlideShare a Scribd company logo
Declarative
Multilingual Information Extraction
with SystemT
Alan Akbik and Laura Chiticariu
IBM Research | Almaden
Institute of Computer Science (ICSI), Berkeley, California
11/17/2016
Outline
• Use casesUse casesUse casesUse cases
• RequirementsRequirementsRequirementsRequirements
• SystemTSystemTSystemTSystemT
• SystemT ClassesSystemT ClassesSystemT ClassesSystemT Classes
• Multilingual SRLMultilingual SRLMultilingual SRLMultilingual SRL
2
Case Study 1: Social Data Analysis
Text
Analytics
Product catalog, Customer Master Data, …
Social Media
• Products
Interests360o Profile
• Relationships
• Personal
Attributes
• Life
Events
Statistical
Analysis,
Report Gen.
Bank
6
23%
Bank
5
20%
Bank
4
21%
Bank
3
5%
Bank
2
15%
Bank
1
15%
ScaleScaleScaleScale
450M+ tweets a day,
100M+ consumers, …
BreadthBreadthBreadthBreadth
Buzz, intent, sentiment, life
events, personal atts, …
ComplexityComplexityComplexityComplexity
Sarcasm, wishful thinking, …
Customer 360º
3
Case Study 2: Machine Analysis
Web Server
Logs
CustomCustom
Application
Logs
Database
Logs
Raw Logs
Parsed
Log
Records
Parse Extract
Linkage
Information
End-to-End
Application
Sessions
Integrate
Graphical Models
(Anomaly detection)
Analyze
5 min 0.85
CauseCause EffectEffect DelayDelay CorrelationCorrelation
10 min 0.62
Correlation Analysis
• Every customer has unique
components with unique log record
formats
• Need to quickly customize all
stages of analysis for these custom
log data sources
4
Outline
• Use casesUse casesUse casesUse cases
• RequirementsRequirementsRequirementsRequirements
• SystemTSystemTSystemTSystemT
• SystemT ClassesSystemT ClassesSystemT ClassesSystemT Classes
• Multilingual SRLMultilingual SRLMultilingual SRLMultilingual SRL
5
Enterprise Requirements
• ExpressivityExpressivityExpressivityExpressivity
– Need to express complex algorithms for a variety of tasks and data sources
• ScalabilityScalabilityScalabilityScalability
– Large number of documents and large-size documents
• TransparencyTransparencyTransparencyTransparency
– Every customer’s data and problems are unique in some way
– Need to easily comprehendcomprehendcomprehendcomprehend, debugdebugdebugdebug and enhanceenhanceenhanceenhance extractors
6
Expressivity Example: Different Kinds of Parses
We are raising our tablet forecast.
S
are
NP
We
S
raising
NP
forecastNP
tablet
DET
our
subj
obj
subj pred
Natural Language Machine Log
Dependency
Tree
Oct 1 04:12:24 9.1.1.3 41865:
%PLATFORM_ENV-1-DUAL_PWR: Faulty
internal power supply B detected
Time Oct 1 04:12:24
Host 9.1.1.3
Process 41865
Category
%PLATFORM_ENV-1-
DUAL_PWR
Message
Faulty internal power
supply B detected
7
88
Expressivity Example: Fact Extraction (Tables)
Singapore 2012 Annual Report
(136 pages PDF)
Identify note breaking down
Operating expenses line item,
and extract opex components
Identify line item for Operating
expenses from Income statement
(financial table in pdf document)
• Social Media: Twitter alone has 500M+ messages / day;
1TB+ per day
• Financial Data: SEC alone has 20M+ filings, several TBs of
data, with documents range from few KBs to few MBs
• Machine Data: One application server under moderate load
at medium logging level 1GB of logs per day
Scalability Example
Scalability Example
9
Transparency Example: Need to incorporate in-situ data
Intel's 2013 capex is elevated at 23% of sales, above average of 16%Intel's 2013 capex is elevated at 23% of sales, above average of 16%
FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.
I'm still hearing from clients that Merrill's website is better.I'm still hearing from clients that Merrill's website is better.
Customer or
competitor?
Good or bad?
Entity of interest
I need to go back to Walmart, Toys R Us has the sameI need to go back to Walmart, Toys R Us has the same
toy $10 cheaper!
10
Outline
• Use casesUse casesUse casesUse cases
• RequirementsRequirementsRequirementsRequirements
• SystemTSystemTSystemTSystemT
• SystemT ClassesSystemT ClassesSystemT ClassesSystemT Classes
• Multilingual SRLMultilingual SRLMultilingual SRLMultilingual SRL
11
AQL Language
Optimizer
Operator
Runtime
Specify extractor
semantics declaratively
Choose efficient
execution plan that
implements semantics
Example AQL Extractor
Fundamental Results & TheoremsFundamental Results & TheoremsFundamental Results & TheoremsFundamental Results & Theorems
• Expressivity:Expressivity:Expressivity:Expressivity: The class of extraction tasks expressible in AQL is a strict superset of that
expressible through cascaded regular automata.
• Performance:Performance:Performance:Performance: For any acyclic token-based finite state transducer T, there exists an
operator graph G such that evaluating T and G has the same computational complexity.
create view PersonPhone as
select P.name as person, N.number as phone
from Person P, PhoneNumber N, Sentence S
where
Follows(P.name, N.number, 0, 30)
and Contains(S.sentence, P.name)
and Contains(S.sentence, N.number)
and ContainsRegex(/b(phone|at)b/,
SpanBetween(P.name, N.number));
Within asinglesentence
<Person> <PhoneNum>
0-30chars
Contains“phone” or “at”
Within asinglesentence
<Person> <PhoneNum>
0-30chars
Contains“phone” or “at”
SystemTSystemTSystemTSystemT: IBM Research project started in 2006: IBM Research project started in 2006: IBM Research project started in 2006: IBM Research project started in 2006
More than 35 papers across ACL, EMNLP, VLDB, PODS and SIGMOD
SystemT Architecture
12
Expressivity: Rich Set of Operators
Deep Syntactic Parsing ML Training & Scoring
Core Operators
Tokenization Parts of Speech Dictionaries
Semantic Role Labels
Language to express NLP Algorithms AQL
.Regular
Expressions
HTML/XML
Detagging
Span
Operations
Relational
Operations
Aggregation
Operations
13
package com.ibm.avatar.algebra.util.sentence;
import java.io.BufferedWriter;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.regex.Matcher;
public class SentenceChunker
{
private Matcher sentenceEndingMatcher = null;
public static BufferedWriter sentenceBufferedWriter = null;
private HashSet<String> abbreviations = new HashSet<String> ();
public SentenceChunker ()
{
}
/** Constructor that takes in the abbreviations directly. */
public SentenceChunker (String[] abbreviations)
{
// Generate the abbreviations directly.
for (String abbr : abbreviations) {
this.abbreviations.add (abbr);
}
}
/**
* @param doc the document text to be analyzed
* @return true if the document contains at least one sentence boundary
*/
public boolean containsSentenceBoundary (String doc)
{
String origDoc = doc;
/*
* Based on getSentenceOffsetArrayList()
*/
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
String candidate = /*
* Looks at the last character of the String. If this last
* character is part of an abbreviation (as detected by
* REGEX) then the sentenceString is not a fullSentence and
* "false” is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder)
&& isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
// sentences.addElement(candidate.trim().replaceAll("n", "
// "));
// sentenceArrayList.add(new Integer(currentOffset + boundary
// + 1));
// currentOffset += boundary + 1;
// Found a sentence boundary. If the boundary is the last
// character in the string, we don't consider it to be
// contained within the string.
int baseOffset = currentOffset + boundary + 1;
if (baseOffset < origDoc.length ()) {
// System.err.printf("Sentence ends at %d of %dn",
// baseOffset, origDoc.length());
return true;
}
else {
return false;
}
}
// origDoc.substring(0,currentOffset));
// doc = doc.substring(boundary + 1);
doc = remainder;
}
}
while (boundary != -1);
// If we get here, didn't find any boundaries.
return false;
}
public ArrayList<Integer> getSentenceOffsetArrayList (String doc)
{
ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> ();
// String origDoc = doc;
// int dotpos, quepos, exclpos, newlinepos;
int boundary;
int currentOffset = 0;
sentenceArrayList.add (new Integer (0));
do {
/* Get the next tentative boundary for the sentenceString */
setDocumentForObtainingBoundaries (doc);
boundary = getNextCandidateBoundary ();
if (boundary != -1) {
String candidate = doc.substring (0, boundary + 1);
String remainder = doc.substring (boundary + 1);
/*
* Looks at the last character of the String. If this last character
* is part of an abbreviation (as detected by REGEX) then the
* sentenceString is not a fullSentence and "false" is returned
*/
// while (!(isFullSentence(candidate) &&
// doesNotBeginWithCaps(remainder))) {
while (!(doesNotBeginWithPunctuation (remainder) &&
isFullSentence (candidate))) {
/* Get the next tentative boundary for the sentenceString */
int nextBoundary = getNextCandidateBoundary ();
if (nextBoundary == -1) {
break;
}
boundary = nextBoundary;
candidate = doc.substring (0, boundary + 1);
remainder = doc.substring (boundary + 1);
}
if (candidate.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + boundary + 1));
currentOffset += boundary + 1;
}
// origDoc.substring(0,currentOffset));
doc = remainder;
}
}
while (boundary != -1);
if (doc.length () > 0) {
sentenceArrayList.add (new Integer (currentOffset + doc.length ()));
}
sentenceArrayList.trimToSize ();
return sentenceArrayList;
}
private void setDocumentForObtainingBoundaries (String doc)
{
sentenceEndingMatcher = SentenceConstants.
sentenceEndingPattern.matcher (doc);
}
private int getNextCandidateBoundary ()
{
if (sentenceEndingMatcher.find ()) {
return sentenceEndingMatcher.start ();
}
else
return -1;
}
private boolean doesNotBeginWithPunctuation (String remainder)
{
Matcher m = SentenceConstants.punctuationPattern.matcher (remainder);
return (!m.find ());
}
private String getLastWord (String cand)
{
Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand);
if (lastWordMatcher.find ()) {
return lastWordMatcher.group ();
}
else {
return "";
}
}
/*
* Looks at the last character of the String. If this last character is
* par of an abbreviation (as detected by REGEX)
* then the sentenceString is not a fullSentence and "false" is returned
*/
private boolean isFullSentence (String cand)
{
// cand = cand.replaceAll("n", " "); cand = " " + cand;
Matcher validSentenceBoundaryMatcher =
SentenceConstants.validSentenceBoundaryPattern.matcher (cand);
if (validSentenceBoundaryMatcher.find ()) return true;
Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand);
if (abbrevMatcher.find ()) {
return false; // Means it ends with an abbreviation
}
else {
// Check if the last word of the sentenceString has an entry in the
// abbreviations dictionary (like Mr etc.)
String lastword = getLastWord (cand);
if (abbreviations.contains (lastword)) { return false; }
}
return true;
}
}
Java Implementation of Sentence Boundary Detection
Expressivity: AQL vs. Custom Java Code
create dictionary AbbrevDict from file
'abbreviation.dict’;
create view SentenceBoundary as
select R.match as boundary
from ( extract regex /(([.?!]+s)|(ns*n))/
on D.text as match from Document D ) R
where
Not(ContainsDict('AbbrevDict',
CombineSpans(LeftContextTok(R.match, 1),R.match)));
Equivalent AQL Implementation
14
Transparency in SystemT
PersonPhone
• Clear and simple provenanceprovenanceprovenanceprovenance
Person PhonePerson
Anna at James St. office (555-5555) .
[Liu et al., VLDB’10]
Ease of comprehension
Ease of debugging
Ease of enhancement
• Explains output data in terms of input data, intermediate data, and the transformation
• For predicate-based rule languages (e.g., SQL), can be computed automatically
create view PersonPhone as
select P.name as person, N.number as phone
from Person P, PhoneNumber N
where Follows(P.name, N.number, 0, 30);
personpersonpersonperson phonephonephonephone
Anna 555-5555
James 555-5555
Person
namenamenamename
Anna
James
PhoneNumber
numbernumbernumbernumber
555-5555t1
t2
t3
t1 ∧ t3
t2 ∧ t3
15
Machine Learning in SystemT
• AQL provides a foundation of transparency
• Next step: Add machine learning without losing transparency
• Major machine learning efforts:
– Embeddable Models in AQL
– Learning using AQL as target language
16
Machine Learning in SystemT
• AQL provides a foundation of transparency
• Next step: Add machine learning without losing transparency
• Major machine learning efforts
– Embeddable Models in AQL
• Model Training and Scoring integration with SystemML
• Semantic Role LabelingSemantic Role LabelingSemantic Role LabelingSemantic Role Labeling [Akbik et al., ACL’15,’16, EMNLP’16, COLING’16]
• Text Normalization [Zhang et al. ACL’13, Li & Baldwin, NAACL ‘15]
– Learning using AQL as target language
17
Machine Learning in SystemT
• AQL provides a foundation of transparency
• Next step: Add machine learning without losing transparency
• Major machine learning efforts
– Embeddable Models in AQL
– Learning using AQL as target language
• Low-level features: regular expressions [Li et al., EMNLP’08],
dictionaries [Li et al., CIKM’11, Roy et al., SIGMOD’13]
• Rule induction [Nagesh et al., EMNLP’12]
• Rule refinement [Liu et al., VLDB’10]
• Ongoing research
18
Web Tools Overview
Ease of
Programming
Ease of
Sharing
19
Outline
• Use casesUse casesUse casesUse cases
• RequirementsRequirementsRequirementsRequirements
• SystemTSystemTSystemTSystemT
• SystemT ClassesSystemT ClassesSystemT ClassesSystemT Classes
• Multilingual SRLMultilingual SRLMultilingual SRLMultilingual SRL
20
SystemT in Universities
2013 2015 2016 2017
• UC Santa Cruz
(full Graduate class)
2014
• U. Washington (Grad)
• U. Oregon (Undergrad)
• U. Aalborg, Denmark (Grad)
• UIUC (Grad)
• U. Maryland Baltimore County
(Undergrad)
• UC Irvine (Grad)
• NYU Abu-Dhabi (Undergrad)
• U. Washington (Grad)
• U. Oregon (Undergrad)
• U. Maryland Baltimore County
(Undergrad)
•
• UC Santa Cruz, 3 lectures
in one Grad class
SystemT MOOC
21
AlonAlonAlonAlon Halevy, November 2015Halevy, November 2015Halevy, November 2015Halevy, November 2015
"The tutorial of System T was an important hands-on component of a Ph.D
course on Structured Data on the Web at the University of Aalborg in
Denmark taught by Alon Halevy. The tutorial did a great job giving the
students the feeling for the challenges involved in extracting structured data
from text.“
ShimeiShimeiShimeiShimei Pan, University of Maryland, Baltimore County, May 2016Pan, University of Maryland, Baltimore County, May 2016Pan, University of Maryland, Baltimore County, May 2016Pan, University of Maryland, Baltimore County, May 2016
"The Information Extraction with SystemT class, which was offered for the
first time at UMBC, has been a wonderful experience for me and my
students. I really enjoyed teaching this course. Students were also very
enthusiastic. Based on the feedback from my students, they have learned a lot.
Some of my students even want me to offer this class every semester.“
SystemT in Universities: Testimonials
22
• Professors like teaching it
• Students do interesting projects and have fun!
• Identify character relationships in Games of Thrones
• Determine the cause and treatment for PTSD patients
• Extract job experiences and skills from resumes
• …
SystemT in Universities: Summary
23
SystemT MOOC
• Bring SystemT to 100,000+ studentsBring SystemT to 100,000+ studentsBring SystemT to 100,000+ studentsBring SystemT to 100,000+ students
• Target Audience:Target Audience:Target Audience:Target Audience: NLP Engineer, Data Scientist
e.g., University students (grad, undergrad), industry professionals
• Learning Path with two classesLearning Path with two classesLearning Path with two classesLearning Path with two classes
TA0105: Text Analytics: Getting Results with SystemT
– Basics of IE with SystemT, learning to write extractors using the web tool,
learn AQL
– 5 lectures + 1 hands-on lab + final exam
TA0106EN: Advanced Text Analytics: Getting Results with SystemT
– Advanced material on declarative approach and SystemT optimization
– 2 lectures + final exam
• Bring SystemT to 100,000+ studentsBring SystemT to 100,000+ studentsBring SystemT to 100,000+ studentsBring SystemT to 100,000+ students
• Target Audience:Target Audience:Target Audience:Target Audience: NLP Engineer, Data Scientist
e.g., University students (grad, undergrad), industry professionals
• Learning Path with two classesLearning Path with two classesLearning Path with two classesLearning Path with two classes
TA0105: Text Analytics: Getting Results with SystemT
– Basics of IE with SystemT, learning to write extractors using the web tool,
learn AQL
– 5 lectures + 1 hands-on lab + final exam
TA0106EN: Advanced Text Analytics: Getting Results with SystemT
– Advanced material on declarative approach and SystemT optimization
– 2 lectures + final exam
24
Outline
• Use casesUse casesUse casesUse cases
• RequirementsRequirementsRequirementsRequirements
• SystemTSystemTSystemTSystemT
• SystemT ClassesSystemT ClassesSystemT ClassesSystemT Classes
• Multilingual SRLMultilingual SRLMultilingual SRLMultilingual SRL
25
• 7,102 known languages
• 23 most spoken language
> 4.1 billion people
https://walizahid.com/2015/06/the-worlds-most-spoken-languages/
• Our goal: Enable
Text Analytics
over Multilingual
Text
The World’s Most Spoken Languages
26
ProblemProblemProblemProblem: High costs to handle
multilingual data
English
NLP
English Text
Analytics
German
NLP
German Text
Analytics
Current StatusCurrent StatusCurrent StatusCurrent Status: - One language at a time;
- Query language and data must be in the same language
SeparateSeparateSeparateSeparate parsers /
features for each
language
French
NLP
French Text
Analytics
Chinese
NLP
Chinese Text
Analytics
SeparateSeparateSeparateSeparate text analyics
must be developed
EnglishEnglishEnglishEnglish
Text Data
ChineseChineseChineseChinese
Text Data
FrenchFrenchFrenchFrench
Text Data
GermanGermanGermanGerman
Text Data
CrosslingualCrosslingualCrosslingualCrosslingual Text AnalyticsText AnalyticsText AnalyticsText Analytics: Parse and develop against unified abstraction;
Unified
NLP
Crosslingual
TextAnalytics
Semantic parserSemantic parserSemantic parserSemantic parser
• arbitrary languages
• unified feature setunified feature setunified feature setunified feature set
Universal extractorsUniversal extractorsUniversal extractorsUniversal extractors
• Develop onceonceonceonce
• Run for all languages
Proposed Approach: Crosslingual Text Analytics
27
SRL focuses on semantics, not syntax
Useful for many applications (Shen and
Lapata, 2007; Maqsud et al., 2014)
FrameFrameFrameFrame semanticssemanticssemanticssemantics crosslinguallycrosslinguallycrosslinguallycrosslingually valid?valid?valid?valid?
Dirk broke the window with a hammer.
Break.01A0 A1 A2
The window was broken by Dirk.
The window broke.
A1 Break.01 A0
A1 Break.01
Break.01
A0 – Breaker
A1 – Thing broken
A2 – Instrument
A3 – Pieces
Break.15
A0 – Journalist,
exposer
A1 – Story,
thing exposed
• Problem: What type of labels are valid across languages?
• Lexical, morphological and syntactic labels differ greatly
• But shallow semantic labelsshallow semantic labelsshallow semantic labelsshallow semantic labels remain stable
Semantic Role Labeling
28
The station bought Cosby reruns for record prices
Buy.01A0
spurred dollar buying by Japanese institutions.
Spur.01 A1 Buy.01 A0
The company payed 2 million US$ to buy .
Corpus of annotated text dataFrame set
A1 A3
Buy.01
A0 – Buyer
A1 – Thing bought
A2 – Seller
A3 – Price paid
A4 – Benefactive
Pay.01
A0 – Payer
A1 – Money
A2 – Being payed
A3 – Commodity
Buy.01Pay.01A0 A1
• English SRL
– FrameNet
– PropBank
SRL Resources
29
• English SRL
– FrameNet
– PropBank
30
1. Limited coverage
2. Language-specific formalisms
• Other languages
– Chinese Proposition Bank
– Hindi Proposition Bank
– German FrameNet
– French? Spanish? Russian? Arabic? …
订购
A0: buyer
A1: commodity
A2: seller
order.02
A0: orderer
A1: thing ordered
A2: benefactive, ordered-for
A3: source
SRL for other languages?
We want different languages to share the same semantic labelsWe want different languages to share the same semantic labelsWe want different languages to share the same semantic labelsWe want different languages to share the same semantic labels
WhatsApp was bought by Facebook
Facebook hat WhatsApp gekauft
Facebook a achété WhatsApp
buy.01
Facebook WhatsApp
Crosslingual representationMultilingual input text
Buy.01 A0A1
Buy.01A1A0
Buy.01A0 A1
Shared Frames Across Languages
31
• 1. Generation of Proposition Banks for Arbitrary Languages
• 2. Development of a SotA Semantic Role Labeler
• 3. Application to Use Cases
Research
32
Our Goal
• Generate SRL resources for many other languages
• Shared frame set
• Minimal effort
Frame set
Buy.01
A0 – Buyer
A1 – Thing bought
A2 – Seller
A3 – Price paid
A4 – Benefactive
Pay.01
A0 – Payer
A1 – Money
A2 – Being payed
A3 – Commodity
Il faut qu‘ il y ait des responsables
Need.01A0
Je suis responsable pour le chaos
Be.01A1 A2 AM-PRD
Les services postaux ont achété des
Be.01 A2A1
Buy.01A0
Corpus of annotated text data
33
Example: TV subtitles
Generating Universal Proposition Banks
Current practice:
Annotator training
(months)
Annotation
(years)
Must be
repeated for
each language!
Idea: Annotation projection in parallel corpora
Das würde ich für einen Dollar kaufen German subtitles
I would buy that for a dollar! English subtitles
PRICEBUYER ITEM
BUYERITEM
I would buy that for a dollar!
PRICE
projection
Das würde ich für einen Dollar kaufenDas würde ich für einen Dollar kaufen
34
However: Projections Not Always Possible
We need to hold people responsible
Il faut qu‘ il y ait des responsables
English sentence:
Target sentence:
Hold.01
A0
A1 A3
Need.01
Hold.01
Incorrect projection!
There need to be those responsible
A1
Main error sources:Main error sources:Main error sources:Main error sources:
• Translation shift
• Source-language SRL errors
35
• Two-step process of filtering and bootstrapping (ACL ‘15)
– Filters to detect translation shift, block projections (more precision at cost
of recall)
– Bootstrap learning to increase recall
– Generate 7 universal proposition banks7 universal proposition banks7 universal proposition banks7 universal proposition banks from 3 language groups
Improving Projection Quality
36
• Online demo for SRL in 9 languages (ACL ‘16)
Train Multilingual SRL
37
Low-resource Languages & Crowdsourcing
• Apply approach to low-resource languages
(EMNLP ‘16)
– Bengali, Malayalam, Tamil
– Fewer sources of parallel data
– Almost no NLP: No syntactic parsing,
lemmatization etc.
• Crowdsourcing for data curation
38
Multilingual Aliasing
• Expert curation of frame mappings (COLING ‘16)
• ProblemProblemProblemProblem: Target language frame lexicon automatically
generated from alignments
– False frames
– Redundant frames
• Two-step expert curation approach
– Step 1: Filter
– Step 2: Merge
• Example: drehen
– “to turn” (rotate)
– “to film”
39
Original
drehen.
04 SHOOT.03
record on film
A0: videographer
A1: subject filmed
A2: medium
drehen.
01 TURN.01
rotation
A0: turner
A1: thing turning
Am: direction
drehen.
02 TURN.02
transformation
A0: causer of
transformation
A1: thing changing
A2: end state
A3: start state
record on film
A0: recorder
A1: thing filmed
drehen.
03 FILM.01
incorrect frame,
filtered
Merged
synonyms,
merged
Predicate: drehendrehendrehendrehen
Filtered
drehen.
01 TURN.01
rotation
A0: turner
A1: thing turning
Am: direction
record on film
A0: recorder
A1: thing filmed
drehen.
03 FILM.01
drehen.
01 TURN.01
rotation
A0: turner
A1: thing turning
Am: direction
drehen.
03
FILM.01
SHOOT.03
record on film
A0: recorder
A1: thing filmed
record on film
A0: videographer
A1: subject filmed
A2: medium
drehen.
04 SHOOT.03
Multilingual Aliasing: Example
40
Multilingual Aliasing
• Expert curation of frame mappings (COLING ‘16)
– 60 person hours per language
– Chinese, French and German
– More salient senses, improved quality
41
Automatic Generation of Proposition Banks
• Project English PropBank frame and role annotation onto
target language sentences
• A variety of strategies to increase quality:
– Filtering of translation shift
– Crowdsourced data curation
– Expert frame alignments
• Result: Generated universal proposition banks for 8universal proposition banks for 8universal proposition banks for 8universal proposition banks for 8
languageslanguageslanguageslanguages
• Currently preparing open source releaseCurrently preparing open source releaseCurrently preparing open source releaseCurrently preparing open source release
42
• 1. Generation of Proposition Banks for Arbitrary Languages
• 2. Development of a SotA Semantic Role Labeler
• 3. Application to Use Cases
Research
43
• ProblemProblemProblemProblem: Current SRL error-prone
• Strong local bias, many lowStrong local bias, many lowStrong local bias, many lowStrong local bias, many low----frequency exceptions, difficult to generalizefrequency exceptions, difficult to generalizefrequency exceptions, difficult to generalizefrequency exceptions, difficult to generalize
Instance-based Semantic Role Labeling
44
• ProblemProblemProblemProblem: Current SRL error-prone
• Strong local biasStrong local biasStrong local biasStrong local bias
• Many lowMany lowMany lowMany low----frequency exceptions, difficult to generalizefrequency exceptions, difficult to generalizefrequency exceptions, difficult to generalizefrequency exceptions, difficult to generalize
Instance-based Semantic Role Labeling
• K-SRL: Instance-based
learning
– K-nearest neighbors
• Composite features
• Significantly outperform
all other current systems
In-domain Out-of-domain
45
• 1. Generation of Proposition Banks for Arbitrary Languages
• 2. Development of a SotA Semantic Role Labeler
• 3. Application to Use Cases
Research
46
ActionAPI - Demo
47
Thank you!
• For more information…
– Visit our website:Visit our website:Visit our website:Visit our website:
http://ibm.co/1Cdm1Mj
– Visit BigInsights Knowledge Center:Visit BigInsights Knowledge Center:Visit BigInsights Knowledge Center:Visit BigInsights Knowledge Center:
http://ibm.co/1DIouEvhttp://ibm.co/1DIouEvhttp://ibm.co/1DIouEvhttp://ibm.co/1DIouEv
– Learning at your own pace:Learning at your own pace:Learning at your own pace:Learning at your own pace: SystemT MOOCSystemT MOOCSystemT MOOCSystemT MOOC
– Contact usContact usContact usContact us
• akbika@us.ibm.com
• chiti@us.ibm.com
SystemT Team:
Alan Akbik, Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy,
Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu
48
Declarative AQL language
Development Environment
AQL Extractor
create view ProductMention as
select ...
from ...
where ...
create view IntentToBuy as
select ...
from ...
where ... Cost-based
optimization
...
SystemT: Declarative Information Extraction
Discovery tools for AQL
development
SystemT RuntimeSystemT Runtime
Input
Documents
Extracted
Objects
Challenge: Building extractors for enterprise applications requires an information extraction system that is expressive,
efficient, transparent and usable. Existing solutions are either rule-based solutions based on cascading grammar with
expressivity and efficiency issues, or black-box solutions based on machine learning with lack of transparency.
Our Solution: A declarative information extraction system with cost-based optimization, high-performance runtime and
novel development tooling based on solid theoretical foundation [PODS’13, PODS’14], shipping with over 10+ IBM products.
AQL: a declarative language that can be used to build extractors
outperforming the state-of-the-arts [ACL’10]
Multilingual SRL-enabled: [ACL’15, ACL’16, EMNLP’16, COLING’16]
A suite of novel development tooling leveraging
machine learning and HCI [EMNLP’08, VLDB’10,
ACL’11, CIKM’11, ACL’12, EMNLP’12, CHI’13,
SIGMOD’13, ACL’13,VLDB’15, NAACL’15]
Cost-based optimization for
text-centric operations
[ICDE’08, ICDE’11, FPL’13, FPL’14]
Highly embeddable runtime
with high-throughput and
small memory footprint.
[SIGMOD Record’09, SIGMOD’09]
For details and
Online Class visit:
https://ibm.biz/BdF4GQ
49

More Related Content

What's hot

From keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic searchFrom keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic search
CareerBuilder.com
 
A practical introduction to SADI semantic Web services and HYDRA query tool
A practical introduction to SADI semantic Web services and HYDRA query toolA practical introduction to SADI semantic Web services and HYDRA query tool
A practical introduction to SADI semantic Web services and HYDRA query tool
Alexandre Riazanov
 
IRJET- An Efficient Way to Querying XML Database using Natural Language
IRJET-  	  An Efficient Way to Querying XML Database using Natural LanguageIRJET-  	  An Efficient Way to Querying XML Database using Natural Language
IRJET- An Efficient Way to Querying XML Database using Natural Language
IRJET Journal
 
Graph Based Methods For The Representation And Analysis Of.
Graph Based Methods For The Representation And Analysis Of.Graph Based Methods For The Representation And Analysis Of.
Graph Based Methods For The Representation And Analysis Of.legal2
 
Automatic suggestion of query-rewrite rules for enterprise search
Automatic suggestion of query-rewrite rules for enterprise searchAutomatic suggestion of query-rewrite rules for enterprise search
Automatic suggestion of query-rewrite rules for enterprise search
Yunyao Li
 
Email Data Cleaning
Email Data CleaningEmail Data Cleaning
Email Data Cleaningfeiwin
 
Linking data without common identifiers
Linking data without common identifiersLinking data without common identifiers
Linking data without common identifiers
Lars Marius Garshol
 
Exploiting Structure in Representation of Named Entities using Active Learning
Exploiting Structure in Representation of Named Entities using Active LearningExploiting Structure in Representation of Named Entities using Active Learning
Exploiting Structure in Representation of Named Entities using Active Learning
Yunyao Li
 
Vespa, A Tour
Vespa, A TourVespa, A Tour
Vespa, A Tour
MatthewOverstreet2
 
Taming the Wild West of NLP
Taming the Wild West of NLPTaming the Wild West of NLP
Taming the Wild West of NLP
Yunyao Li
 
Deductive databases
Deductive databasesDeductive databases
Deductive databases
Dabbal Singh Mahara
 
Datalog+-Track Introduction & Reasoning on UML Class Diagrams via Datalog+-
Datalog+-Track Introduction & Reasoning on UML Class Diagrams via Datalog+-Datalog+-Track Introduction & Reasoning on UML Class Diagrams via Datalog+-
Datalog+-Track Introduction & Reasoning on UML Class Diagrams via Datalog+-
RuleML
 
C+++
C+++C+++
Deductive Database
Deductive DatabaseDeductive Database
Deductive Database
Nikhil Sharma
 
Haystacks slides
Haystacks slidesHaystacks slides
Haystacks slides
Ted Sullivan
 
Paper id 42201608
Paper id 42201608Paper id 42201608
Paper id 42201608
IJRAT
 
NLP applied to French legal decisions
NLP applied to French legal decisionsNLP applied to French legal decisions
NLP applied to French legal decisions
Michael BENESTY
 
Christian gierds 2013-06-20-c ai-se
Christian gierds 2013-06-20-c ai-seChristian gierds 2013-06-20-c ai-se
Christian gierds 2013-06-20-c ai-secaise2013vlc
 
Human in the Loop AI for Building Knowledge Bases
Human in the Loop AI for Building Knowledge Bases Human in the Loop AI for Building Knowledge Bases
Human in the Loop AI for Building Knowledge Bases
Yunyao Li
 
Icpc11c.ppt
Icpc11c.pptIcpc11c.ppt

What's hot (20)

From keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic searchFrom keyword-based search to language-agnostic semantic search
From keyword-based search to language-agnostic semantic search
 
A practical introduction to SADI semantic Web services and HYDRA query tool
A practical introduction to SADI semantic Web services and HYDRA query toolA practical introduction to SADI semantic Web services and HYDRA query tool
A practical introduction to SADI semantic Web services and HYDRA query tool
 
IRJET- An Efficient Way to Querying XML Database using Natural Language
IRJET-  	  An Efficient Way to Querying XML Database using Natural LanguageIRJET-  	  An Efficient Way to Querying XML Database using Natural Language
IRJET- An Efficient Way to Querying XML Database using Natural Language
 
Graph Based Methods For The Representation And Analysis Of.
Graph Based Methods For The Representation And Analysis Of.Graph Based Methods For The Representation And Analysis Of.
Graph Based Methods For The Representation And Analysis Of.
 
Automatic suggestion of query-rewrite rules for enterprise search
Automatic suggestion of query-rewrite rules for enterprise searchAutomatic suggestion of query-rewrite rules for enterprise search
Automatic suggestion of query-rewrite rules for enterprise search
 
Email Data Cleaning
Email Data CleaningEmail Data Cleaning
Email Data Cleaning
 
Linking data without common identifiers
Linking data without common identifiersLinking data without common identifiers
Linking data without common identifiers
 
Exploiting Structure in Representation of Named Entities using Active Learning
Exploiting Structure in Representation of Named Entities using Active LearningExploiting Structure in Representation of Named Entities using Active Learning
Exploiting Structure in Representation of Named Entities using Active Learning
 
Vespa, A Tour
Vespa, A TourVespa, A Tour
Vespa, A Tour
 
Taming the Wild West of NLP
Taming the Wild West of NLPTaming the Wild West of NLP
Taming the Wild West of NLP
 
Deductive databases
Deductive databasesDeductive databases
Deductive databases
 
Datalog+-Track Introduction & Reasoning on UML Class Diagrams via Datalog+-
Datalog+-Track Introduction & Reasoning on UML Class Diagrams via Datalog+-Datalog+-Track Introduction & Reasoning on UML Class Diagrams via Datalog+-
Datalog+-Track Introduction & Reasoning on UML Class Diagrams via Datalog+-
 
C+++
C+++C+++
C+++
 
Deductive Database
Deductive DatabaseDeductive Database
Deductive Database
 
Haystacks slides
Haystacks slidesHaystacks slides
Haystacks slides
 
Paper id 42201608
Paper id 42201608Paper id 42201608
Paper id 42201608
 
NLP applied to French legal decisions
NLP applied to French legal decisionsNLP applied to French legal decisions
NLP applied to French legal decisions
 
Christian gierds 2013-06-20-c ai-se
Christian gierds 2013-06-20-c ai-seChristian gierds 2013-06-20-c ai-se
Christian gierds 2013-06-20-c ai-se
 
Human in the Loop AI for Building Knowledge Bases
Human in the Loop AI for Building Knowledge Bases Human in the Loop AI for Building Knowledge Bases
Human in the Loop AI for Building Knowledge Bases
 
Icpc11c.ppt
Icpc11c.pptIcpc11c.ppt
Icpc11c.ppt
 

Viewers also liked

Ciencia ficica
Ciencia ficica Ciencia ficica
Ciencia ficica cabral78
 
10.09.16 Patrice D'Eramo - Leaders as Teachers
10.09.16 Patrice D'Eramo - Leaders as Teachers10.09.16 Patrice D'Eramo - Leaders as Teachers
10.09.16 Patrice D'Eramo - Leaders as Teachers
Patrice D'Eramo
 
Решения в области запорной арматуры
Решения в области запорной арматурыРешения в области запорной арматуры
Решения в области запорной арматуры
BALTIC CLEANTECH ALLIANCE
 
Auckland grammar school – trường phổ thông nam sinh hàng đầu new zealand
Auckland grammar school – trường phổ thông nam sinh hàng đầu new zealandAuckland grammar school – trường phổ thông nam sinh hàng đầu new zealand
Auckland grammar school – trường phổ thông nam sinh hàng đầu new zealand
Baydepdulich
 
Media Studies - Evaluation Question 3
Media Studies - Evaluation Question 3Media Studies - Evaluation Question 3
Media Studies - Evaluation Question 3
Aidan Emberton
 
Unidad 1. elementos de la comunicacion. (anita)
Unidad 1. elementos de la comunicacion. (anita)Unidad 1. elementos de la comunicacion. (anita)
Unidad 1. elementos de la comunicacion. (anita)mielesortizangela18
 
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
Laura Chiticariu
 
Trabalho de encerramento da Oficina Mídias Sociais
Trabalho de encerramento da Oficina Mídias SociaisTrabalho de encerramento da Oficina Mídias Sociais
Trabalho de encerramento da Oficina Mídias Sociais
Fundação Educandário "Cel. Quito Junqueira"
 
Canal de Compliance - Canal de Denuncias
Canal de Compliance - Canal de DenunciasCanal de Compliance - Canal de Denuncias
Canal de Compliance - Canal de Denuncias
Juan Bosco Gimeno
 

Viewers also liked (14)

Ciencia ficica
Ciencia ficica Ciencia ficica
Ciencia ficica
 
Danhsach key
Danhsach keyDanhsach key
Danhsach key
 
10.09.16 Patrice D'Eramo - Leaders as Teachers
10.09.16 Patrice D'Eramo - Leaders as Teachers10.09.16 Patrice D'Eramo - Leaders as Teachers
10.09.16 Patrice D'Eramo - Leaders as Teachers
 
Alaa c.v (2)
Alaa c.v (2)Alaa c.v (2)
Alaa c.v (2)
 
Norton internet security clave
Norton internet security claveNorton internet security clave
Norton internet security clave
 
Assam Tribune Epaper
Assam Tribune EpaperAssam Tribune Epaper
Assam Tribune Epaper
 
Решения в области запорной арматуры
Решения в области запорной арматурыРешения в области запорной арматуры
Решения в области запорной арматуры
 
Uso de Teamviewer
Uso de TeamviewerUso de Teamviewer
Uso de Teamviewer
 
Auckland grammar school – trường phổ thông nam sinh hàng đầu new zealand
Auckland grammar school – trường phổ thông nam sinh hàng đầu new zealandAuckland grammar school – trường phổ thông nam sinh hàng đầu new zealand
Auckland grammar school – trường phổ thông nam sinh hàng đầu new zealand
 
Media Studies - Evaluation Question 3
Media Studies - Evaluation Question 3Media Studies - Evaluation Question 3
Media Studies - Evaluation Question 3
 
Unidad 1. elementos de la comunicacion. (anita)
Unidad 1. elementos de la comunicacion. (anita)Unidad 1. elementos de la comunicacion. (anita)
Unidad 1. elementos de la comunicacion. (anita)
 
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
SystemT: Declarative Information Extraction (invited talk at MIT CSAIL)
 
Trabalho de encerramento da Oficina Mídias Sociais
Trabalho de encerramento da Oficina Mídias SociaisTrabalho de encerramento da Oficina Mídias Sociais
Trabalho de encerramento da Oficina Mídias Sociais
 
Canal de Compliance - Canal de Denuncias
Canal de Compliance - Canal de DenunciasCanal de Compliance - Canal de Denuncias
Canal de Compliance - Canal de Denuncias
 

Similar to Declarative Multilingual Information Extraction with SystemT

Implementing the Genetic Algorithm in XSLT: PoC
Implementing the Genetic Algorithm in XSLT: PoCImplementing the Genetic Algorithm in XSLT: PoC
Implementing the Genetic Algorithm in XSLT: PoC
jimfuller2009
 
2015 07-30-sysetm t.short.no.animation
2015 07-30-sysetm t.short.no.animation2015 07-30-sysetm t.short.no.animation
2015 07-30-sysetm t.short.no.animation
diannepatricia
 
NLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobil
NLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobilNLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobil
NLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobil
Databricks
 
2011-03-29 London - drools
2011-03-29 London - drools2011-03-29 London - drools
2011-03-29 London - droolsGeoffrey De Smet
 
Get full visibility and find hidden security issues
Get full visibility and find hidden security issuesGet full visibility and find hidden security issues
Get full visibility and find hidden security issues
Elasticsearch
 
Use Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdfUse Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdf
M Waleed Kadous
 
Resilience Engineering: A field of study, a community, and some perspective s...
Resilience Engineering: A field of study, a community, and some perspective s...Resilience Engineering: A field of study, a community, and some perspective s...
Resilience Engineering: A field of study, a community, and some perspective s...
John Allspaw
 
Performance Analysis of Leading Application Lifecycle Management Systems for...
Performance Analysis of Leading Application Lifecycle  Management Systems for...Performance Analysis of Leading Application Lifecycle  Management Systems for...
Performance Analysis of Leading Application Lifecycle Management Systems for...
Daniel van den Hoven
 
are algorithms really a black box
are algorithms really a black boxare algorithms really a black box
are algorithms really a black box
Ansgar Koene
 
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
Miguel Gallardo
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
Richard Garris
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
Ian Foster
 
Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Ravi Okade
 
Xml processing-by-asfak
Xml processing-by-asfakXml processing-by-asfak
Xml processing-by-asfak
Asfak Mahamud
 
Beyond PITS, Functional Principles for Software Architecture
Beyond PITS, Functional Principles for Software ArchitectureBeyond PITS, Functional Principles for Software Architecture
Beyond PITS, Functional Principles for Software Architecture
Jayaram Sankaranarayanan
 
Integration&SOA_v0.2
Integration&SOA_v0.2Integration&SOA_v0.2
Integration&SOA_v0.2Sergey Popov
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
Kaxil Naik
 
Trends in Programming Technology you might want to keep an eye on af Bent Tho...
Trends in Programming Technology you might want to keep an eye on af Bent Tho...Trends in Programming Technology you might want to keep an eye on af Bent Tho...
Trends in Programming Technology you might want to keep an eye on af Bent Tho...
InfinIT - Innovationsnetværket for it
 
Using Pony for Fintech
Using Pony for FintechUsing Pony for Fintech
Using Pony for Fintech
C4Media
 
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
HostedbyConfluent
 

Similar to Declarative Multilingual Information Extraction with SystemT (20)

Implementing the Genetic Algorithm in XSLT: PoC
Implementing the Genetic Algorithm in XSLT: PoCImplementing the Genetic Algorithm in XSLT: PoC
Implementing the Genetic Algorithm in XSLT: PoC
 
2015 07-30-sysetm t.short.no.animation
2015 07-30-sysetm t.short.no.animation2015 07-30-sysetm t.short.no.animation
2015 07-30-sysetm t.short.no.animation
 
NLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobil
NLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobilNLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobil
NLP-Focused Applied ML at Scale for Global Fleet Analytics at ExxonMobil
 
2011-03-29 London - drools
2011-03-29 London - drools2011-03-29 London - drools
2011-03-29 London - drools
 
Get full visibility and find hidden security issues
Get full visibility and find hidden security issuesGet full visibility and find hidden security issues
Get full visibility and find hidden security issues
 
Use Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdfUse Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdf
 
Resilience Engineering: A field of study, a community, and some perspective s...
Resilience Engineering: A field of study, a community, and some perspective s...Resilience Engineering: A field of study, a community, and some perspective s...
Resilience Engineering: A field of study, a community, and some perspective s...
 
Performance Analysis of Leading Application Lifecycle Management Systems for...
Performance Analysis of Leading Application Lifecycle  Management Systems for...Performance Analysis of Leading Application Lifecycle  Management Systems for...
Performance Analysis of Leading Application Lifecycle Management Systems for...
 
are algorithms really a black box
are algorithms really a black boxare algorithms really a black box
are algorithms really a black box
 
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
Code decoupling from Symfony (and others frameworks) - PHP Conference Brasil ...
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 
Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)Optimizing Application Architecture (.NET/Java topics)
Optimizing Application Architecture (.NET/Java topics)
 
Xml processing-by-asfak
Xml processing-by-asfakXml processing-by-asfak
Xml processing-by-asfak
 
Beyond PITS, Functional Principles for Software Architecture
Beyond PITS, Functional Principles for Software ArchitectureBeyond PITS, Functional Principles for Software Architecture
Beyond PITS, Functional Principles for Software Architecture
 
Integration&SOA_v0.2
Integration&SOA_v0.2Integration&SOA_v0.2
Integration&SOA_v0.2
 
Building and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache AirflowBuilding and deploying LLM applications with Apache Airflow
Building and deploying LLM applications with Apache Airflow
 
Trends in Programming Technology you might want to keep an eye on af Bent Tho...
Trends in Programming Technology you might want to keep an eye on af Bent Tho...Trends in Programming Technology you might want to keep an eye on af Bent Tho...
Trends in Programming Technology you might want to keep an eye on af Bent Tho...
 
Using Pony for Fintech
Using Pony for FintechUsing Pony for Fintech
Using Pony for Fintech
 
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
Considerations for Abstracting Complexities of a Real-Time ML Platform, Zhenz...
 

Recently uploaded

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 

Declarative Multilingual Information Extraction with SystemT

  • 1. Declarative Multilingual Information Extraction with SystemT Alan Akbik and Laura Chiticariu IBM Research | Almaden Institute of Computer Science (ICSI), Berkeley, California 11/17/2016
  • 2. Outline • Use casesUse casesUse casesUse cases • RequirementsRequirementsRequirementsRequirements • SystemTSystemTSystemTSystemT • SystemT ClassesSystemT ClassesSystemT ClassesSystemT Classes • Multilingual SRLMultilingual SRLMultilingual SRLMultilingual SRL 2
  • 3. Case Study 1: Social Data Analysis Text Analytics Product catalog, Customer Master Data, … Social Media • Products Interests360o Profile • Relationships • Personal Attributes • Life Events Statistical Analysis, Report Gen. Bank 6 23% Bank 5 20% Bank 4 21% Bank 3 5% Bank 2 15% Bank 1 15% ScaleScaleScaleScale 450M+ tweets a day, 100M+ consumers, … BreadthBreadthBreadthBreadth Buzz, intent, sentiment, life events, personal atts, … ComplexityComplexityComplexityComplexity Sarcasm, wishful thinking, … Customer 360º 3
  • 4. Case Study 2: Machine Analysis Web Server Logs CustomCustom Application Logs Database Logs Raw Logs Parsed Log Records Parse Extract Linkage Information End-to-End Application Sessions Integrate Graphical Models (Anomaly detection) Analyze 5 min 0.85 CauseCause EffectEffect DelayDelay CorrelationCorrelation 10 min 0.62 Correlation Analysis • Every customer has unique components with unique log record formats • Need to quickly customize all stages of analysis for these custom log data sources 4
  • 5. Outline • Use casesUse casesUse casesUse cases • RequirementsRequirementsRequirementsRequirements • SystemTSystemTSystemTSystemT • SystemT ClassesSystemT ClassesSystemT ClassesSystemT Classes • Multilingual SRLMultilingual SRLMultilingual SRLMultilingual SRL 5
  • 6. Enterprise Requirements • ExpressivityExpressivityExpressivityExpressivity – Need to express complex algorithms for a variety of tasks and data sources • ScalabilityScalabilityScalabilityScalability – Large number of documents and large-size documents • TransparencyTransparencyTransparencyTransparency – Every customer’s data and problems are unique in some way – Need to easily comprehendcomprehendcomprehendcomprehend, debugdebugdebugdebug and enhanceenhanceenhanceenhance extractors 6
  • 7. Expressivity Example: Different Kinds of Parses We are raising our tablet forecast. S are NP We S raising NP forecastNP tablet DET our subj obj subj pred Natural Language Machine Log Dependency Tree Oct 1 04:12:24 9.1.1.3 41865: %PLATFORM_ENV-1-DUAL_PWR: Faulty internal power supply B detected Time Oct 1 04:12:24 Host 9.1.1.3 Process 41865 Category %PLATFORM_ENV-1- DUAL_PWR Message Faulty internal power supply B detected 7
  • 8. 88 Expressivity Example: Fact Extraction (Tables) Singapore 2012 Annual Report (136 pages PDF) Identify note breaking down Operating expenses line item, and extract opex components Identify line item for Operating expenses from Income statement (financial table in pdf document)
  • 9. • Social Media: Twitter alone has 500M+ messages / day; 1TB+ per day • Financial Data: SEC alone has 20M+ filings, several TBs of data, with documents range from few KBs to few MBs • Machine Data: One application server under moderate load at medium logging level 1GB of logs per day Scalability Example Scalability Example 9
  • 10. Transparency Example: Need to incorporate in-situ data Intel's 2013 capex is elevated at 23% of sales, above average of 16%Intel's 2013 capex is elevated at 23% of sales, above average of 16% FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury.FHLMC reported $4.4bn net loss and requested $6bn in capital from Treasury. I'm still hearing from clients that Merrill's website is better.I'm still hearing from clients that Merrill's website is better. Customer or competitor? Good or bad? Entity of interest I need to go back to Walmart, Toys R Us has the sameI need to go back to Walmart, Toys R Us has the same toy $10 cheaper! 10
  • 11. Outline • Use casesUse casesUse casesUse cases • RequirementsRequirementsRequirementsRequirements • SystemTSystemTSystemTSystemT • SystemT ClassesSystemT ClassesSystemT ClassesSystemT Classes • Multilingual SRLMultilingual SRLMultilingual SRLMultilingual SRL 11
  • 12. AQL Language Optimizer Operator Runtime Specify extractor semantics declaratively Choose efficient execution plan that implements semantics Example AQL Extractor Fundamental Results & TheoremsFundamental Results & TheoremsFundamental Results & TheoremsFundamental Results & Theorems • Expressivity:Expressivity:Expressivity:Expressivity: The class of extraction tasks expressible in AQL is a strict superset of that expressible through cascaded regular automata. • Performance:Performance:Performance:Performance: For any acyclic token-based finite state transducer T, there exists an operator graph G such that evaluating T and G has the same computational complexity. create view PersonPhone as select P.name as person, N.number as phone from Person P, PhoneNumber N, Sentence S where Follows(P.name, N.number, 0, 30) and Contains(S.sentence, P.name) and Contains(S.sentence, N.number) and ContainsRegex(/b(phone|at)b/, SpanBetween(P.name, N.number)); Within asinglesentence <Person> <PhoneNum> 0-30chars Contains“phone” or “at” Within asinglesentence <Person> <PhoneNum> 0-30chars Contains“phone” or “at” SystemTSystemTSystemTSystemT: IBM Research project started in 2006: IBM Research project started in 2006: IBM Research project started in 2006: IBM Research project started in 2006 More than 35 papers across ACL, EMNLP, VLDB, PODS and SIGMOD SystemT Architecture 12
  • 13. Expressivity: Rich Set of Operators Deep Syntactic Parsing ML Training & Scoring Core Operators Tokenization Parts of Speech Dictionaries Semantic Role Labels Language to express NLP Algorithms AQL .Regular Expressions HTML/XML Detagging Span Operations Relational Operations Aggregation Operations 13
  • 14. package com.ibm.avatar.algebra.util.sentence; import java.io.BufferedWriter; import java.util.ArrayList; import java.util.HashSet; import java.util.regex.Matcher; public class SentenceChunker { private Matcher sentenceEndingMatcher = null; public static BufferedWriter sentenceBufferedWriter = null; private HashSet<String> abbreviations = new HashSet<String> (); public SentenceChunker () { } /** Constructor that takes in the abbreviations directly. */ public SentenceChunker (String[] abbreviations) { // Generate the abbreviations directly. for (String abbr : abbreviations) { this.abbreviations.add (abbr); } } /** * @param doc the document text to be analyzed * @return true if the document contains at least one sentence boundary */ public boolean containsSentenceBoundary (String doc) { String origDoc = doc; /* * Based on getSentenceOffsetArrayList() */ // String origDoc = doc; // int dotpos, quepos, exclpos, newlinepos; int boundary; int currentOffset = 0; do { /* Get the next tentative boundary for the sentenceString */ setDocumentForObtainingBoundaries (doc); boundary = getNextCandidateBoundary (); if (boundary != -1) {doc.substring (0, boundary + 1); String remainder = doc.substring (boundary + 1); String candidate = /* * Looks at the last character of the String. If this last * character is part of an abbreviation (as detected by * REGEX) then the sentenceString is not a fullSentence and * "false” is returned */ // while (!(isFullSentence(candidate) && // doesNotBeginWithCaps(remainder))) { while (!(doesNotBeginWithPunctuation (remainder) && isFullSentence (candidate))) { /* Get the next tentative boundary for the sentenceString */ int nextBoundary = getNextCandidateBoundary (); if (nextBoundary == -1) { break; } boundary = nextBoundary; candidate = doc.substring (0, boundary + 1); remainder = doc.substring (boundary + 1); } if (candidate.length () > 0) { // sentences.addElement(candidate.trim().replaceAll("n", " // ")); // sentenceArrayList.add(new Integer(currentOffset + boundary // + 1)); // currentOffset += boundary + 1; // Found a sentence boundary. If the boundary is the last // character in the string, we don't consider it to be // contained within the string. int baseOffset = currentOffset + boundary + 1; if (baseOffset < origDoc.length ()) { // System.err.printf("Sentence ends at %d of %dn", // baseOffset, origDoc.length()); return true; } else { return false; } } // origDoc.substring(0,currentOffset)); // doc = doc.substring(boundary + 1); doc = remainder; } } while (boundary != -1); // If we get here, didn't find any boundaries. return false; } public ArrayList<Integer> getSentenceOffsetArrayList (String doc) { ArrayList<Integer> sentenceArrayList = new ArrayList<Integer> (); // String origDoc = doc; // int dotpos, quepos, exclpos, newlinepos; int boundary; int currentOffset = 0; sentenceArrayList.add (new Integer (0)); do { /* Get the next tentative boundary for the sentenceString */ setDocumentForObtainingBoundaries (doc); boundary = getNextCandidateBoundary (); if (boundary != -1) { String candidate = doc.substring (0, boundary + 1); String remainder = doc.substring (boundary + 1); /* * Looks at the last character of the String. If this last character * is part of an abbreviation (as detected by REGEX) then the * sentenceString is not a fullSentence and "false" is returned */ // while (!(isFullSentence(candidate) && // doesNotBeginWithCaps(remainder))) { while (!(doesNotBeginWithPunctuation (remainder) && isFullSentence (candidate))) { /* Get the next tentative boundary for the sentenceString */ int nextBoundary = getNextCandidateBoundary (); if (nextBoundary == -1) { break; } boundary = nextBoundary; candidate = doc.substring (0, boundary + 1); remainder = doc.substring (boundary + 1); } if (candidate.length () > 0) { sentenceArrayList.add (new Integer (currentOffset + boundary + 1)); currentOffset += boundary + 1; } // origDoc.substring(0,currentOffset)); doc = remainder; } } while (boundary != -1); if (doc.length () > 0) { sentenceArrayList.add (new Integer (currentOffset + doc.length ())); } sentenceArrayList.trimToSize (); return sentenceArrayList; } private void setDocumentForObtainingBoundaries (String doc) { sentenceEndingMatcher = SentenceConstants. sentenceEndingPattern.matcher (doc); } private int getNextCandidateBoundary () { if (sentenceEndingMatcher.find ()) { return sentenceEndingMatcher.start (); } else return -1; } private boolean doesNotBeginWithPunctuation (String remainder) { Matcher m = SentenceConstants.punctuationPattern.matcher (remainder); return (!m.find ()); } private String getLastWord (String cand) { Matcher lastWordMatcher = SentenceConstants.lastWordPattern.matcher (cand); if (lastWordMatcher.find ()) { return lastWordMatcher.group (); } else { return ""; } } /* * Looks at the last character of the String. If this last character is * par of an abbreviation (as detected by REGEX) * then the sentenceString is not a fullSentence and "false" is returned */ private boolean isFullSentence (String cand) { // cand = cand.replaceAll("n", " "); cand = " " + cand; Matcher validSentenceBoundaryMatcher = SentenceConstants.validSentenceBoundaryPattern.matcher (cand); if (validSentenceBoundaryMatcher.find ()) return true; Matcher abbrevMatcher = SentenceConstants.abbrevPattern.matcher (cand); if (abbrevMatcher.find ()) { return false; // Means it ends with an abbreviation } else { // Check if the last word of the sentenceString has an entry in the // abbreviations dictionary (like Mr etc.) String lastword = getLastWord (cand); if (abbreviations.contains (lastword)) { return false; } } return true; } } Java Implementation of Sentence Boundary Detection Expressivity: AQL vs. Custom Java Code create dictionary AbbrevDict from file 'abbreviation.dict’; create view SentenceBoundary as select R.match as boundary from ( extract regex /(([.?!]+s)|(ns*n))/ on D.text as match from Document D ) R where Not(ContainsDict('AbbrevDict', CombineSpans(LeftContextTok(R.match, 1),R.match))); Equivalent AQL Implementation 14
  • 15. Transparency in SystemT PersonPhone • Clear and simple provenanceprovenanceprovenanceprovenance Person PhonePerson Anna at James St. office (555-5555) . [Liu et al., VLDB’10] Ease of comprehension Ease of debugging Ease of enhancement • Explains output data in terms of input data, intermediate data, and the transformation • For predicate-based rule languages (e.g., SQL), can be computed automatically create view PersonPhone as select P.name as person, N.number as phone from Person P, PhoneNumber N where Follows(P.name, N.number, 0, 30); personpersonpersonperson phonephonephonephone Anna 555-5555 James 555-5555 Person namenamenamename Anna James PhoneNumber numbernumbernumbernumber 555-5555t1 t2 t3 t1 ∧ t3 t2 ∧ t3 15
  • 16. Machine Learning in SystemT • AQL provides a foundation of transparency • Next step: Add machine learning without losing transparency • Major machine learning efforts: – Embeddable Models in AQL – Learning using AQL as target language 16
  • 17. Machine Learning in SystemT • AQL provides a foundation of transparency • Next step: Add machine learning without losing transparency • Major machine learning efforts – Embeddable Models in AQL • Model Training and Scoring integration with SystemML • Semantic Role LabelingSemantic Role LabelingSemantic Role LabelingSemantic Role Labeling [Akbik et al., ACL’15,’16, EMNLP’16, COLING’16] • Text Normalization [Zhang et al. ACL’13, Li & Baldwin, NAACL ‘15] – Learning using AQL as target language 17
  • 18. Machine Learning in SystemT • AQL provides a foundation of transparency • Next step: Add machine learning without losing transparency • Major machine learning efforts – Embeddable Models in AQL – Learning using AQL as target language • Low-level features: regular expressions [Li et al., EMNLP’08], dictionaries [Li et al., CIKM’11, Roy et al., SIGMOD’13] • Rule induction [Nagesh et al., EMNLP’12] • Rule refinement [Liu et al., VLDB’10] • Ongoing research 18
  • 19. Web Tools Overview Ease of Programming Ease of Sharing 19
  • 20. Outline • Use casesUse casesUse casesUse cases • RequirementsRequirementsRequirementsRequirements • SystemTSystemTSystemTSystemT • SystemT ClassesSystemT ClassesSystemT ClassesSystemT Classes • Multilingual SRLMultilingual SRLMultilingual SRLMultilingual SRL 20
  • 21. SystemT in Universities 2013 2015 2016 2017 • UC Santa Cruz (full Graduate class) 2014 • U. Washington (Grad) • U. Oregon (Undergrad) • U. Aalborg, Denmark (Grad) • UIUC (Grad) • U. Maryland Baltimore County (Undergrad) • UC Irvine (Grad) • NYU Abu-Dhabi (Undergrad) • U. Washington (Grad) • U. Oregon (Undergrad) • U. Maryland Baltimore County (Undergrad) • • UC Santa Cruz, 3 lectures in one Grad class SystemT MOOC 21
  • 22. AlonAlonAlonAlon Halevy, November 2015Halevy, November 2015Halevy, November 2015Halevy, November 2015 "The tutorial of System T was an important hands-on component of a Ph.D course on Structured Data on the Web at the University of Aalborg in Denmark taught by Alon Halevy. The tutorial did a great job giving the students the feeling for the challenges involved in extracting structured data from text.“ ShimeiShimeiShimeiShimei Pan, University of Maryland, Baltimore County, May 2016Pan, University of Maryland, Baltimore County, May 2016Pan, University of Maryland, Baltimore County, May 2016Pan, University of Maryland, Baltimore County, May 2016 "The Information Extraction with SystemT class, which was offered for the first time at UMBC, has been a wonderful experience for me and my students. I really enjoyed teaching this course. Students were also very enthusiastic. Based on the feedback from my students, they have learned a lot. Some of my students even want me to offer this class every semester.“ SystemT in Universities: Testimonials 22
  • 23. • Professors like teaching it • Students do interesting projects and have fun! • Identify character relationships in Games of Thrones • Determine the cause and treatment for PTSD patients • Extract job experiences and skills from resumes • … SystemT in Universities: Summary 23
  • 24. SystemT MOOC • Bring SystemT to 100,000+ studentsBring SystemT to 100,000+ studentsBring SystemT to 100,000+ studentsBring SystemT to 100,000+ students • Target Audience:Target Audience:Target Audience:Target Audience: NLP Engineer, Data Scientist e.g., University students (grad, undergrad), industry professionals • Learning Path with two classesLearning Path with two classesLearning Path with two classesLearning Path with two classes TA0105: Text Analytics: Getting Results with SystemT – Basics of IE with SystemT, learning to write extractors using the web tool, learn AQL – 5 lectures + 1 hands-on lab + final exam TA0106EN: Advanced Text Analytics: Getting Results with SystemT – Advanced material on declarative approach and SystemT optimization – 2 lectures + final exam • Bring SystemT to 100,000+ studentsBring SystemT to 100,000+ studentsBring SystemT to 100,000+ studentsBring SystemT to 100,000+ students • Target Audience:Target Audience:Target Audience:Target Audience: NLP Engineer, Data Scientist e.g., University students (grad, undergrad), industry professionals • Learning Path with two classesLearning Path with two classesLearning Path with two classesLearning Path with two classes TA0105: Text Analytics: Getting Results with SystemT – Basics of IE with SystemT, learning to write extractors using the web tool, learn AQL – 5 lectures + 1 hands-on lab + final exam TA0106EN: Advanced Text Analytics: Getting Results with SystemT – Advanced material on declarative approach and SystemT optimization – 2 lectures + final exam 24
  • 25. Outline • Use casesUse casesUse casesUse cases • RequirementsRequirementsRequirementsRequirements • SystemTSystemTSystemTSystemT • SystemT ClassesSystemT ClassesSystemT ClassesSystemT Classes • Multilingual SRLMultilingual SRLMultilingual SRLMultilingual SRL 25
  • 26. • 7,102 known languages • 23 most spoken language > 4.1 billion people https://walizahid.com/2015/06/the-worlds-most-spoken-languages/ • Our goal: Enable Text Analytics over Multilingual Text The World’s Most Spoken Languages 26
  • 27. ProblemProblemProblemProblem: High costs to handle multilingual data English NLP English Text Analytics German NLP German Text Analytics Current StatusCurrent StatusCurrent StatusCurrent Status: - One language at a time; - Query language and data must be in the same language SeparateSeparateSeparateSeparate parsers / features for each language French NLP French Text Analytics Chinese NLP Chinese Text Analytics SeparateSeparateSeparateSeparate text analyics must be developed EnglishEnglishEnglishEnglish Text Data ChineseChineseChineseChinese Text Data FrenchFrenchFrenchFrench Text Data GermanGermanGermanGerman Text Data CrosslingualCrosslingualCrosslingualCrosslingual Text AnalyticsText AnalyticsText AnalyticsText Analytics: Parse and develop against unified abstraction; Unified NLP Crosslingual TextAnalytics Semantic parserSemantic parserSemantic parserSemantic parser • arbitrary languages • unified feature setunified feature setunified feature setunified feature set Universal extractorsUniversal extractorsUniversal extractorsUniversal extractors • Develop onceonceonceonce • Run for all languages Proposed Approach: Crosslingual Text Analytics 27
  • 28. SRL focuses on semantics, not syntax Useful for many applications (Shen and Lapata, 2007; Maqsud et al., 2014) FrameFrameFrameFrame semanticssemanticssemanticssemantics crosslinguallycrosslinguallycrosslinguallycrosslingually valid?valid?valid?valid? Dirk broke the window with a hammer. Break.01A0 A1 A2 The window was broken by Dirk. The window broke. A1 Break.01 A0 A1 Break.01 Break.01 A0 – Breaker A1 – Thing broken A2 – Instrument A3 – Pieces Break.15 A0 – Journalist, exposer A1 – Story, thing exposed • Problem: What type of labels are valid across languages? • Lexical, morphological and syntactic labels differ greatly • But shallow semantic labelsshallow semantic labelsshallow semantic labelsshallow semantic labels remain stable Semantic Role Labeling 28
  • 29. The station bought Cosby reruns for record prices Buy.01A0 spurred dollar buying by Japanese institutions. Spur.01 A1 Buy.01 A0 The company payed 2 million US$ to buy . Corpus of annotated text dataFrame set A1 A3 Buy.01 A0 – Buyer A1 – Thing bought A2 – Seller A3 – Price paid A4 – Benefactive Pay.01 A0 – Payer A1 – Money A2 – Being payed A3 – Commodity Buy.01Pay.01A0 A1 • English SRL – FrameNet – PropBank SRL Resources 29
  • 30. • English SRL – FrameNet – PropBank 30 1. Limited coverage 2. Language-specific formalisms • Other languages – Chinese Proposition Bank – Hindi Proposition Bank – German FrameNet – French? Spanish? Russian? Arabic? … 订购 A0: buyer A1: commodity A2: seller order.02 A0: orderer A1: thing ordered A2: benefactive, ordered-for A3: source SRL for other languages? We want different languages to share the same semantic labelsWe want different languages to share the same semantic labelsWe want different languages to share the same semantic labelsWe want different languages to share the same semantic labels
  • 31. WhatsApp was bought by Facebook Facebook hat WhatsApp gekauft Facebook a achété WhatsApp buy.01 Facebook WhatsApp Crosslingual representationMultilingual input text Buy.01 A0A1 Buy.01A1A0 Buy.01A0 A1 Shared Frames Across Languages 31
  • 32. • 1. Generation of Proposition Banks for Arbitrary Languages • 2. Development of a SotA Semantic Role Labeler • 3. Application to Use Cases Research 32
  • 33. Our Goal • Generate SRL resources for many other languages • Shared frame set • Minimal effort Frame set Buy.01 A0 – Buyer A1 – Thing bought A2 – Seller A3 – Price paid A4 – Benefactive Pay.01 A0 – Payer A1 – Money A2 – Being payed A3 – Commodity Il faut qu‘ il y ait des responsables Need.01A0 Je suis responsable pour le chaos Be.01A1 A2 AM-PRD Les services postaux ont achété des Be.01 A2A1 Buy.01A0 Corpus of annotated text data 33
  • 34. Example: TV subtitles Generating Universal Proposition Banks Current practice: Annotator training (months) Annotation (years) Must be repeated for each language! Idea: Annotation projection in parallel corpora Das würde ich für einen Dollar kaufen German subtitles I would buy that for a dollar! English subtitles PRICEBUYER ITEM BUYERITEM I would buy that for a dollar! PRICE projection Das würde ich für einen Dollar kaufenDas würde ich für einen Dollar kaufen 34
  • 35. However: Projections Not Always Possible We need to hold people responsible Il faut qu‘ il y ait des responsables English sentence: Target sentence: Hold.01 A0 A1 A3 Need.01 Hold.01 Incorrect projection! There need to be those responsible A1 Main error sources:Main error sources:Main error sources:Main error sources: • Translation shift • Source-language SRL errors 35
  • 36. • Two-step process of filtering and bootstrapping (ACL ‘15) – Filters to detect translation shift, block projections (more precision at cost of recall) – Bootstrap learning to increase recall – Generate 7 universal proposition banks7 universal proposition banks7 universal proposition banks7 universal proposition banks from 3 language groups Improving Projection Quality 36
  • 37. • Online demo for SRL in 9 languages (ACL ‘16) Train Multilingual SRL 37
  • 38. Low-resource Languages & Crowdsourcing • Apply approach to low-resource languages (EMNLP ‘16) – Bengali, Malayalam, Tamil – Fewer sources of parallel data – Almost no NLP: No syntactic parsing, lemmatization etc. • Crowdsourcing for data curation 38
  • 39. Multilingual Aliasing • Expert curation of frame mappings (COLING ‘16) • ProblemProblemProblemProblem: Target language frame lexicon automatically generated from alignments – False frames – Redundant frames • Two-step expert curation approach – Step 1: Filter – Step 2: Merge • Example: drehen – “to turn” (rotate) – “to film” 39
  • 40. Original drehen. 04 SHOOT.03 record on film A0: videographer A1: subject filmed A2: medium drehen. 01 TURN.01 rotation A0: turner A1: thing turning Am: direction drehen. 02 TURN.02 transformation A0: causer of transformation A1: thing changing A2: end state A3: start state record on film A0: recorder A1: thing filmed drehen. 03 FILM.01 incorrect frame, filtered Merged synonyms, merged Predicate: drehendrehendrehendrehen Filtered drehen. 01 TURN.01 rotation A0: turner A1: thing turning Am: direction record on film A0: recorder A1: thing filmed drehen. 03 FILM.01 drehen. 01 TURN.01 rotation A0: turner A1: thing turning Am: direction drehen. 03 FILM.01 SHOOT.03 record on film A0: recorder A1: thing filmed record on film A0: videographer A1: subject filmed A2: medium drehen. 04 SHOOT.03 Multilingual Aliasing: Example 40
  • 41. Multilingual Aliasing • Expert curation of frame mappings (COLING ‘16) – 60 person hours per language – Chinese, French and German – More salient senses, improved quality 41
  • 42. Automatic Generation of Proposition Banks • Project English PropBank frame and role annotation onto target language sentences • A variety of strategies to increase quality: – Filtering of translation shift – Crowdsourced data curation – Expert frame alignments • Result: Generated universal proposition banks for 8universal proposition banks for 8universal proposition banks for 8universal proposition banks for 8 languageslanguageslanguageslanguages • Currently preparing open source releaseCurrently preparing open source releaseCurrently preparing open source releaseCurrently preparing open source release 42
  • 43. • 1. Generation of Proposition Banks for Arbitrary Languages • 2. Development of a SotA Semantic Role Labeler • 3. Application to Use Cases Research 43
  • 44. • ProblemProblemProblemProblem: Current SRL error-prone • Strong local bias, many lowStrong local bias, many lowStrong local bias, many lowStrong local bias, many low----frequency exceptions, difficult to generalizefrequency exceptions, difficult to generalizefrequency exceptions, difficult to generalizefrequency exceptions, difficult to generalize Instance-based Semantic Role Labeling 44
  • 45. • ProblemProblemProblemProblem: Current SRL error-prone • Strong local biasStrong local biasStrong local biasStrong local bias • Many lowMany lowMany lowMany low----frequency exceptions, difficult to generalizefrequency exceptions, difficult to generalizefrequency exceptions, difficult to generalizefrequency exceptions, difficult to generalize Instance-based Semantic Role Labeling • K-SRL: Instance-based learning – K-nearest neighbors • Composite features • Significantly outperform all other current systems In-domain Out-of-domain 45
  • 46. • 1. Generation of Proposition Banks for Arbitrary Languages • 2. Development of a SotA Semantic Role Labeler • 3. Application to Use Cases Research 46
  • 48. Thank you! • For more information… – Visit our website:Visit our website:Visit our website:Visit our website: http://ibm.co/1Cdm1Mj – Visit BigInsights Knowledge Center:Visit BigInsights Knowledge Center:Visit BigInsights Knowledge Center:Visit BigInsights Knowledge Center: http://ibm.co/1DIouEvhttp://ibm.co/1DIouEvhttp://ibm.co/1DIouEvhttp://ibm.co/1DIouEv – Learning at your own pace:Learning at your own pace:Learning at your own pace:Learning at your own pace: SystemT MOOCSystemT MOOCSystemT MOOCSystemT MOOC – Contact usContact usContact usContact us • akbika@us.ibm.com • chiti@us.ibm.com SystemT Team: Alan Akbik, Laura Chiticariu, Marina Danilevsky, Howard Ho, Rajasekar Krishnamurthy, Yunyao Li, Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan, Huaiyu Zhu 48
  • 49. Declarative AQL language Development Environment AQL Extractor create view ProductMention as select ... from ... where ... create view IntentToBuy as select ... from ... where ... Cost-based optimization ... SystemT: Declarative Information Extraction Discovery tools for AQL development SystemT RuntimeSystemT Runtime Input Documents Extracted Objects Challenge: Building extractors for enterprise applications requires an information extraction system that is expressive, efficient, transparent and usable. Existing solutions are either rule-based solutions based on cascading grammar with expressivity and efficiency issues, or black-box solutions based on machine learning with lack of transparency. Our Solution: A declarative information extraction system with cost-based optimization, high-performance runtime and novel development tooling based on solid theoretical foundation [PODS’13, PODS’14], shipping with over 10+ IBM products. AQL: a declarative language that can be used to build extractors outperforming the state-of-the-arts [ACL’10] Multilingual SRL-enabled: [ACL’15, ACL’16, EMNLP’16, COLING’16] A suite of novel development tooling leveraging machine learning and HCI [EMNLP’08, VLDB’10, ACL’11, CIKM’11, ACL’12, EMNLP’12, CHI’13, SIGMOD’13, ACL’13,VLDB’15, NAACL’15] Cost-based optimization for text-centric operations [ICDE’08, ICDE’11, FPL’13, FPL’14] Highly embeddable runtime with high-throughput and small memory footprint. [SIGMOD Record’09, SIGMOD’09] For details and Online Class visit: https://ibm.biz/BdF4GQ 49