SlideShare a Scribd company logo
Speeding up Information
Extraction Programs
Helena Galhardas
INESC-ID and IST, Universidade de Lisboa
Who am I?
•  Assistant Professor at IST, University of Lisbon (http://
www.ist.utl.pt)
•  Researcher at INESC-ID (http://www.inesc-id.pt)
•  Former (2000) PhD student at INRIA Rocquencourt/Univ.
Versailles
•  Main research interests:
–  Data Cleaning
•  Support for User Involvement in Data Cleaning, DaWaK 2011
•  Automatic optimization of the match operation (similarity join)
–  Medical Information Systems
•  A Machine Learning based Natural Language Interface for a
Database of Medicines, poster paper DILS 2014
•  ProntApp: A Mobile Question Answering System for Medicines
–  Information Extraction
4/3/15 2 2	
  
Who am I?
•  Assistant Professor at IST, University of Lisbon (http://
www.ist.utl.pt)
•  Researcher at INESC-ID (http://www.inesc-id.pt)
•  Former (2000) PhD student at INRIA Rocquencourt/Univ.
Versailles
•  Main research interests:
–  Data Cleaning
•  Support for User Involvement in Data Cleaning, DaWaK 2011
•  Automatic optimization of the match operation (similarity join)
–  Medical Information Systems
•  A Machine Learning based Natural Language Interface for a
Database of Medicines, poster paper DILS 2014
•  ProntApp: A Mobile Question Answering System for Medicines
Ø  Information Extraction
4/3/15 3 3	
  
Agenda
•  A Holistic Optimizer for Speeding up Information
Extraction Programs
–  PVLDB’13 and presented at VLDB’14
–  Joint work with Gonçalo Simões, INESC-ID and IST/Univ.
Lisboa and Luis Gravano, Columbia University
•  A Learning-based Approach to Rank Documents
(briefly)
–  EDBT’15
–  Joint work with Pablo Barrio, Columbia University, Gonçalo
Simões and Luis Gravano
4	
  
Information Extraction Discovers
Structured Information In Text Documents
The	
  erup+on	
  of	
  	
  Grímsvötn	
  	
  during	
  22-­‐25	
  
May	
  2011	
  brought	
  back	
  the	
  memories	
  of	
  
the	
  erup+ons	
  of	
  EyjaEallajökull	
  in	
  2010.	
  
Natural	
  Disaster	
  
erup+on	
  of	
  	
  Grímsvötn	
  
erup+ons	
  of	
  EyjaEallajökull	
  
Temporal	
  Expression	
  
22-­‐25	
  May	
  2011	
  
2010	
  
Temporal	
  Expression	
  
22-­‐25	
  May	
  2011	
  
Natural	
  Disaster	
   Temporal	
  Expression	
  
erup+on	
  of	
  	
  Grímsvötn	
   22-­‐25	
  May	
  2011	
  
Extract	
  
Occurs-­‐In	
  
Extract	
  
Natural	
  Disaster	
  
Select	
  
Temporal	
  Expressions	
  
From	
  2011	
  
Extract	
  
Temporal	
  Expressions	
  	
  
Much	
  richer	
  querying	
  
and	
  analysis	
  
5	
  
IE	
  SpecificaCon	
  
Naïve	
  ExecuCon	
  Plan	
  
Information Extraction Specification
and Execution Plans are Graphs of
Operators
Type	
  of	
  Operator:	
  
EE:	
  En+ty	
  Extractor	
  
RE:	
  Rela+on	
  Extractor	
  
Technique	
  
Algorithm	
  
Information Extraction is Challenging
and Time-consuming
•  Operates on a large set of features
•  Relies on complex text analysis
•  Often applied over large document collections
erupCon	
  
brought	
  
memories	
  
erupCons	
  
erupCon	
  
bring	
  
memory	
  
erupCon	
  
Grímsvötn	
  
22-­‐25	
  
brought	
  
2010	
  
Cc+	
  
N+.N+	
  
c+	
  
N+	
  
Ccccccccc	
  
NN.NN	
  
ccccccc	
  
NNNN	
  
The	
   erupCon	
   brought	
   back	
   the	
   memories	
   the	
   erupCons	
   2010	
  
det	
   nsubj	
   prt	
   det	
  
dobj	
   prep_of	
  
prep_of	
  det	
  
LemmaCzaCon	
  Features	
   Character	
  CapitalizaCon	
  Features	
  
7	
  
IE Implementation Choices to the Rescue
•  These implementation choices have an impact on execution time and quality of results
(recall and precision)
–  Recall: fraction of correct tuples that the plan produces
–  Precision: fraction of produced tuples that are correct
D
A	
  
C
B
D1	
  
A1	
  
A2	
  
A3	
  
C1	
  
B1	
  
B2	
  
Choosing	
  an	
  Algorithm	
  for	
  Each	
  Operator	
   Choosing	
  Operator	
  ExecuCon	
  Order	
  
A	
  
D
C
B
A	
  
D
C
B
A	
  
D
C
B
A2	
  
D1	
  
C1	
  
B1	
  
ExecuCon	
  Plan	
  
Query(210K	
  docs)	
  
EECRF	
  
EECRF	
  
	
  	
  	
  	
  	
  	
  Viterbi	
  
EECRF	
  
	
  	
  VBS-­‐10	
  
EECRF	
  
	
  	
  VBS-­‐5	
  
RESVM	
   RESVM	
  
	
  	
  	
  	
  	
  	
  All	
  SVM	
  
Choosing	
  Document	
  Retrieval	
  Strategy	
  
Query	
  for	
  
Relevant	
  Docs	
  
Scan	
  whole	
  
CollecCon	
  
1.	
  Execute	
  A	
  
2.	
  Find	
  sentences	
  that	
  
produced	
  results	
  for	
  A.	
  
3.	
  Execute	
  B	
  with	
  those	
  
sentences	
  only	
  
Query	
  for:	
  
[“natural	
  disaster”],	
  
[earthquake],	
  
[erup+on],	
  
[tornado],	
  (…)	
  
8	
  
State-of-the-art IE Optimization Systems
D
A	
  
C
B
D1	
  
A1	
  
A2	
  
A3	
  
C1	
  
B1	
  
B2	
  
Choosing	
  an	
  Algorithm	
  for	
  Each	
  Operator	
   Choosing	
  Operator	
  ExecuCon	
  Order	
  
A	
  
D
C
B
A	
  
D
C
B
A	
  
D
C
B
SystemT	
   CIMPLE	
  
¡  Both use a cost-based optimizer to choose the execution order of operators
¡  System T uses heuristics to choose among different algorithms for an operator
¡  Do not allow approximate results
Choosing	
  Document	
  Retrieval	
  Strategy	
  
Query	
  for	
  
Relevant	
  Docs	
  
Scan	
  whole	
  
CollecCon	
  
•  An information extraction optimizer should make
these implementation choices collectively!
D
A	
  
C
B
D1	
  
A1	
  
A2	
  
A3	
  
C1	
  
B1	
  
B2	
  
Choosing	
  an	
  Algorithm	
  for	
  Each	
  Operator	
   Choosing	
  Operator	
  ExecuCon	
  Order	
  
A	
  
D
C
B
A	
  
D
C
B
A	
  
D
C
B
SQoUT	
  
Choosing	
  Document	
  Retrieval	
  Strategy	
  
Query	
  for	
  
Relevant	
  Docs	
  
Scan	
  whole	
  
CollecCon	
  
10	
  
State-of-the-art IE Optimization Systems
Making implementation choices
collectively leads to faster plans
A	
  
D
C
B
A1	
  
D1	
  
C1	
  
B1	
  
Scan(1M	
  docs)	
  
Time:	
  
7h	
  30	
  m	
  
Recall:	
  
93%	
  
Precision:	
  
100%	
  
A3	
  
D1	
  
C1	
  
B2	
  
Scan(1M	
  docs)	
  
Time:	
  
9h	
  23	
  m	
  
Recall:	
  
91%	
  
Precision:	
  
95%	
  
A1	
  
D1	
  
C1	
  
B1	
  
Scan(1M	
  docs)	
  
Time:	
  
13h	
  16	
  m	
  
Recall:	
  
100%	
  
Precision:	
  
100%	
  
Naive	
  Plan	
  
A1	
  
D1	
  
C1	
  
B1	
  
Query(153K	
  docs)	
  
Time:	
  
2h	
  07	
  m	
  
Recall:	
  
50%	
  
Precision:	
  
100%	
  
•  By combining multiple choices we get a faster plan
•  We only control the quality results with collective
optimization
Algorithms	
  
OpCmizaCon	
  
Operator	
  Order	
  
OpCmizaCon	
  
Document	
  Retrieval	
  
OpCmizaCon	
  
A3	
  
D1	
  
C1	
  
B2	
  
Query(210K	
  docs)	
  
Time:	
  
1h	
  31	
  m	
  
Recall:	
  
50%	
  
Precision:	
  
96%	
  
CollecCvely	
  OpCmized	
  Plan	
  
A3	
  
D1	
  
C1	
  
B2	
  
Query(153K	
  docs)	
  
Time:	
  
57	
  m	
  
Recall:	
  
39%	
  
Precision:	
  
98%	
  
Direct	
  CombinaCon	
  Plan	
  
Hard	
  to	
  control	
  the	
  expected	
  
recall	
  and	
  precision	
  
Precision	
  >=	
  90%	
  
Recall	
  >=	
  50%	
  
Recall	
  and	
  Precision	
  
Constraints	
  
Holistic Optimizer for Information
Extraction
A2	
  
D1	
  
C1	
  
B1	
  
Query(210K	
  docs)	
  
Time:	
  
43	
  m	
  
Recall:	
  
50%	
  
Precision:	
  
95%	
  
Best	
  Plan	
  
Document	
  CollecCon	
  
A	
  
IE	
  SpecificaCon	
  
Precision	
  >=	
  90%	
  Recall	
  >=	
  50%	
  
Recall	
  and	
  Precision	
  Constraints	
  
A3	
  
D1	
  
C1	
  
B2	
  
A2	
  
D1	
  
C1	
  
B1	
  
A2	
  
D1	
  
C1	
  
B1	
  
Query (210K docs) Query (210K docs) Query (210K docs)
Candidate Plans Best Plan
Document	
  
Sample	
  
Predictor	
  
Parameters	
  
Recall	
  and	
  
Precision	
  
Predictor	
  
Time:
57 m
Recall:
39%
Precision:
78%
A2	
  
D1	
  
C1	
  
B1	
  
Time:
1h 24 m
Recall:
98%
Precision:
95%Query
A3	
  
D1	
  
C1	
  
B2	
  
Query
HolisCc	
  
OpCmizer	
  
(…)	
  
EnumeraCon	
  of	
  ExecuCon	
  Plans	
  
A2	
  
D1	
  
C1	
  
B1	
  
A3	
  
D1	
  
C1	
  
B2	
  
Predictor Parameters Estimation
•  Predictor parameters are determined from a
document sample
•  Three phases:
1.  Sampling to select the subset of input documents to
use in the estimation (stratified sampling)
2.  Extraction to retrieve tuples from the sample using
the plans produced by the enumeration step
(dynamic programming)
3.  Estimation to determine the optimizer parameters
using the results of 2) (Maximum-a-posteriori
estimation)
Document	
  Sample	
  
Predictor	
  Parameters	
  
Recall	
  and	
  Precision	
  
Predictor	
  
Extraction Quality Prediction
•  Based on two predictions:
– Number of input tuples of operator
– Number of output tuples of operator
A2	
  
D1	
  
C1	
  
(…)	
   (…)	
  
Path	
  1	
   Path	
  2	
  
D1	
  
t1	
  
t2	
  
t3	
  
t4	
  
(…)	
  
•	
   •	
   •	
  
•	
   •	
  
•	
   •	
   •	
  
•	
   •	
  
A	
   A2	
  A1	
   A3	
  
t1	
  
t2	
  
t3	
  
(…)	
  
t1	
  
t2	
   t1	
  
t1	
  
t3	
  
Experimental Validation
•  Impact of the Sample Size and Prediction
Quality
•  Impact of the Parameter Estimation
Strategy
•  Impact of Individual Implementation
Choices
•  Impact of Precision Constraints
•  Comparison With the State-of-the-Art
Techniques
15	
  
Experimental Validation
•  Impact of the Sample Size and Prediction
Quality
•  Impact of the Parameter Estimation
Strategy
Ø Impact of Individual Implementation
Choices
•  Impact of Precision Constraints
Ø Comparison With the State-of-the-Art
Techniques
16	
  
Experimental Setup
•  Datasets
•  Information Extraction Programs
Researchers’	
  Homepages	
  
Kedar	
  Bellare’	
  Homepage	
  
	
  
PhD	
  student	
  at	
  University	
  of	
  
Massachusegs,	
  Amherst	
  under	
  
supervision	
  of	
  Andrew	
  
McCallum.	
  
	
  
	
  
Publica+ons	
  
	
  
Generalized	
  Expecta+on	
  Criteria	
  
for	
  Bootstrapping	
  Extractors	
  
using	
  Record-­‐Text	
  Alignment,	
  
EMNLP	
  2009,	
  August	
  2009	
  
Advisor	
   Advisee	
  
Andrew	
  
McCallum	
  
Kedar	
  Bellare	
  
Conference	
   Date	
  
EMNLP	
  2009	
   August	
  2009	
  
Advises	
  
ConferencesDates	
  
The	
  Hai+	
  cholera	
  outbreak	
  
between	
  2010	
  and	
  2013	
  was	
  the	
  
worst	
  epidemic	
  of	
  cholera	
  in	
  
recent	
  history	
  
Google	
  co-­‐founders	
  Larry	
  Page	
  
and	
  Sergey	
  Brin	
  recently	
  sat	
  
down	
  with	
  billionaire	
  venture	
  
capitalist	
  Vinod	
  Khosla	
  for	
  a	
  
lengthy	
  interview.	
  
Disease	
   Time	
  Period	
  
Cholera	
   Between	
  2010	
  and	
  
2013	
  
DiseaseOutbreaks	
  
Person	
   OrganizaCon	
  
Larry	
  Page	
   Google	
  
Sergey	
  Brin	
   Google	
  
Organiza+onAffilia+on	
  
Sparse	
  RelaCons	
  
Dense	
  RelaCons	
   17	
  
Experimental Setup (cont.)
•  IE Optimization Systems (baselines):
–  Cimple
–  System T
–  SQoUT
–  SQoUt-Boosted: variation of SQoUT that uses the execution
plans produced by Cimple and System T
•  IE Optimization Techniques:
–  Optimized-Ret: implementation choice of retrieval strategy
–  Optimized-Alg: implementation choice of the algorithms
–  Optimized-Order: implementation choice of the operator execution
order
–  Holistic-Map: uses maximum a posteriori for estimating the
parameters
–  Holistic-ML: uses maximum likelihood for estimating the parameters
18	
  
Impact of Individual Implementation
Choices
•  Holistic-MAP obtains the fastest plan in every case
•  Optimized-Ret becomes too slow when desired recall
increases
•  Optimized-Order and Optimized-Alg consistently produce slow
plans
•  Holistic optimization outperforms individual optimization
ConferencesDates	
  
19	
  
Comparison with State-of-the-Art
Techniques
•  For dense relations, Holistic-MAP
–  Obtains faster plans
–  Closely matches recall constraints
OrganizaConAffiliaCon	
  
20	
  
Comparison with State-of-the-Art
Techniques
•  For sparse relations, Holistic-MAP
–  Obtains faster plans
–  Does not always matches recall constraints
–  Can correct the errors in recall constraints as the target recall
increases
DiseaseOutbreaks	
  
21	
  
Conclusions: Holistic Optimizer
•  IE programs can be optimized along multiple
interrelated dimensions, namely, the choice of:
–  Algorithms for each extraction operator
–  Operator execution order
–  Document retrieval strategy
•  We presented the first holistic optimization
approach for IE that collectively optimizes all
dimensions simultaneously
–  It outperforms the state-of-the-art techniques and
selects efficient plans that meet target recall and
precision values
22	
  
What Comes Next?
•  Ranking Models for IE (second part of the
talk)
•  Distributed Executions for IE (on-going work)
–  by determining document placement in
distributed file system
Random	
  
order	
  
Ranked	
  
order	
  
Agenda
•  A Holistic Optimizer for Speed up Information
Extraction Programs
–  PVLDB’13 and presented at VLDB’14
–  Joint work with Gonçalo Simões and Luis
Gravano
•  A learning-based Approach to Rank
Documents
–  EDBT’15
–  Joint work with Pablo Barrio, Gonçalo Simões
and Luis Gravano
24	
  
Reducing IE Processing Time
•  Small, topic-specific fraction of the entire
collection is useful
Ex: Only 2% of documents in a New York Times archive,
mostly environment-related, are useful for Natural Disaster-
Location with a state-of-the-art IE system
•  Useful documents share distinctive words
and phrases
Ex: “earthquake”, “storm,” “richter,” “volcano
eruption” for Natural Disaster-Location
•  Information extraction system labels
documents as useful or not, for free
Should focus extraction
over these documents
and ignore rest
Should learn to
distinguish useful
documents for an IE task
Could use the result of
an IE process to
generate an ever-
expanding training set
for learning and further
identifying useful
documents
Documents are useful if they produce output for a given IE task
Existing Approaches: Qxtract and
FactCrawl
•  Qxtract and FactCrawl learn from
a small document sample and so
exhibit far-from-perfect recall
•  FactCrawl re-ranks documents
using the quality of learned
queries
•  FactCrawl relies on queries
derived from small set of
documents and does not adapt to
new processed documents
Our Approach
•  Focus on the potentially useful documents for the extraction
task at hand
Ø  Learning to rank approach for document ranking
•  Results of extraction process form ever-expanding training set
Ø  Adaptive approach to update
document ranking continuously
Features:	
  Words	
  and	
  phrases,	
  and	
  agribute	
  values	
  
Document	
  
Collec+on	
  
f(di)	
  =	
  si	
  	
  	
   s1	
  ≥	
  s2	
  ≥	
  s3	
  ≥	
  ...	
  si	
  ≥	
  …	
  sn	
  
New	
  words:	
  
lava,	
  fissure	
  
Learning	
   Ranking	
  and	
  processing	
  
New	
  training	
  
instances	
  
Ranking Documents Adaptively for
IE
Learns	
  that	
  “tornado,”	
  “earthquake,”	
  
are	
  makers	
  of	
  useful	
  documents	
  
Learning	
  
Document	
  processing	
  and	
  update	
  
detec+on	
  	
  
Document	
  
Collec+on	
  
Useful	
  documents	
  but	
  on	
  
volcanoes,	
  not	
  yet	
  observed	
  
prominently	
  in	
  IE	
  process	
  
<tornado,	
  hawaii>	
  
“…	
  ‘Acermath’	
  narrates	
  the	
  story	
  of	
  
a	
  man	
  that	
  goes	
  missing…”	
  
“…	
  S+ll	
  recovering	
  from	
  an	
  earthquake,	
  
Chile	
  is	
  threatened	
  by	
  the	
  erupCon	
  of	
  
Copahue	
  volcano…”	
  
Online	
  relearning	
  
+	
  
Performs	
  online	
  learning	
  
New	
  informaCon	
  can	
  
poten+ally	
  help	
  improve	
  
ranking,	
  so	
  Update!	
  
<volcano,	
  chile>	
  
Learns	
  that	
  “volcano”	
  and	
  
“erup+on”	
  are	
  now	
  makers	
  of	
  
useful	
  documents	
  
Ranking	
  
adaptaCon	
  
Ranking Documents Adaptively for
IE: Key Ideas
•  To address efficiency, rely on online learning
–  Train the ranking model incrementally, one document at a time
•  To handle large (and expanding) feature sets, rely on in-training
feature selection
–  The learning-to-rank algorithm can efficiently identify the most
discriminative features during the training of the document ranking
model
•  We propose two learning-to-rank techniques for IE that
integrate online learning and in-training feature selection:
–  BAgg-IE
–  RSVM-IE
•  And two update detection techniques for document ranking
adaptation:
–  Top-K
–  Mod-C
Experimental Validation
1.  Impact of learning-to-rank approach
2.  Impact of sampling strategies
3.  Impact of adaptation
4.  Impact of update detection
5.  Scalability of our approach
6.  Comparison with the state-of-the-art
ranking strategies
30	
  
Experimental Validation
1.  Impact of learning-to-rank approach
2.  Impact of sampling strategies
3.  Impact of adaptation
4.  Impact of update detection
5.  Scalability of our approach
Ø Comparison with the state-of-the-art
ranking strategies
31	
  
Complex	
  extrac+on	
  systems:	
  	
  
CRFs,	
  SVM	
  kernels	
  	
  
Simple	
  extrac+on	
  systems:	
  	
  
HMMs,	
  text	
  pagerns	
  
•  Dataset:
•  Information extraction systems
1.8	
  million	
  ar+cles	
  from	
  1987-­‐2007	
  
Experimental Setup
The	
  Hai+	
  cholera	
  outbreak	
  between	
  2010	
  
and	
  2013	
  was	
  the	
  worst	
  epidemic	
  of	
  
cholera	
  in	
  recent	
  history	
  
Google	
  co-­‐founders	
  Larry	
  Page	
  and	
  
Sergey	
  Brin	
  recently	
  sat	
  down	
  with	
  
billionaire	
  venture	
  capitalist	
  Vinod	
  Khosla	
  
for	
  a	
  lengthy	
  interview.	
  
"This	
  is	
  not	
  a	
  vic+mless	
  crime,"	
  said	
  Jim	
  
Kendall,	
  president	
  of	
  the	
  Washington	
  
Associa+on	
  of	
  Internet	
  Service	
  Providers.	
  
A	
  fire	
  destroyed	
  a	
  Cargill	
  Meat	
  Solu+ons	
  beef	
  
processing	
  plant	
  in	
  Booneville.	
  
Disease	
   Time	
  Period	
  
cholera	
   between	
  2010	
  and	
  
2013	
  
Disease-­‐Outbreaks	
  
Person	
   OrganizaCon	
  
Larry	
  Page	
   Google	
  
Sergey	
  Brin	
   Google	
  
Person-­‐Organiza+on	
  	
  
Person	
   Career	
  
Jim	
  Kendall	
   President	
  
Person-­‐Career	
  	
  
Disaster	
   LocaCon	
  
fire	
   Booneville	
  
Man	
  Made	
  Disaster-­‐Loca+on	
  	
  
Other	
  rela+ons:	
  	
  
	
  	
  	
  	
  	
  Person-­‐Charge,	
  Elec+on-­‐Winner,	
  Natural	
  Disaster-­‐Loca+on	
  	
  
	
  
Dense	
  rela+ons	
   Sparse	
  rela+ons	
  
Recall Analysis
Update	
  detec+on:	
  
Mod-­‐C	
  
Our	
  adapCve	
  
implementa+on	
  of	
  
the	
  state	
  of	
  the	
  art	
  
Disease-­‐Outbreak	
  
•  Our techniques bring significant improvement
•  RSVM-IE performs best, as it prioritizes useful documents better, favoring
adaptation
Extraction Time
Person-­‐Organiza+on	
  Affilia+on	
  
•  Our techniques improve efficiency of process even for
inexpensive IE systems
Our	
  adapCve	
  
implementa+on	
  of	
  
the	
  state	
  of	
  the	
  art	
  
Conclusion: Document Ranking for
Scalable Information Extraction
•  Running IE system over large text
collections is computationally expensive
•  Proposed lightweight, adaptive approach
and learning-based alternatives
–  Online learning algorithms with in-
training feature selection: RSVM-IE,
Bagg-IE
–  Update detection based on feature
changes: Mod-C, Top-K
–  RSVM-IE + Mod-C performs best: Useful
documents are better prioritized, enabling
richer, more efficient ranking adaptation
Text	
  
Collec+on	
  
IE	
  system	
  
<tornado,	
  Florida>	
  
<volcano,	
  Chile>	
  
…	
  
Future Work: Ranking at Different
Granularities
•  Few collections on the Web
are relevant to an IE task
•  Prioritize them based on number of
useful documents
•  Few sentences in a text
document output tuples for an
IE task
•  Prioritize them based on usefulness
and diversity
Try REEL, our toolkit to easily develop and
evaluate IE systems
–  Open source and freely available at
http://reel.cs.columbia.edu
Before We Leave…
Thank	
  you!	
  

More Related Content

What's hot

Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Ian Foster
 
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?
Anubhav Jain
 
A Framework and Infrastructure for Uncertainty Quantification and Management ...
A Framework and Infrastructure for Uncertainty Quantification and Management ...A Framework and Infrastructure for Uncertainty Quantification and Management ...
A Framework and Infrastructure for Uncertainty Quantification and Management ...
aimsnist
 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream mining
Albert Bifet
 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016
Albert Bifet
 
Introduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaIntroduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKenna
openseesdays
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and Chemistry
Ian Foster
 
Overview of DuraMat software tool development (poster version)
Overview of DuraMat software tool development(poster version)Overview of DuraMat software tool development(poster version)
Overview of DuraMat software tool development (poster version)
Anubhav Jain
 
STKO - A revolutionary toolkit for OpenSees
STKO - A revolutionary toolkit for OpenSeesSTKO - A revolutionary toolkit for OpenSees
STKO - A revolutionary toolkit for OpenSees
openseesdays
 
Automated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAutomated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design Problems
Anubhav Jain
 

What's hot (10)

Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
 
How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?How might machine learning help advance solar PV research?
How might machine learning help advance solar PV research?
 
A Framework and Infrastructure for Uncertainty Quantification and Management ...
A Framework and Infrastructure for Uncertainty Quantification and Management ...A Framework and Infrastructure for Uncertainty Quantification and Management ...
A Framework and Infrastructure for Uncertainty Quantification and Management ...
 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream mining
 
MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016
 
Introduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKennaIntroduction to OpenSees by Frank McKenna
Introduction to OpenSees by Frank McKenna
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and Chemistry
 
Overview of DuraMat software tool development (poster version)
Overview of DuraMat software tool development(poster version)Overview of DuraMat software tool development(poster version)
Overview of DuraMat software tool development (poster version)
 
STKO - A revolutionary toolkit for OpenSees
STKO - A revolutionary toolkit for OpenSeesSTKO - A revolutionary toolkit for OpenSees
STKO - A revolutionary toolkit for OpenSees
 
Automated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAutomated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design Problems
 

Similar to Speeding up information extraction programs: a holistic optimizer and a learning-based approach to rank documents

VL/HCC 2014 - A Longitudinal Study of Programmers' Backtracking
VL/HCC 2014 - A Longitudinal Study of Programmers' BacktrackingVL/HCC 2014 - A Longitudinal Study of Programmers' Backtracking
VL/HCC 2014 - A Longitudinal Study of Programmers' Backtracking
YoungSeok Yoon
 
EUGM 2014 - Serge P. Parel (Exquiron): Farewell, PipelinePilot : Migrating th...
EUGM 2014 - Serge P. Parel (Exquiron): Farewell, PipelinePilot : Migrating th...EUGM 2014 - Serge P. Parel (Exquiron): Farewell, PipelinePilot : Migrating th...
EUGM 2014 - Serge P. Parel (Exquiron): Farewell, PipelinePilot : Migrating th...
ChemAxon
 
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
SAIL_QU
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Anubhav Jain
 
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
The Statistical and Applied Mathematical Sciences Institute
 
Scientific Software: Sustainability, Skills & Sociology
Scientific Software: Sustainability, Skills & SociologyScientific Software: Sustainability, Skills & Sociology
Scientific Software: Sustainability, Skills & Sociology
Neil Chue Hong
 
Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...
Zbigniew Jerzak
 
Parallel Computing - Lec 6
Parallel Computing - Lec 6Parallel Computing - Lec 6
Parallel Computing - Lec 6
Shah Zaib
 
Beyond Data: Delivering Machine Translation with Subject Matter Expertise
Beyond Data: Delivering Machine Translation with Subject Matter ExpertiseBeyond Data: Delivering Machine Translation with Subject Matter Expertise
Beyond Data: Delivering Machine Translation with Subject Matter Expertise
Iconic Translation Machines
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
Paolo Missier
 
Proof energy@work midih oc2-demo_day
Proof energy@work midih oc2-demo_dayProof energy@work midih oc2-demo_day
Proof energy@work midih oc2-demo_day
MIDIH_EU
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017
Florian Lautenschlager
 
How to Improve Computer Vision with Geospatial Tools
How to Improve Computer Vision with Geospatial ToolsHow to Improve Computer Vision with Geospatial Tools
How to Improve Computer Vision with Geospatial Tools
Safe Software
 
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...
Lionel Briand
 
Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
Paolo Missier
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia Voulibasi
ISSEL
 
Enabling Physics and Empirical-Based Algorithms with Spark Using the Integrat...
Enabling Physics and Empirical-Based Algorithms with Spark Using the Integrat...Enabling Physics and Empirical-Based Algorithms with Spark Using the Integrat...
Enabling Physics and Empirical-Based Algorithms with Spark Using the Integrat...
Databricks
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)
Vincenzo Gulisano
 
Design and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management SystemDesign and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management System
Erdi Olmezogullari
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Paolo Missier
 

Similar to Speeding up information extraction programs: a holistic optimizer and a learning-based approach to rank documents (20)

VL/HCC 2014 - A Longitudinal Study of Programmers' Backtracking
VL/HCC 2014 - A Longitudinal Study of Programmers' BacktrackingVL/HCC 2014 - A Longitudinal Study of Programmers' Backtracking
VL/HCC 2014 - A Longitudinal Study of Programmers' Backtracking
 
EUGM 2014 - Serge P. Parel (Exquiron): Farewell, PipelinePilot : Migrating th...
EUGM 2014 - Serge P. Parel (Exquiron): Farewell, PipelinePilot : Migrating th...EUGM 2014 - Serge P. Parel (Exquiron): Farewell, PipelinePilot : Migrating th...
EUGM 2014 - Serge P. Parel (Exquiron): Farewell, PipelinePilot : Migrating th...
 
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
 
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
CLIM Program: Remote Sensing Workshop, An Introduction to Systems and Softwar...
 
Scientific Software: Sustainability, Skills & Sociology
Scientific Software: Sustainability, Skills & SociologyScientific Software: Sustainability, Skills & Sociology
Scientific Software: Sustainability, Skills & Sociology
 
Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...Optimization of Continuous Queries in Federated Database and Stream Processin...
Optimization of Continuous Queries in Federated Database and Stream Processin...
 
Parallel Computing - Lec 6
Parallel Computing - Lec 6Parallel Computing - Lec 6
Parallel Computing - Lec 6
 
Beyond Data: Delivering Machine Translation with Subject Matter Expertise
Beyond Data: Delivering Machine Translation with Subject Matter ExpertiseBeyond Data: Delivering Machine Translation with Subject Matter Expertise
Beyond Data: Delivering Machine Translation with Subject Matter Expertise
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Proof energy@work midih oc2-demo_day
Proof energy@work midih oc2-demo_dayProof energy@work midih oc2-demo_day
Proof energy@work midih oc2-demo_day
 
Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017Chronix Poster for the Poster Session FAST 2017
Chronix Poster for the Poster Session FAST 2017
 
How to Improve Computer Vision with Geospatial Tools
How to Improve Computer Vision with Geospatial ToolsHow to Improve Computer Vision with Geospatial Tools
How to Improve Computer Vision with Geospatial Tools
 
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...
Testing Dynamic Behavior in Executable Software Models - Making Cyber-physica...
 
Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Triantafyllia Voulibasi
Triantafyllia VoulibasiTriantafyllia Voulibasi
Triantafyllia Voulibasi
 
Enabling Physics and Empirical-Based Algorithms with Spark Using the Integrat...
Enabling Physics and Empirical-Based Algorithms with Spark Using the Integrat...Enabling Physics and Empirical-Based Algorithms with Spark Using the Integrat...
Enabling Physics and Empirical-Based Algorithms with Spark Using the Integrat...
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)
 
Design and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management SystemDesign and Implementation of A Data Stream Management System
Design and Implementation of A Data Stream Management System
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 

More from INRIA-OAK

Change Management in the Traditional and Semantic Web
Change Management in the Traditional and Semantic WebChange Management in the Traditional and Semantic Web
Change Management in the Traditional and Semantic Web
INRIA-OAK
 
A Network-Aware Approach for Searching As-You-Type in Social Media
A Network-Aware Approach for Searching As-You-Type in Social MediaA Network-Aware Approach for Searching As-You-Type in Social Media
A Network-Aware Approach for Searching As-You-Type in Social Media
INRIA-OAK
 
Querying incomplete data
Querying incomplete dataQuerying incomplete data
Querying incomplete data
INRIA-OAK
 
ANGIE in wonderland
ANGIE in wonderlandANGIE in wonderland
ANGIE in wonderland
INRIA-OAK
 
On building more human query answering systems
On building more human query answering systemsOn building more human query answering systems
On building more human query answering systems
INRIA-OAK
 
Dynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data PlatformsDynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data Platforms
INRIA-OAK
 
Web Data Management in RDF Age
Web Data Management in RDF AgeWeb Data Management in RDF Age
Web Data Management in RDF Age
INRIA-OAK
 
Oak meeting 18/09/2014
Oak meeting 18/09/2014Oak meeting 18/09/2014
Oak meeting 18/09/2014
INRIA-OAK
 
Nautilus
NautilusNautilus
Nautilus
INRIA-OAK
 
Warg
WargWarg
Warg
INRIA-OAK
 
Vip2p
Vip2pVip2p
Vip2p
INRIA-OAK
 
S4
S4S4
Rdf saturator
Rdf saturatorRdf saturator
Rdf saturator
INRIA-OAK
 
Rdf generator
Rdf generatorRdf generator
Rdf generator
INRIA-OAK
 
Rdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimationRdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimation
INRIA-OAK
 
rdf query reformulation
rdf query reformulationrdf query reformulation
rdf query reformulation
INRIA-OAK
 
postgres loader
postgres loaderpostgres loader
postgres loader
INRIA-OAK
 
Plreuse
PlreusePlreuse
Plreuse
INRIA-OAK
 
Paxquery
PaxqueryPaxquery
Paxquery
INRIA-OAK
 
Conjunctive queries
Conjunctive queriesConjunctive queries
Conjunctive queries
INRIA-OAK
 

More from INRIA-OAK (20)

Change Management in the Traditional and Semantic Web
Change Management in the Traditional and Semantic WebChange Management in the Traditional and Semantic Web
Change Management in the Traditional and Semantic Web
 
A Network-Aware Approach for Searching As-You-Type in Social Media
A Network-Aware Approach for Searching As-You-Type in Social MediaA Network-Aware Approach for Searching As-You-Type in Social Media
A Network-Aware Approach for Searching As-You-Type in Social Media
 
Querying incomplete data
Querying incomplete dataQuerying incomplete data
Querying incomplete data
 
ANGIE in wonderland
ANGIE in wonderlandANGIE in wonderland
ANGIE in wonderland
 
On building more human query answering systems
On building more human query answering systemsOn building more human query answering systems
On building more human query answering systems
 
Dynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data PlatformsDynamically Optimizing Queries over Large Scale Data Platforms
Dynamically Optimizing Queries over Large Scale Data Platforms
 
Web Data Management in RDF Age
Web Data Management in RDF AgeWeb Data Management in RDF Age
Web Data Management in RDF Age
 
Oak meeting 18/09/2014
Oak meeting 18/09/2014Oak meeting 18/09/2014
Oak meeting 18/09/2014
 
Nautilus
NautilusNautilus
Nautilus
 
Warg
WargWarg
Warg
 
Vip2p
Vip2pVip2p
Vip2p
 
S4
S4S4
S4
 
Rdf saturator
Rdf saturatorRdf saturator
Rdf saturator
 
Rdf generator
Rdf generatorRdf generator
Rdf generator
 
Rdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimationRdf conjunctive query selectivity estimation
Rdf conjunctive query selectivity estimation
 
rdf query reformulation
rdf query reformulationrdf query reformulation
rdf query reformulation
 
postgres loader
postgres loaderpostgres loader
postgres loader
 
Plreuse
PlreusePlreuse
Plreuse
 
Paxquery
PaxqueryPaxquery
Paxquery
 
Conjunctive queries
Conjunctive queriesConjunctive queries
Conjunctive queries
 

Recently uploaded

LEARNING TO LIVE WITH LAWS OF MOTION .pptx
LEARNING TO LIVE WITH LAWS OF MOTION .pptxLEARNING TO LIVE WITH LAWS OF MOTION .pptx
LEARNING TO LIVE WITH LAWS OF MOTION .pptx
yourprojectpartner05
 
fermented food science of sauerkraut.pptx
fermented food science of sauerkraut.pptxfermented food science of sauerkraut.pptx
fermented food science of sauerkraut.pptx
ananya23nair
 
Summary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdfSummary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdf
vadgavevedant86
 
cathode ray oscilloscope and its applications
cathode ray oscilloscope and its applicationscathode ray oscilloscope and its applications
cathode ray oscilloscope and its applications
sandertein
 
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
Scintica Instrumentation
 
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdfAJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR
 
Farming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptxFarming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptx
Frédéric Baudron
 
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
Sérgio Sacani
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
hozt8xgk
 
Anti-Universe And Emergent Gravity and the Dark Universe
Anti-Universe And Emergent Gravity and the Dark UniverseAnti-Universe And Emergent Gravity and the Dark Universe
Anti-Universe And Emergent Gravity and the Dark Universe
Sérgio Sacani
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
Leonel Morgado
 
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
PsychoTech Services
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
International Food Policy Research Institute- South Asia Office
 
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
frank0071
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
Sérgio Sacani
 
HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
HUMAN EYE By-R.M Class 10 phy best digital notes.pdfHUMAN EYE By-R.M Class 10 phy best digital notes.pdf
HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
Ritik83251
 
23PH301 - Optics - Optical Lenses.pptx
23PH301 - Optics  -  Optical Lenses.pptx23PH301 - Optics  -  Optical Lenses.pptx
23PH301 - Optics - Optical Lenses.pptx
RDhivya6
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Leonel Morgado
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
Sérgio Sacani
 
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfMending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Selcen Ozturkcan
 

Recently uploaded (20)

LEARNING TO LIVE WITH LAWS OF MOTION .pptx
LEARNING TO LIVE WITH LAWS OF MOTION .pptxLEARNING TO LIVE WITH LAWS OF MOTION .pptx
LEARNING TO LIVE WITH LAWS OF MOTION .pptx
 
fermented food science of sauerkraut.pptx
fermented food science of sauerkraut.pptxfermented food science of sauerkraut.pptx
fermented food science of sauerkraut.pptx
 
Summary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdfSummary Of transcription and Translation.pdf
Summary Of transcription and Translation.pdf
 
cathode ray oscilloscope and its applications
cathode ray oscilloscope and its applicationscathode ray oscilloscope and its applications
cathode ray oscilloscope and its applications
 
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
(June 12, 2024) Webinar: Development of PET theranostics targeting the molecu...
 
AJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdfAJAY KUMAR NIET GreNo Guava Project File.pdf
AJAY KUMAR NIET GreNo Guava Project File.pdf
 
Farming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptxFarming systems analysis: what have we learnt?.pptx
Farming systems analysis: what have we learnt?.pptx
 
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
Candidate young stellar objects in the S-cluster: Kinematic analysis of a sub...
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
 
Anti-Universe And Emergent Gravity and the Dark Universe
Anti-Universe And Emergent Gravity and the Dark UniverseAnti-Universe And Emergent Gravity and the Dark Universe
Anti-Universe And Emergent Gravity and the Dark Universe
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
 
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
 
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
Juaristi, Jon. - El canon espanol. El legado de la cultura española a la civi...
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
 
HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
HUMAN EYE By-R.M Class 10 phy best digital notes.pdfHUMAN EYE By-R.M Class 10 phy best digital notes.pdf
HUMAN EYE By-R.M Class 10 phy best digital notes.pdf
 
23PH301 - Optics - Optical Lenses.pptx
23PH301 - Optics  -  Optical Lenses.pptx23PH301 - Optics  -  Optical Lenses.pptx
23PH301 - Optics - Optical Lenses.pptx
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
 
The binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defectsThe binding of cosmological structures by massless topological defects
The binding of cosmological structures by massless topological defects
 
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfMending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdf
 

Speeding up information extraction programs: a holistic optimizer and a learning-based approach to rank documents

  • 1. Speeding up Information Extraction Programs Helena Galhardas INESC-ID and IST, Universidade de Lisboa
  • 2. Who am I? •  Assistant Professor at IST, University of Lisbon (http:// www.ist.utl.pt) •  Researcher at INESC-ID (http://www.inesc-id.pt) •  Former (2000) PhD student at INRIA Rocquencourt/Univ. Versailles •  Main research interests: –  Data Cleaning •  Support for User Involvement in Data Cleaning, DaWaK 2011 •  Automatic optimization of the match operation (similarity join) –  Medical Information Systems •  A Machine Learning based Natural Language Interface for a Database of Medicines, poster paper DILS 2014 •  ProntApp: A Mobile Question Answering System for Medicines –  Information Extraction 4/3/15 2 2  
  • 3. Who am I? •  Assistant Professor at IST, University of Lisbon (http:// www.ist.utl.pt) •  Researcher at INESC-ID (http://www.inesc-id.pt) •  Former (2000) PhD student at INRIA Rocquencourt/Univ. Versailles •  Main research interests: –  Data Cleaning •  Support for User Involvement in Data Cleaning, DaWaK 2011 •  Automatic optimization of the match operation (similarity join) –  Medical Information Systems •  A Machine Learning based Natural Language Interface for a Database of Medicines, poster paper DILS 2014 •  ProntApp: A Mobile Question Answering System for Medicines Ø  Information Extraction 4/3/15 3 3  
  • 4. Agenda •  A Holistic Optimizer for Speeding up Information Extraction Programs –  PVLDB’13 and presented at VLDB’14 –  Joint work with Gonçalo Simões, INESC-ID and IST/Univ. Lisboa and Luis Gravano, Columbia University •  A Learning-based Approach to Rank Documents (briefly) –  EDBT’15 –  Joint work with Pablo Barrio, Columbia University, Gonçalo Simões and Luis Gravano 4  
  • 5. Information Extraction Discovers Structured Information In Text Documents The  erup+on  of    Grímsvötn    during  22-­‐25   May  2011  brought  back  the  memories  of   the  erup+ons  of  EyjaEallajökull  in  2010.   Natural  Disaster   erup+on  of    Grímsvötn   erup+ons  of  EyjaEallajökull   Temporal  Expression   22-­‐25  May  2011   2010   Temporal  Expression   22-­‐25  May  2011   Natural  Disaster   Temporal  Expression   erup+on  of    Grímsvötn   22-­‐25  May  2011   Extract   Occurs-­‐In   Extract   Natural  Disaster   Select   Temporal  Expressions   From  2011   Extract   Temporal  Expressions     Much  richer  querying   and  analysis   5  
  • 6. IE  SpecificaCon   Naïve  ExecuCon  Plan   Information Extraction Specification and Execution Plans are Graphs of Operators Type  of  Operator:   EE:  En+ty  Extractor   RE:  Rela+on  Extractor   Technique   Algorithm  
  • 7. Information Extraction is Challenging and Time-consuming •  Operates on a large set of features •  Relies on complex text analysis •  Often applied over large document collections erupCon   brought   memories   erupCons   erupCon   bring   memory   erupCon   Grímsvötn   22-­‐25   brought   2010   Cc+   N+.N+   c+   N+   Ccccccccc   NN.NN   ccccccc   NNNN   The   erupCon   brought   back   the   memories   the   erupCons   2010   det   nsubj   prt   det   dobj   prep_of   prep_of  det   LemmaCzaCon  Features   Character  CapitalizaCon  Features   7  
  • 8. IE Implementation Choices to the Rescue •  These implementation choices have an impact on execution time and quality of results (recall and precision) –  Recall: fraction of correct tuples that the plan produces –  Precision: fraction of produced tuples that are correct D A   C B D1   A1   A2   A3   C1   B1   B2   Choosing  an  Algorithm  for  Each  Operator   Choosing  Operator  ExecuCon  Order   A   D C B A   D C B A   D C B A2   D1   C1   B1   ExecuCon  Plan   Query(210K  docs)   EECRF   EECRF              Viterbi   EECRF      VBS-­‐10   EECRF      VBS-­‐5   RESVM   RESVM              All  SVM   Choosing  Document  Retrieval  Strategy   Query  for   Relevant  Docs   Scan  whole   CollecCon   1.  Execute  A   2.  Find  sentences  that   produced  results  for  A.   3.  Execute  B  with  those   sentences  only   Query  for:   [“natural  disaster”],   [earthquake],   [erup+on],   [tornado],  (…)   8  
  • 9. State-of-the-art IE Optimization Systems D A   C B D1   A1   A2   A3   C1   B1   B2   Choosing  an  Algorithm  for  Each  Operator   Choosing  Operator  ExecuCon  Order   A   D C B A   D C B A   D C B SystemT   CIMPLE   ¡  Both use a cost-based optimizer to choose the execution order of operators ¡  System T uses heuristics to choose among different algorithms for an operator ¡  Do not allow approximate results Choosing  Document  Retrieval  Strategy   Query  for   Relevant  Docs   Scan  whole   CollecCon  
  • 10. •  An information extraction optimizer should make these implementation choices collectively! D A   C B D1   A1   A2   A3   C1   B1   B2   Choosing  an  Algorithm  for  Each  Operator   Choosing  Operator  ExecuCon  Order   A   D C B A   D C B A   D C B SQoUT   Choosing  Document  Retrieval  Strategy   Query  for   Relevant  Docs   Scan  whole   CollecCon   10   State-of-the-art IE Optimization Systems
  • 11. Making implementation choices collectively leads to faster plans A   D C B A1   D1   C1   B1   Scan(1M  docs)   Time:   7h  30  m   Recall:   93%   Precision:   100%   A3   D1   C1   B2   Scan(1M  docs)   Time:   9h  23  m   Recall:   91%   Precision:   95%   A1   D1   C1   B1   Scan(1M  docs)   Time:   13h  16  m   Recall:   100%   Precision:   100%   Naive  Plan   A1   D1   C1   B1   Query(153K  docs)   Time:   2h  07  m   Recall:   50%   Precision:   100%   •  By combining multiple choices we get a faster plan •  We only control the quality results with collective optimization Algorithms   OpCmizaCon   Operator  Order   OpCmizaCon   Document  Retrieval   OpCmizaCon   A3   D1   C1   B2   Query(210K  docs)   Time:   1h  31  m   Recall:   50%   Precision:   96%   CollecCvely  OpCmized  Plan   A3   D1   C1   B2   Query(153K  docs)   Time:   57  m   Recall:   39%   Precision:   98%   Direct  CombinaCon  Plan   Hard  to  control  the  expected   recall  and  precision   Precision  >=  90%   Recall  >=  50%   Recall  and  Precision   Constraints  
  • 12. Holistic Optimizer for Information Extraction A2   D1   C1   B1   Query(210K  docs)   Time:   43  m   Recall:   50%   Precision:   95%   Best  Plan   Document  CollecCon   A   IE  SpecificaCon   Precision  >=  90%  Recall  >=  50%   Recall  and  Precision  Constraints   A3   D1   C1   B2   A2   D1   C1   B1   A2   D1   C1   B1   Query (210K docs) Query (210K docs) Query (210K docs) Candidate Plans Best Plan Document   Sample   Predictor   Parameters   Recall  and   Precision   Predictor   Time: 57 m Recall: 39% Precision: 78% A2   D1   C1   B1   Time: 1h 24 m Recall: 98% Precision: 95%Query A3   D1   C1   B2   Query HolisCc   OpCmizer   (…)   EnumeraCon  of  ExecuCon  Plans   A2   D1   C1   B1   A3   D1   C1   B2  
  • 13. Predictor Parameters Estimation •  Predictor parameters are determined from a document sample •  Three phases: 1.  Sampling to select the subset of input documents to use in the estimation (stratified sampling) 2.  Extraction to retrieve tuples from the sample using the plans produced by the enumeration step (dynamic programming) 3.  Estimation to determine the optimizer parameters using the results of 2) (Maximum-a-posteriori estimation) Document  Sample   Predictor  Parameters   Recall  and  Precision   Predictor  
  • 14. Extraction Quality Prediction •  Based on two predictions: – Number of input tuples of operator – Number of output tuples of operator A2   D1   C1   (…)   (…)   Path  1   Path  2   D1   t1   t2   t3   t4   (…)   •   •   •   •   •   •   •   •   •   •   A   A2  A1   A3   t1   t2   t3   (…)   t1   t2   t1   t1   t3  
  • 15. Experimental Validation •  Impact of the Sample Size and Prediction Quality •  Impact of the Parameter Estimation Strategy •  Impact of Individual Implementation Choices •  Impact of Precision Constraints •  Comparison With the State-of-the-Art Techniques 15  
  • 16. Experimental Validation •  Impact of the Sample Size and Prediction Quality •  Impact of the Parameter Estimation Strategy Ø Impact of Individual Implementation Choices •  Impact of Precision Constraints Ø Comparison With the State-of-the-Art Techniques 16  
  • 17. Experimental Setup •  Datasets •  Information Extraction Programs Researchers’  Homepages   Kedar  Bellare’  Homepage     PhD  student  at  University  of   Massachusegs,  Amherst  under   supervision  of  Andrew   McCallum.       Publica+ons     Generalized  Expecta+on  Criteria   for  Bootstrapping  Extractors   using  Record-­‐Text  Alignment,   EMNLP  2009,  August  2009   Advisor   Advisee   Andrew   McCallum   Kedar  Bellare   Conference   Date   EMNLP  2009   August  2009   Advises   ConferencesDates   The  Hai+  cholera  outbreak   between  2010  and  2013  was  the   worst  epidemic  of  cholera  in   recent  history   Google  co-­‐founders  Larry  Page   and  Sergey  Brin  recently  sat   down  with  billionaire  venture   capitalist  Vinod  Khosla  for  a   lengthy  interview.   Disease   Time  Period   Cholera   Between  2010  and   2013   DiseaseOutbreaks   Person   OrganizaCon   Larry  Page   Google   Sergey  Brin   Google   Organiza+onAffilia+on   Sparse  RelaCons   Dense  RelaCons   17  
  • 18. Experimental Setup (cont.) •  IE Optimization Systems (baselines): –  Cimple –  System T –  SQoUT –  SQoUt-Boosted: variation of SQoUT that uses the execution plans produced by Cimple and System T •  IE Optimization Techniques: –  Optimized-Ret: implementation choice of retrieval strategy –  Optimized-Alg: implementation choice of the algorithms –  Optimized-Order: implementation choice of the operator execution order –  Holistic-Map: uses maximum a posteriori for estimating the parameters –  Holistic-ML: uses maximum likelihood for estimating the parameters 18  
  • 19. Impact of Individual Implementation Choices •  Holistic-MAP obtains the fastest plan in every case •  Optimized-Ret becomes too slow when desired recall increases •  Optimized-Order and Optimized-Alg consistently produce slow plans •  Holistic optimization outperforms individual optimization ConferencesDates   19  
  • 20. Comparison with State-of-the-Art Techniques •  For dense relations, Holistic-MAP –  Obtains faster plans –  Closely matches recall constraints OrganizaConAffiliaCon   20  
  • 21. Comparison with State-of-the-Art Techniques •  For sparse relations, Holistic-MAP –  Obtains faster plans –  Does not always matches recall constraints –  Can correct the errors in recall constraints as the target recall increases DiseaseOutbreaks   21  
  • 22. Conclusions: Holistic Optimizer •  IE programs can be optimized along multiple interrelated dimensions, namely, the choice of: –  Algorithms for each extraction operator –  Operator execution order –  Document retrieval strategy •  We presented the first holistic optimization approach for IE that collectively optimizes all dimensions simultaneously –  It outperforms the state-of-the-art techniques and selects efficient plans that meet target recall and precision values 22  
  • 23. What Comes Next? •  Ranking Models for IE (second part of the talk) •  Distributed Executions for IE (on-going work) –  by determining document placement in distributed file system Random   order   Ranked   order  
  • 24. Agenda •  A Holistic Optimizer for Speed up Information Extraction Programs –  PVLDB’13 and presented at VLDB’14 –  Joint work with Gonçalo Simões and Luis Gravano •  A learning-based Approach to Rank Documents –  EDBT’15 –  Joint work with Pablo Barrio, Gonçalo Simões and Luis Gravano 24  
  • 25. Reducing IE Processing Time •  Small, topic-specific fraction of the entire collection is useful Ex: Only 2% of documents in a New York Times archive, mostly environment-related, are useful for Natural Disaster- Location with a state-of-the-art IE system •  Useful documents share distinctive words and phrases Ex: “earthquake”, “storm,” “richter,” “volcano eruption” for Natural Disaster-Location •  Information extraction system labels documents as useful or not, for free Should focus extraction over these documents and ignore rest Should learn to distinguish useful documents for an IE task Could use the result of an IE process to generate an ever- expanding training set for learning and further identifying useful documents Documents are useful if they produce output for a given IE task
  • 26. Existing Approaches: Qxtract and FactCrawl •  Qxtract and FactCrawl learn from a small document sample and so exhibit far-from-perfect recall •  FactCrawl re-ranks documents using the quality of learned queries •  FactCrawl relies on queries derived from small set of documents and does not adapt to new processed documents
  • 27. Our Approach •  Focus on the potentially useful documents for the extraction task at hand Ø  Learning to rank approach for document ranking •  Results of extraction process form ever-expanding training set Ø  Adaptive approach to update document ranking continuously Features:  Words  and  phrases,  and  agribute  values   Document   Collec+on   f(di)  =  si       s1  ≥  s2  ≥  s3  ≥  ...  si  ≥  …  sn   New  words:   lava,  fissure   Learning   Ranking  and  processing   New  training   instances  
  • 28. Ranking Documents Adaptively for IE Learns  that  “tornado,”  “earthquake,”   are  makers  of  useful  documents   Learning   Document  processing  and  update   detec+on     Document   Collec+on   Useful  documents  but  on   volcanoes,  not  yet  observed   prominently  in  IE  process   <tornado,  hawaii>   “…  ‘Acermath’  narrates  the  story  of   a  man  that  goes  missing…”   “…  S+ll  recovering  from  an  earthquake,   Chile  is  threatened  by  the  erupCon  of   Copahue  volcano…”   Online  relearning   +   Performs  online  learning   New  informaCon  can   poten+ally  help  improve   ranking,  so  Update!   <volcano,  chile>   Learns  that  “volcano”  and   “erup+on”  are  now  makers  of   useful  documents   Ranking   adaptaCon  
  • 29. Ranking Documents Adaptively for IE: Key Ideas •  To address efficiency, rely on online learning –  Train the ranking model incrementally, one document at a time •  To handle large (and expanding) feature sets, rely on in-training feature selection –  The learning-to-rank algorithm can efficiently identify the most discriminative features during the training of the document ranking model •  We propose two learning-to-rank techniques for IE that integrate online learning and in-training feature selection: –  BAgg-IE –  RSVM-IE •  And two update detection techniques for document ranking adaptation: –  Top-K –  Mod-C
  • 30. Experimental Validation 1.  Impact of learning-to-rank approach 2.  Impact of sampling strategies 3.  Impact of adaptation 4.  Impact of update detection 5.  Scalability of our approach 6.  Comparison with the state-of-the-art ranking strategies 30  
  • 31. Experimental Validation 1.  Impact of learning-to-rank approach 2.  Impact of sampling strategies 3.  Impact of adaptation 4.  Impact of update detection 5.  Scalability of our approach Ø Comparison with the state-of-the-art ranking strategies 31  
  • 32. Complex  extrac+on  systems:     CRFs,  SVM  kernels     Simple  extrac+on  systems:     HMMs,  text  pagerns   •  Dataset: •  Information extraction systems 1.8  million  ar+cles  from  1987-­‐2007   Experimental Setup The  Hai+  cholera  outbreak  between  2010   and  2013  was  the  worst  epidemic  of   cholera  in  recent  history   Google  co-­‐founders  Larry  Page  and   Sergey  Brin  recently  sat  down  with   billionaire  venture  capitalist  Vinod  Khosla   for  a  lengthy  interview.   "This  is  not  a  vic+mless  crime,"  said  Jim   Kendall,  president  of  the  Washington   Associa+on  of  Internet  Service  Providers.   A  fire  destroyed  a  Cargill  Meat  Solu+ons  beef   processing  plant  in  Booneville.   Disease   Time  Period   cholera   between  2010  and   2013   Disease-­‐Outbreaks   Person   OrganizaCon   Larry  Page   Google   Sergey  Brin   Google   Person-­‐Organiza+on     Person   Career   Jim  Kendall   President   Person-­‐Career     Disaster   LocaCon   fire   Booneville   Man  Made  Disaster-­‐Loca+on     Other  rela+ons:              Person-­‐Charge,  Elec+on-­‐Winner,  Natural  Disaster-­‐Loca+on       Dense  rela+ons   Sparse  rela+ons  
  • 33. Recall Analysis Update  detec+on:   Mod-­‐C   Our  adapCve   implementa+on  of   the  state  of  the  art   Disease-­‐Outbreak   •  Our techniques bring significant improvement •  RSVM-IE performs best, as it prioritizes useful documents better, favoring adaptation
  • 34. Extraction Time Person-­‐Organiza+on  Affilia+on   •  Our techniques improve efficiency of process even for inexpensive IE systems Our  adapCve   implementa+on  of   the  state  of  the  art  
  • 35. Conclusion: Document Ranking for Scalable Information Extraction •  Running IE system over large text collections is computationally expensive •  Proposed lightweight, adaptive approach and learning-based alternatives –  Online learning algorithms with in- training feature selection: RSVM-IE, Bagg-IE –  Update detection based on feature changes: Mod-C, Top-K –  RSVM-IE + Mod-C performs best: Useful documents are better prioritized, enabling richer, more efficient ranking adaptation Text   Collec+on   IE  system   <tornado,  Florida>   <volcano,  Chile>   …  
  • 36. Future Work: Ranking at Different Granularities •  Few collections on the Web are relevant to an IE task •  Prioritize them based on number of useful documents •  Few sentences in a text document output tuples for an IE task •  Prioritize them based on usefulness and diversity
  • 37. Try REEL, our toolkit to easily develop and evaluate IE systems –  Open source and freely available at http://reel.cs.columbia.edu Before We Leave… Thank  you!