Information Extraction
from Web-Scale N-Gram Data
Niket Tandon and Gerard de Melo
2010-07-23
Max Planck Institute for Informatics
Saarbrücken, Germany
Outline
1 Information Extraction
2 N-Gram Information Extraction
3 Experiments
4 Conclusion
Information Extraction
Introduction
Information Extraction
Users generally want
information,
not documents
Structured data
Direct, instant answers
... and more
Other Applications
Query expansion
Semantic analysis
Faceted search
Entity Tracking
Document Enrichment
Mobile Services
Visual Object Recognition
etc.
Where do we obtain
such data?
and the love of friends' [p] Happy as the grass was green' [p] Come live with me, and be my
lawns swoop around the sunken garden. The grass is emerald green and perfect-a tribute to
overlooking the silver river. All round her the grass stretched green, but stunted, browning in the
the ground steadied beneath them, and the grass turned green, swishing high around their
to see the sun shine, the flowers blossom, the grass grow green. I could not bear to hear the
are quite dwarf. M. sinensis. Chinese silver grass. Ample green- and silver-striped foliage but
in either of them." It was summer and the grass was green. Clive Rappaport was a solicitor,
however, each bank is lined with stands of grass that remain green and stand taller than the
groaned and farted and schemed for snatches of grass that showed green at the corners of his bits,
the flowers were blossoming profusely and the grass was richly green. The people of the village
Song. [f] He is dead and gone; At his head a grass-green turf, At his heels a stone." O, ho! [f]
hard thoughts I stand by popple scrub, in tall grass, blown over and harsh, green and dry. From my
Well the sky is blue and er [tc text=pause] the grass is green and [tc text=pause] there's
Yes. Yes. [F01] Dreadful things. Erm so the grass was never quite as green [ZF1] as [ZF0] as
be beautiful on there really beautiful. All the grass lush and green not a car parked on it
How do we get Structured Data?
Structured Data
isA(Guggenheim,Museum)
locatedIn(Guggenheim,Manhattan)
partOf(Manhattan,NewYork)
. . .
Pattern-Based Approaches
Use simple textual patterns to extract information
(Lyons 1977, Cruse 1986, Hearst 1992)
e.g. “<Y> such as <X>”
“cities such as Salem” → isA(Salem,City)
e.g. “<X> and other <Y>”
“Lausanne and other cities” → isA(Lausanne,City)
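The two patterns above can be tried out with plain regular expressions. This is only a minimal sketch, not the authors' implementation: the function name `extract_isa` is hypothetical, each slot is assumed to bind a single word, and the class name is left in its surface form ("cities") rather than normalized to "City" as on the slide.

```python
import re

# Two of the patterns from the slide, with single-word slots (a simplification).
HEARST_PATTERNS = [
    re.compile(r"(?P<y>\w+) such as (?P<x>\w+)"),   # "<Y> such as <X>"
    re.compile(r"(?P<x>\w+) and other (?P<y>\w+)"),  # "<X> and other <Y>"
]

def extract_isa(text):
    """Return (instance, class) pairs matched by either pattern."""
    facts = []
    for regex in HEARST_PATTERNS:
        for match in regex.finditer(text):
            facts.append((match.group("x"), match.group("y")))
    return facts

facts = extract_isa("cities such as Salem; Lausanne and other cities")
for instance, cls in facts:
    print(f"isA({instance},{cls})")
```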
Problem: Pattern Matches are Rare
Hearst found only 46 facts in a 20-million-word New York Times
article collection
One Possibility: Sophisticated NLP (1990s)
MUC evaluation initiative
CRF-style segmentation methods
etc.
Alternative: Use Larger Corpora
American National Corpus: 22 million words
British National Corpus: 100 million words
English Wikipedia: 1 000 million words
Agichtein (2005), Pantel (2004): scalable IE, but still only a
small fraction of the entire Web
Web Search Engines
Problems
Need to know what you’re looking for.
Can only retrieve top-k results
Very slow: days instead of minutes – Cafarella (2005)
Instead
Use n-gram statistics derived
from very large parts of the
Web!
N-Gram Information Extraction
N-Gram Data
Web-Scale N-Gram Datasets
Web-scale n-gram statistics derived from around 10^12 words of
text are available
Provides: Frequencies/Language model for strings
Example: f(“cities such as Geneva”)=...
f(“Zürich and other cities”)=...
f(“Lausanne and other Swiss cities”)=...
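As a rough illustration of the frequency function f, the sketch below loads n-gram counts from lines of the form "n-gram, tab, count" and looks strings up. The tab-separated layout is an assumption about how such a dataset might be stored, the helper names are hypothetical, and the counts in the sample are invented placeholders, not real statistics.

```python
import io

def load_ngram_counts(lines):
    """Parse 'n-gram<TAB>count' lines into a dict (assumed file layout)."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ngram, count = line.rsplit("\t", 1)
        counts[ngram] = int(count)
    return counts

def f(counts, ngram):
    """Frequency of an n-gram; 0 if it is absent from the dataset."""
    return counts.get(ngram, 0)

# Invented toy counts, for illustration only.
SAMPLE = io.StringIO(
    "dogs and other animals\t120\n"
    "apples and other fruits\t75\n"
)
counts = load_ngram_counts(SAMPLE)
print(f(counts, "dogs and other animals"))
print(f(counts, "an n-gram that never occurs"))
```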
Requirements
usually binary relationships between entities
ok:
if independently extractable, e.g. founding year and location of
organization
not ok:
“<V> imported <W> dollars worth of <X> from <Y>
in year <Z>”
short items of interest
ok:
birthYear(Mozart,1756)
not:
fatherOf(Wolfgang Amadeus Mozart, F. X. Mozart)
no way:
fatherOf(Johannes Chrysostomus Wolfgangus Theophilus Mozart,
Franz Xaver Wolfgang Mozart)
short patterns
ok:
“<X> and other <Y>”
not:
“<X> has an inflation rate of <Y>”
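The length requirements follow from simple token counting: a pattern and both of its arguments must fit inside a single n-gram. A small sketch of that check, assuming a maximum of n = 5, whitespace tokenization, and exactly one <X> and one <Y> slot per pattern (the function name is hypothetical):

```python
MAX_N = 5  # longest n-gram assumed available

def fits_in_ngram(pattern, x_tokens=1, y_tokens=1, max_n=MAX_N):
    """True if the pattern with both slots filled can occur inside one n-gram."""
    fixed = [tok for tok in pattern.split() if tok not in ("<X>", "<Y>")]
    return len(fixed) + x_tokens + y_tokens <= max_n

# Short pattern, one-word arguments: 2 fixed tokens + 2 slot tokens = 4.
print(fits_in_ngram("<X> and other <Y>"))
# Long pattern: 5 fixed tokens + 2 slot tokens = 7.
print(fits_in_ngram("<X> has an inflation rate of <Y>"))
# Short pattern, but a three-word entity name: 2 + 3 + 1 = 6.
print(fits_in_ngram("<X> and other <Y>", x_tokens=3))
```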
Risks
Influence of spam and boilerplate text
Less control over the selection of input documents
Less context information (WSD, POS tagging, parsing)
Then why use n-grams?
much larger input (petabytes of original data)
better coverage
higher precision (more evidence, more redundancy)
Pantel (2004): more data allows a rather simple technique to
outperform much more sophisticated algorithms
availability
larger than available document collections
crawling the Web: slow, requires link farm detection, high
bandwidth
Algorithm
1 collect patterns
input: seed tuples for a relation
e.g. for isA relation: (dogs,animals), (gold,metal)
e.g. for partOf: (finger,hand), (leaves,trees),
(windows,houses)
find n-grams containing seeds
query n-gram dataset: “dogs * animals” (and “animals * dogs”)
alternatively: “dogs ? animals”, “dogs ? ? animals”, . . .
alternatively: fall back to separate document collection
generalize to textual patterns
(dogs,animals) found in
“.... dogs and other animals ...”
→ “<X> and other <Y>”
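Step 1 could be sketched as follows. This is an illustrative simplification rather than the authors' code: seeds are assumed to be single tokens, only the span between the two seed words is kept, and the name `induce_patterns` is hypothetical.

```python
def induce_patterns(ngrams, seeds):
    """Turn n-grams containing a seed pair into <X>/<Y> patterns."""
    patterns = set()
    for ngram in ngrams:
        tokens = ngram.split()
        for x, y in seeds:
            if x != y and x in tokens and y in tokens:
                i, j = tokens.index(x), tokens.index(y)
                lo, hi = min(i, j), max(i, j)
                span = tokens[lo:hi + 1]  # keep only the text between the seeds
                span[0] = "<X>" if i < j else "<Y>"
                span[-1] = "<Y>" if i < j else "<X>"
                patterns.add(" ".join(span))
    return patterns

ngrams = [
    "dogs and other animals",
    "animals such as dogs",
    "gold is a metal",
    "cats chase mice",
]
seeds = [("dogs", "animals"), ("gold", "metal")]
for pattern in sorted(induce_patterns(ngrams, seeds)):
    print(pattern)
```

Both seed orders are handled in one pass, which mirrors querying “dogs * animals” as well as “animals * dogs”.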
2 Search for patterns in n-gram data → candidate tuples
“<X> and other <Y>” finds
(Zürich,cities) “Zürich and other cities”
(apples,fruits) “apples and other fruits”
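Step 2 can be pictured as a scan over the n-gram table. In the sketch below each slot binds exactly one token and an n-gram must have the same length as the pattern; both are simplifying assumptions, `apply_pattern` is a hypothetical name, and the counts are invented.

```python
def apply_pattern(pattern, ngram_counts):
    """Return {(x, y): count} for n-grams matching the pattern exactly."""
    p_tokens = pattern.split()
    candidates = {}
    for ngram, count in ngram_counts.items():
        tokens = ngram.split()
        if len(tokens) != len(p_tokens):
            continue
        binding = {}
        for p_tok, tok in zip(p_tokens, tokens):
            if p_tok in ("<X>", "<Y>"):
                binding[p_tok] = tok      # slot: remember the filler
            elif p_tok != tok:
                break                     # literal token mismatch
        else:
            pair = (binding["<X>"], binding["<Y>"])
            candidates[pair] = candidates.get(pair, 0) + count
    return candidates

# Invented toy counts, for illustration only.
ngram_counts = {
    "Zürich and other cities": 3,
    "apples and other fruits": 5,
    "cities such as Geneva": 7,
}
candidates = apply_pattern("<X> and other <Y>", ngram_counts)
print(candidates)
```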
3 Finally, rank the candidate tuples, choose output tuples
Supervised learning based on labeled set of tuples
Output: Accepted tuples like (Geneva,city).
Features: for a tuple (x, y)
$f_i(p(x, y))$ for each datasource i and pattern p
$\sum_{p \in P} f_i(p(x, y))$ for each datasource i
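The feature vector for a tuple might be assembled as sketched below: one score per data source and pattern, plus one per-source sum over all patterns. Representing each data source as a plain dict from n-gram string to score is an assumption made for this illustration, as are the function name and the toy numbers.

```python
def tuple_features(pair, patterns, sources):
    """Feature vector for a candidate tuple (x, y).

    sources: list of dicts mapping n-gram string -> score, one per data source.
    """
    x, y = pair
    features = []
    for scores in sources:
        per_pattern = [
            scores.get(p.replace("<X>", x).replace("<Y>", y), 0)
            for p in patterns
        ]
        features.extend(per_pattern)        # f_i(p(x, y)) for every pattern p
        features.append(sum(per_pattern))   # sum over p in P of f_i(p(x, y))
    return features

patterns = ["<X> and other <Y>", "<Y> such as <X>"]
# Invented toy scores for two hypothetical data sources.
source_a = {"apples and other fruits": 5, "fruits such as apples": 2}
source_b = {"apples and other fruits": 1}
print(tuple_features(("apples", "fruits"), patterns, [source_a, source_b]))
```

A vector of this shape is what a supervised learner would receive for each labeled tuple.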
Experiments
Datasets
1 Google Web 1T 5-Gram Corpus
contains n-gram statistics for n = 1 . . . 5
generated from around 10^12 words of text
positive: distributed (around 60GB uncompressed)
negative: cut-off frequency 40
2 Microsoft Web N-gram Corpus
currently 3,4-grams, smoothed language models
generated from around 1.4T tokens, complete English US version
of Bing index
also: statistics from titles (12.5G tokens) and anchor texts (357G
tokens)
WSDL-based web service
3 ClueWeb09 5-grams
500 million web pages, 700M 5-grams
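The cut-off frequency is listed as a negative because any n-gram seen fewer times than the threshold is simply missing from the data, so rare pattern matches cannot be observed. A tiny sketch of that effect, assuming the threshold is inclusive (counts of 40 and above are kept) and using invented counts:

```python
CUTOFF = 40  # cut-off frequency named on the slide

def apply_cutoff(raw_counts, cutoff=CUTOFF):
    """Keep only n-grams that occur at least `cutoff` times (assumed inclusive)."""
    return {ngram: c for ngram, c in raw_counts.items() if c >= cutoff}

# Invented toy counts, for illustration only.
raw_counts = {
    "apples and other fruits": 1000,
    "quinces and other fruits": 40,
    "medlars and other fruits": 39,
}
released = apply_cutoff(raw_counts)
print(sorted(released))
```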
Seeds and Patterns
Relation      Seeds   Patterns discovered
isA           100     2991
partOf        100     3883
hasProperty   100     3175
seeds from MIT ConceptNet
even among highest-ranked:
partOf(children,parents) and isA(winning,everything)
Pattern Examples: isA
Pattern                               PMI range
<X> and almost any <Y>                high
<X> betting basketball betting <Y>    high
<X> is my favorite <Y>                high
<X> shoes online shoes <Y>            high
<X> is a <Y>                          medium
<X> is the best <Y>                   medium
<X> or any other <Y>                  medium
<X> , and <Y>                         medium
<X> and other smart <Y>               medium
<X> and grammar <Y>                   low
<X> content of the <Y>                low
<X> when it changes <Y>               low
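The slides group patterns by PMI range without giving the formula. For orientation only, the standard definition of pointwise mutual information, log of P(a, b) over P(a)·P(b), can be computed from counts as below. Which two events the authors paired (for example "pattern occurs" and "seed tuple occurs") is an assumption here, and the numbers are invented.

```python
import math

def pmi(joint_count, count_a, count_b, total):
    """Pointwise mutual information in bits, estimated from raw counts."""
    p_joint = joint_count / total
    p_a = count_a / total
    p_b = count_b / total
    return math.log2(p_joint / (p_a * p_b))

# Independent events give a PMI of about 0; positive values mean the two
# events co-occur more often than chance would predict.
print(pmi(joint_count=10, count_a=100, count_b=100, total=1000))
print(pmi(joint_count=50, count_a=100, count_b=100, total=1000))
```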
Pattern Examples: partOf
Pattern                               PMI range
<X> with the other <Y>                high
<X> of the top <Y>                    high
<X> online <Y>                        high
<X> shoes online shoes <Y>            high
<X> from the <Y>                      medium
<X> or even entire <Y>                medium
<X> of host <Y>                       medium
<X> from <Y>                          medium
<X> of a different <Y>                medium
<X> entertainment and <Y>             low
<X> Download for thou <Y>             low
<X> company home in <Y>               low
Patterns: Microsoft Document Body 3-grams vs. Anchor 3-grams
(each point represents the sum of pattern scores for a tuple)
Patterns: Microsoft Document Body 3-grams vs. Title 3-grams
(each point represents the sum of pattern scores for a tuple)
Patterns: Microsoft Document Body 3-grams vs. Google Body 3-grams
(each point represents the sum of pattern scores for a tuple)
Overall Results
(all data sources simultaneously)
Approach
learning:
RBF-kernel SVMs, also: random forests, C4.5, AdaBoost
∼ 500 random labelled examples per relation
(matching any of the patterns)
10-fold leave one out cross-validation
⇒ Recall is relative to union of pattern matches
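For readers unfamiliar with the evaluation protocol, here is a minimal sketch of how roughly 500 labelled examples could be split into 10 folds, each fold held out once for testing. It assumes ordinary k-fold cross-validation with k = 10; the slides do not spell out the exact procedure, and the classifier itself is omitted. The function name is hypothetical, and in practice the examples would be shuffled first.

```python
def k_fold_indices(n_examples, k=10):
    """Yield (train, test) index lists for k roughly equal, contiguous folds."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(500, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```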
Relation      Precision   Recall   F1      Output per million n-grams [1]
isA           88.9%       8.1%     14.8%   983
partOf        80.5%       34.0%    47.8%   7897
hasProperty   75.3%       99.3%    85.6%   26180
[1] the expected number of distinct accepted tuples per million input n-grams
(the total number of 5-grams in the Google Web 1T dataset is ∼1,176 million)
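The F1 column is the harmonic mean of precision and recall, which can be recomputed from the table as a sanity check. The sketch below uses the rounded percentages shown above, so the last digit may differ slightly from the reported value (the authors presumably used unrounded figures).

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same unit in and out)."""
    return 2 * precision * recall / (precision + recall)

# (precision %, recall %, reported F1 %) copied from the table above.
REPORTED = {
    "isA": (88.9, 8.1, 14.8),
    "partOf": (80.5, 34.0, 47.8),
    "hasProperty": (75.3, 99.3, 85.6),
}

for relation, (p, r, reported_f1) in REPORTED.items():
    print(f"{relation}: recomputed F1 = {f1(p, r):.2f}%, reported {reported_f1}%")
```

The isA row shows why F1 is low despite high precision: the harmonic mean is dominated by the smaller of the two values.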
linguistic information implicitly captured via combinations of
patterns!
Detailed Results (partOf relation)
Dataset                Source          Prec.   Recall   F1
Google 3-grams         Document Body   55.9%   38.5%    45.6%
Google 4-grams         Document Body   52.6%   43.3%    47.5%
Google 5-grams         Document Body   48.1%   42.8%    45.3%
ClueWeb 5-grams        Document Body   51.7%   35.6%    42.2%
Google 3-/4-grams      Document Body   53.9%   42.8%    47.7%
Google 3-/4-/5-grams   Document Body   58.7%   43.8%    50.1%
Dataset               Source                                  Prec.   Recall   F1
Microsoft 3-grams     Document Body                           58.5%   33.2%    42.3%
Microsoft 3-grams     Document Title                          51.7%   29.8%    37.8%
Microsoft 3-grams     Anchor Text                             57.3%   36.1%    44.2%
Microsoft 3-grams     Body / Title / Anchor                   40.4%   100.0%   57.5%
Google 3-grams        Document Body                           55.9%   38.5%    45.6%
Microsoft 3/4-grams   Body (3-grams only) / Title / Anchor    40.5%   98.1%    57.3%
Google 3/4-grams      Document Body                           53.9%   42.8%    47.7%
Google 3/4/5-grams    Document Body                           58.7%   43.8%    50.1%
All 3/4/5-grams       Body / Title / Anchor                   80.5%   34.0%    47.8%
Example: hasProperty
Properties of “flowers”
Conclusion
Summary
Lessons Learnt
N-gram datasets allow for
Information Extraction from
petabytes of original data
Requirements: short entity
names, short patterns
more data helps (even at very
large scales)
diversity of data sources helps
27 / 27
Information Extraction from Web-Scale N-Gram Data

More Related Content

What's hot

Context Semantic Analysis: a knowledge-based technique for computing inter-do...
Context Semantic Analysis: a knowledge-based technique for computing inter-do...Context Semantic Analysis: a knowledge-based technique for computing inter-do...
Context Semantic Analysis: a knowledge-based technique for computing inter-do...
Fabio Benedetti
 
Qald 7 at ESWC2017
Qald 7 at ESWC2017Qald 7 at ESWC2017
Qald 7 at ESWC2017
Giulio Napolitano
 
QALD-7 Question Answering over Linked Data Challenge
QALD-7 Question Answering over Linked Data ChallengeQALD-7 Question Answering over Linked Data Challenge
QALD-7 Question Answering over Linked Data Challenge
Holistic Benchmarking of Big Linked Data
 
The web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedThe web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedSören Auer
 
Open Data - a goldmine (JavaZone 2009)
Open Data - a goldmine (JavaZone 2009)Open Data - a goldmine (JavaZone 2009)
Open Data - a goldmine (JavaZone 2009)
Svein-Magnus Sørensen
 
Learning-based Data Cleaning
Learning-based Data CleaningLearning-based Data Cleaning
Learning-based Data Cleaning
Christian Stade-Schuldt
 
Standing-off Trees and Graphs : on the affordance of technologies for the edi...
Standing-off Trees and Graphs : on the affordance of technologies for the edi...Standing-off Trees and Graphs : on the affordance of technologies for the edi...
Standing-off Trees and Graphs : on the affordance of technologies for the edi...
Georg Vogeler
 
Web open standards for linked data and knowledge graphs as enablers of EU dig...
Web open standards for linked data and knowledge graphs as enablers of EU dig...Web open standards for linked data and knowledge graphs as enablers of EU dig...
Web open standards for linked data and knowledge graphs as enablers of EU dig...
Fabien Gandon
 

What's hot (8)

Context Semantic Analysis: a knowledge-based technique for computing inter-do...
Context Semantic Analysis: a knowledge-based technique for computing inter-do...Context Semantic Analysis: a knowledge-based technique for computing inter-do...
Context Semantic Analysis: a knowledge-based technique for computing inter-do...
 
Qald 7 at ESWC2017
Qald 7 at ESWC2017Qald 7 at ESWC2017
Qald 7 at ESWC2017
 
QALD-7 Question Answering over Linked Data Challenge
QALD-7 Question Answering over Linked Data ChallengeQALD-7 Question Answering over Linked Data Challenge
QALD-7 Question Answering over Linked Data Challenge
 
The web of interlinked data and knowledge stripped
The web of interlinked data and knowledge strippedThe web of interlinked data and knowledge stripped
The web of interlinked data and knowledge stripped
 
Open Data - a goldmine (JavaZone 2009)
Open Data - a goldmine (JavaZone 2009)Open Data - a goldmine (JavaZone 2009)
Open Data - a goldmine (JavaZone 2009)
 
Learning-based Data Cleaning
Learning-based Data CleaningLearning-based Data Cleaning
Learning-based Data Cleaning
 
Standing-off Trees and Graphs : on the affordance of technologies for the edi...
Standing-off Trees and Graphs : on the affordance of technologies for the edi...Standing-off Trees and Graphs : on the affordance of technologies for the edi...
Standing-off Trees and Graphs : on the affordance of technologies for the edi...
 
Web open standards for linked data and knowledge graphs as enablers of EU dig...
Web open standards for linked data and knowledge graphs as enablers of EU dig...Web open standards for linked data and knowledge graphs as enablers of EU dig...
Web open standards for linked data and knowledge graphs as enablers of EU dig...
 

Viewers also liked

Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsBenjamin Habegger
 
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challengesEnterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Yunyao Li
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the Web
Tommaso Teofili
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
Rubén Izquierdo Beviá
 
Using the Web of Data for Information Extraction
Using the Web of Data for Information ExtractionUsing the Web of Data for Information Extraction
Using the Web of Data for Information Extraction
Benjamin Adrian
 
Efficient Top-k Algorithms for Fuzzy Search in String Collections
Efficient Top-k Algorithms for Fuzzy Search in String CollectionsEfficient Top-k Algorithms for Fuzzy Search in String Collections
Efficient Top-k Algorithms for Fuzzy Search in String Collections
rvernica
 
[EN] Capture Indexing & Auto-Classification | DLM Forum Industry Whitepaper 0...
[EN] Capture Indexing & Auto-Classification | DLM Forum Industry Whitepaper 0...[EN] Capture Indexing & Auto-Classification | DLM Forum Industry Whitepaper 0...
[EN] Capture Indexing & Auto-Classification | DLM Forum Industry Whitepaper 0...
PROJECT CONSULT Unternehmensberatung Dr. Ulrich Kampffmeyer GmbH
 
IRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersIRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research Papers
SriTeja Allaparthi
 
Multimodal Information Extraction: Disease, Date and Location Retrieval
Multimodal Information Extraction: Disease, Date and Location RetrievalMultimodal Information Extraction: Disease, Date and Location Retrieval
Multimodal Information Extraction: Disease, Date and Location RetrievalSvitlana volkova
 
Mining Product Synonyms - Slides
Mining Product Synonyms - SlidesMining Product Synonyms - Slides
Mining Product Synonyms - Slides
Ankush Jain
 
Web Information Retrieval and Mining
Web Information Retrieval and MiningWeb Information Retrieval and Mining
Web Information Retrieval and Mining
Carlos Castillo (ChaTo)
 
Web Information Extraction Learning based on Probabilistic Graphical Models
Web Information Extraction Learning based on Probabilistic Graphical ModelsWeb Information Extraction Learning based on Probabilistic Graphical Models
Web Information Extraction Learning based on Probabilistic Graphical ModelsGUANBO
 
Group-13 Project 15 Sub event detection on social media
Group-13 Project 15 Sub event detection on social mediaGroup-13 Project 15 Sub event detection on social media
Group-13 Project 15 Sub event detection on social mediaAhmedali Durga
 
System for-health-diagnosis
System for-health-diagnosisSystem for-health-diagnosis
System for-health-diagnosisask2372
 
Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Textbutest
 
A survey of_eigenvector_methods_for_web_information_retrieval
A survey of_eigenvector_methods_for_web_information_retrievalA survey of_eigenvector_methods_for_web_information_retrieval
A survey of_eigenvector_methods_for_web_information_retrievalChen Xi
 
Information_retrieval_and_extraction_IIIT
Information_retrieval_and_extraction_IIITInformation_retrieval_and_extraction_IIIT
Information_retrieval_and_extraction_IIIT
Ankit Sharma
 
Open Information Extraction 2nd
Open Information Extraction 2ndOpen Information Extraction 2nd
Open Information Extraction 2ndhit_alex
 
Information Retrieval and Extraction
Information Retrieval and ExtractionInformation Retrieval and Extraction
Information Retrieval and Extraction
Christopher Frenz
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & Extraction
Deeksha thakur
 








Information Extraction from Web-Scale N-Gram Data

  • 1. Information Extraction from Web-Scale N-Gram Data. Niket Tandon and Gerard de Melo, 2010-07-23. Max Planck Institute for Informatics, Saarbrücken, Germany. 1 / 27 Information Extraction from Web-Scale N-Gram Data
  • 2. Outline: 1 Information Extraction; 2 N-Gram Information Extraction; 3 Experiments; 4 Conclusion
  • 7. Introduction: Information Extraction. Users generally want information, not documents: structured data; direct, instant answers.
  • 8. Introduction: Information Extraction. Users generally want information, not documents: structured data; direct, instant answers; ... and more. Other applications: query expansion, semantic analysis, faceted search, entity tracking, document enrichment, mobile services, visual object recognition, etc.
  • 11. Introduction: Information Extraction. Users generally want information, not documents: structured data; direct, instant answers; ... and more. Where do we obtain such data?
  • 12. Introduction: Information Extraction. Users generally want information, not documents: structured data; direct, instant answers; ... and more. Where do we obtain such data? Raw text snippets shown on the slide:
      and the love of friends' [p] Happy as the grass was green' [p] Come live with me, and be my
      lawns swoop around the sunken garden. The grass is emerald green and perfect-a tribute to
      overlooking the silver river. All round her the grass stretched green, but stunted, browning in the
      the ground steadied beneath them, and the grass turned green, swishing high around their
      to see the sun shine, the flowers blossom, the grass grow green. I could not bear to hear the
      are quite dwarf. M. sinensis. Chinese silver grass. Ample green- and silver-striped foliage but
      in either of them." It was summer and the grass was green. Clive Rappaport was a solicitor,
      however, each bank is lined with stands of grass that remain green and stand taller than the
      groaned and farted and schemed for snatches of grass that showed green at the corners of his
      bits, the flowers were blossoming profusely and the grass was richly green. The people of the village
      Song. [f] He is dead and gone; At his head a grass-green turf, At his heels a stone." O, ho! [f]
      hard thoughts I stand by popple scrub, in tall grass, blown over and harsh, green and dry. From my
      Well the sky is blue and er [tc text=pause] the grass is green and [tc text=pause] there's
      Yes. Yes. [F01] Dreadful things. Erm so the grass was never quite as green [ZF1] as [ZF0] as
      be beautiful on there really beautiful. All the grass lush and green not a car parked on it
  • 13. How do we get Structured Data? Structured data: isA(Guggenheim,Museum); locatedIn(Guggenheim,Manhattan); partOf(Manhattan,NewYork); ...
  • 16. How do we get Structured Data? Pattern-Based Approaches: use simple textual patterns to extract information (Lyons 1977, Cruse 1986, Hearst 1992). E.g. “<Y> such as <X>”: “cities such as Salem” yields isA(Salem,City). E.g. “<X> and other <Y>”: “Lausanne and other cities” yields isA(Lausanne,City).
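The two patterns on this slide can be tried out with a few lines of Python. This is only a minimal sketch, not the authors' implementation: the regular expressions assume a single capitalized word for <X> and a single lowercase plural noun for <Y>, and `singularize` is a hypothetical, deliberately naive helper.

```python
import re

# Two Hearst-style patterns from the slide, written as regular expressions.
# Assumption: <X> is one capitalized word, <Y> is one lowercase word.
PATTERNS = [
    (re.compile(r"\b(?P<y>[a-z]+) such as (?P<x>[A-Z]\w+)"), "isA"),   # "<Y> such as <X>"
    (re.compile(r"\b(?P<x>[A-Z]\w+) and other (?P<y>[a-z]+)"), "isA"),  # "<X> and other <Y>"
]


def singularize(noun):
    """Naive plural stripping (hypothetical helper): 'cities' -> 'city'."""
    if noun.endswith("ies"):
        return noun[:-3] + "y"
    if noun.endswith("s"):
        return noun[:-1]
    return noun


def extract_facts(text):
    """Return a set of (relation, X, Y) triples found by the patterns."""
    facts = set()
    for regex, relation in PATTERNS:
        for match in regex.finditer(text):
            concept = singularize(match.group("y")).capitalize()
            facts.add((relation, match.group("x"), concept))
    return facts


facts = extract_facts(
    "He visited cities such as Salem. Lausanne and other cities are nearby."
)
for relation, x, y in sorted(facts):
    print(f"{relation}({x},{y})")
```

Multi-word entities, lowercase instances, and irregular plurals would all need more careful handling than this sketch provides.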
  • 18. How do we get Structured Data? Problem: Pattern Matches are Rare. Hearst found only 46 facts in a 20-million-word collection of New York Times articles. One Possibility: Sophisticated NLP (1990s): MUC evaluation initiative, CRF-style segmentation methods, etc.
  • 22. How do we get Structured Data? Problem: Pattern Matches are Rare. Hearst found only 46 facts in a 20-million-word collection of New York Times articles. Alternative: Use Larger Corpora. American National Corpus: 22 million words; British National Corpus: 100 million words; English Wikipedia: 1 000 million words. Agichtein (2005), Pantel (2004): scalable IE, but still only a small fraction of the entire Web.
  • 25. Web Search Engines. Problems: need to know what you’re looking for; can only retrieve top-k results; very slow: days instead of minutes (Cafarella 2005). Instead: use n-gram statistics derived from very large parts of the Web!
  • 26. N-Gram Information Extraction. Outline: 1. Information Extraction; 2. N-Gram Information Extraction; 3. Experiments; 4. Conclusion.
  • 28. N-Gram Information Extraction: N-Gram Data. Web-scale n-gram datasets: n-gram statistics derived from around 10^12 words of text are available. They provide frequencies (or a language model) for strings. Example: f(“cities such as Geneva”) = ..., f(“Zürich and other cities”) = ..., f(“Lausanne and other Swiss cities”) = ...
  • 29. N-Gram Information Extraction: Requirements. Usually binary relationships between entities. OK: facts that are independently extractable, e.g. the founding year and the location of an organization. Not OK: “<V> imported <W> dollars worth of <X> from <Y> in year <Z>”.
  • 31. N-Gram Information Extraction: Requirements. Usually binary relationships between entities. Short items of interest. OK: birthYear(Mozart,1756). Not OK: fatherOf(Wolfgang Amadeus Mozart,F. X. Mozart).
  • 32. N-Gram Information Extraction: Requirements. Usually binary relationships between entities. Short items of interest. No way: fatherOf(Johannes Chrysostomus Wolfgangus Theophilus Mozart, Franz Xaver Wolfgang Mozart).
  • 33. N-Gram Information Extraction: Requirements. Usually binary relationships between entities. Short items of interest. Short patterns. OK: “<X> and other <Y>”. Not OK: “<X> has an inflation rate of <Y>”.
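The "short items, short patterns" requirement can be illustrated with a simple length check. This is a minimal sketch, assuming whitespace tokenization and a maximum n-gram length of 5 (the Google Web 1T corpus covers n = 1 ... 5). The function names and the fill values "Chile" and "3%" are hypothetical and chosen only for illustration.

```python
def instance_length(pattern: str, x: str, y: str) -> int:
    """Number of whitespace-separated tokens after filling <X> and <Y>."""
    filled = pattern.replace("<X>", x).replace("<Y>", y)
    return len(filled.split())


def fits_in_ngram(pattern: str, x: str, y: str, max_n: int = 5) -> bool:
    """True if the filled pattern could occur as one n-gram with n <= max_n."""
    return instance_length(pattern, x, y) <= max_n


# A short pattern with short items fits into a 5-gram ...
print(fits_in_ngram("<X> and other <Y>", "Zürich", "cities"))
# ... while a long pattern does not, whatever the items are.
print(fits_in_ngram("<X> has an inflation rate of <Y>", "Chile", "3%"))
```

Real n-gram datasets tokenize punctuation separately, so actual token counts can be higher than this whitespace-based estimate.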
  • 36. N-Gram Information Extraction: Risks. Influence of spam and boilerplate text. Less control over the selection of input documents. Less context information (WSD, POS tagging, parsing).
  • 38. N-Gram Information Extraction: Then why use n-grams? Much larger input (petabytes of original data): better coverage and higher precision (more evidence, more redundancy). Pantel (2004): more data allows a rather simple technique to outperform much more sophisticated algorithms. Availability: larger than available document collections; crawling the Web is slow and requires link farm detection and high bandwidth.
  • 40. N-Gram Information Extraction: Information Extraction Algorithm. Step 1: collect patterns. Input: seed tuples for a relation, e.g. for the isA relation: (dogs,animals), (gold,metal); for partOf: (finger,hand), (leaves,trees), (windows,houses).
  • 41. N-Gram Information Extraction: Information Extraction Algorithm. Step 1: collect patterns. Input: seed tuples for a relation. Find n-grams containing seeds: query the n-gram dataset for “dogs * animals” (and “animals * dogs”); alternatively “dogs ? animals”, “dogs ? ? animals”, ...; alternatively, fall back to a separate document collection.
  • 42. N-Gram Information Extraction: Information Extraction Algorithm. Step 1: collect patterns. Input: seed tuples for a relation. Find n-grams containing seeds. Generalize to textual patterns: (dogs,animals) found in “... dogs and other animals ...” yields “<X> and other <Y>”.
  • 43. N-Gram Information Extraction: Information Extraction Algorithm. Step 1: collect patterns (input: seed tuples for a relation; find n-grams containing seeds; generalize to textual patterns). Step 2: search for the patterns in the n-gram data to obtain candidate tuples. “<X> and other <Y>” finds (Zürich,cities) in “Zürich and other cities” and (apples,fruits) in “apples and other fruits”.
  • 44. N-Gram Information Extraction: Information Extraction Algorithm. Step 1: collect patterns (input: seed tuples for a relation; find n-grams containing seeds; generalize to textual patterns). Step 2: search for the patterns in the n-gram data to obtain candidate tuples. Step 3: finally, rank the candidate tuples and choose the output tuples. Supervised learning based on a labeled set of tuples. Output: accepted tuples like (Geneva,city).
  • 45. N-Gram Information Extraction: Information Extraction Algorithm. Step 1: collect patterns (input: seed tuples for a relation; find n-grams containing seeds; generalize to textual patterns). Step 2: search for the patterns in the n-gram data to obtain candidate tuples. Step 3: finally, rank the candidate tuples and choose the output tuples. Features for a tuple (x, y): $f_i(p(x, y))$ for each data source $i$ and pattern $p$; $\sum_{p \in P} f_i(p(x, y))$ for each data source $i$.
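The three steps can be sketched on a toy in-memory n-gram table. This is an illustrative sketch under stated assumptions, not the authors' implementation: the counts in `NGRAMS` are made up, only a single data source is modelled, items are restricted to single tokens, and the function names (`induce_patterns`, `find_candidates`, `features`) are hypothetical. The final ranking step (supervised learning over the feature vectors) is left out.

```python
import re
from collections import defaultdict

# Toy stand-in for an n-gram dataset: n-gram -> frequency (all counts are made up).
NGRAMS = {
    "dogs and other animals": 120,
    "gold and other metal": 15,
    "animals such as dogs": 40,
    "Zürich and other cities": 80,
    "apples and other fruits": 60,
    "cities such as Geneva": 90,
    "Lausanne and other Swiss cities": 30,
}
SEEDS = [("dogs", "animals"), ("gold", "metal")]  # seed tuples for isA


def induce_patterns(seeds, ngrams):
    """Step 1: find n-grams containing a seed pair and generalize them."""
    patterns = set()
    for x, y in seeds:
        for ngram in ngrams:
            tokens = ngram.split()
            # Keep it simple: each seed word must occur exactly once.
            if tokens.count(x) == 1 and tokens.count(y) == 1:
                patterns.add(" ".join(
                    "<X>" if t == x else "<Y>" if t == y else t for t in tokens))
    return patterns


def pattern_to_regex(pattern):
    """Turn '<X> and other <Y>' into a regex with single-token slots."""
    parts = []
    for token in pattern.split():
        if token == "<X>":
            parts.append(r"(?P<x>\S+)")
        elif token == "<Y>":
            parts.append(r"(?P<y>\S+)")
        else:
            parts.append(re.escape(token))
    return re.compile(" ".join(parts))


def find_candidates(patterns, ngrams):
    """Step 2: match patterns against all n-grams; (x, y) -> {pattern: frequency}."""
    matches = defaultdict(dict)
    for pattern in patterns:
        regex = pattern_to_regex(pattern)
        for ngram, freq in ngrams.items():
            m = regex.fullmatch(ngram)
            if m:
                pair = (m.group("x"), m.group("y"))
                matches[pair][pattern] = matches[pair].get(pattern, 0) + freq
    return matches


def features(pattern_freqs, patterns):
    """Step 3 input: one frequency per pattern, plus their sum over all patterns."""
    per_pattern = [pattern_freqs.get(p, 0) for p in sorted(patterns)]
    return per_pattern + [sum(per_pattern)]


patterns = induce_patterns(SEEDS, NGRAMS)
candidates = find_candidates(patterns, NGRAMS)
for pair in sorted(candidates):
    print(pair, features(candidates[pair], patterns))
```

With several data sources, the same feature block would be repeated once per source and the vectors concatenated before being passed to a classifier.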
  • 46. Experiments. Outline: 1. Information Extraction; 2. N-Gram Information Extraction; 3. Experiments; 4. Conclusion.
  • 50. Experiments: Datasets. 1. Google Web 1T 5-Gram Corpus: contains n-gram statistics for n = 1 ... 5; generated from around 10^12 words of text; positive: distributed (around 60 GB uncompressed); negative: cut-off frequency 40.
  • 54. Experiments: Datasets. 1. Google Web 1T 5-Gram Corpus. 2. Microsoft Web N-gram Corpus: currently 3- and 4-grams, smoothed language models; generated from around 1.4T tokens, the complete English (US) version of the Bing index; also statistics from titles (12.5G tokens) and anchor texts (357G tokens); WSDL-based web service.
  • 55. Experiments: Datasets. 1. Google Web 1T 5-Gram Corpus. 2. Microsoft Web N-gram Corpus. 3. ClueWeb09 5-grams: 500 million web pages, 700M 5-grams.
  • 56. Experiments: Seeds and Patterns.
      Relation | Seeds | Patterns discovered
      isA | 100 | 2991
      partOf | 100 | 3883
      hasProperty | 100 | 3175
    Seeds from MIT ConceptNet. Even among the highest-ranked: partOf(children,parents) and isA(winning,everything).
  • 57. Experiments: Pattern Examples: isA.
      Pattern | PMI range
      <X> and almost any <Y> | high
      <X> betting basketball betting <Y> | high
      <X> is my favorite <Y> | high
      <X> shoes online shoes <Y> | high
      <X> is a <Y> | medium
      <X> is the best <Y> | medium
      <X> or any other <Y> | medium
      <X> , and <Y> | medium
      <X> and other smart <Y> | medium
      <X> and grammar <Y> | low
      <X> content of the <Y> | low
      <X> when it changes <Y> | low
  • 58. Experiments: Pattern Examples: partOf.
      Pattern | PMI range
      <X> with the other <Y> | high
      <X> of the top <Y> | high
      <X> online <Y> | high
      <X> shoes online shoes <Y> | high
      <X> from the <Y> | medium
      <X> or even entire <Y> | medium
      <X> of host <Y> | medium
      <X> from <Y> | medium
      <X> of a different <Y> | medium
      <X> entertainment and <Y> | low
      <X> Download for thou <Y> | low
      <X> company home in <Y> | low
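The slides group patterns by PMI range but do not say how the PMI scores were computed or which two events were paired. For orientation only, the textbook definition of pointwise mutual information, pmi(a, b) = log2( p(a, b) / (p(a) p(b)) ), can be computed from raw counts as below. The function name and the example counts are hypothetical, and this is not necessarily the scoring used in the experiments.

```python
import math


def pmi(joint_count: int, count_a: int, count_b: int, total: int) -> float:
    """Textbook PMI in bits: log2( p(a, b) / (p(a) * p(b)) ) from raw counts."""
    return math.log2((joint_count * total) / (count_a * count_b))


# Made-up counts: a and b each occur 100 times among 1000 observations.
print(pmi(10, 100, 100, 1000))  # co-occur as often as chance predicts
print(pmi(50, 100, 100, 1000))  # co-occur five times more often than chance
```

A score above zero means the two events co-occur more often than independence would predict; a score below zero means less often.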
  • 59. Experiments: Patterns: Microsoft Document Body 3-grams vs. Anchor 3-grams (each point represents the sum of pattern scores for a tuple).
  • 60. Experiments: Patterns: Microsoft Document Body 3-grams vs. Title 3-grams (each point represents the sum of pattern scores for a tuple).
  • 61. Experiments: Patterns: Microsoft Document Body 3-grams vs. Google Body 3-grams (each point represents the sum of pattern scores for a tuple).
  • 65. Experiments: Overall Results (all data sources simultaneously). Approach: learning with RBF-kernel SVMs (also: random forests, C4.5, AdaBoost); ∼500 random labelled examples per relation (matching any of the patterns); 10-fold leave-one-out cross-validation ⇒ recall is relative to the union of pattern matches.
  • 67. Experiments: Overall Results (all data sources simultaneously).
      Relation | Precision | Recall | F1 | Output per million n-grams [1]
      isA | 88.9% | 8.1% | 14.8% | 983
      partOf | 80.5% | 34.0% | 47.8% | 7897
      hasProperty | 75.3% | 99.3% | 85.6% | 26180
    [1] The expected number of distinct accepted tuples per million input n-grams (the total number of 5-grams in the Google Web 1T dataset is ∼1,176 million).
    Linguistic information is implicitly captured via combinations of patterns!
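As a quick consistency check, the F1 column can be recomputed from the reported precision and recall using the standard harmonic-mean formula F1 = 2PR / (P + R). The sketch below uses the rounded percentages from the table, so small differences in the last digit are expected; the variable and function names are chosen for illustration.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), all in percent, copied from the table.
reported = {
    "isA": (88.9, 8.1, 14.8),
    "partOf": (80.5, 34.0, 47.8),
    "hasProperty": (75.3, 99.3, 85.6),
}

for relation, (p, r, f) in reported.items():
    print(f"{relation}: recomputed F1 = {f1(p, r):.2f}%, reported = {f}%")
```

The recomputed values agree with the reported ones to within rounding, which also shows why isA has a low F1 despite the highest precision: its recall of 8.1% dominates the harmonic mean.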
  • 68. Experiments: Detailed Results (partOf relation).
      Dataset | Source | Prec. | Recall | F1
      Google 3-grams | Document Body | 55.9% | 38.5% | 45.6%
      Google 4-grams | Document Body | 52.6% | 43.3% | 47.5%
      Google 5-grams | Document Body | 48.1% | 42.8% | 45.3%
      ClueWeb 5-grams | Document Body | 51.7% | 35.6% | 42.2%
      Google 3-/4-grams | Document Body | 53.9% | 42.8% | 47.7%
      Google 3-/4-/5-grams | Document Body | 58.7% | 43.8% | 50.1%
  • 69. Experiments: Detailed Results (partOf relation).
      Dataset | Source | Prec. | Recall | F1
      Microsoft 3-grams | Document Body | 58.5% | 33.2% | 42.3%
      Microsoft 3-grams | Document Title | 51.7% | 29.8% | 37.8%
      Microsoft 3-grams | Anchor Text | 57.3% | 36.1% | 44.2%
      Microsoft 3-grams | Body / Title / Anchor | 40.4% | 100.0% | 57.5%
      Google 3-grams | Document Body | 55.9% | 38.5% | 45.6%
      Microsoft 3/4-grams | Body (3-grams only) / Title / Anchor | 40.5% | 98.1% | 57.3%
      Google 3/4-grams | Document Body | 53.9% | 42.8% | 47.7%
      Google 3/4/5-grams | Document Body | 58.7% | 43.8% | 50.1%
      All 3/4/5-grams | Body / Title / Anchor | 80.5% | 34.0% | 47.8%
  • 70. Experiments: Example: hasProperty. Properties of “flowers”.
  • 71. Conclusion. Outline: 1. Information Extraction; 2. N-Gram Information Extraction; 3. Experiments; 4. Conclusion.
  • 75. Conclusion: Summary, Lessons Learnt. N-gram datasets allow for Information Extraction from petabytes of original data. Requirements: short entity names, short patterns. More data helps (even at very large scales). Diversity of data sources helps.