Search engines increasingly rely on structured data to provide direct answers to certain types of queries. However, extracting such structured data from text is challenging, especially due to the scarcity of explicitly expressed knowledge. Even when relying on large document collections, pattern-based information extraction approaches typically yield only insufficient amounts of information. This paper evaluates to what extent n-gram statistics, derived from volumes of text several orders of magnitude larger than typical corpora, can allow us to overcome this bottleneck. An extensive experimental evaluation is provided for three different binary relations, comparing different sources of n-gram data as well as different learning algorithms.
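As a rough illustration of the idea in the abstract, the sketch below scores candidate fact pairs by the corpus frequency of an instantiated textual pattern. The pattern, the toy n-gram table, and all counts are invented for illustration; an actual system would query Web-scale n-gram data and combine many patterns with a learning algorithm.

```python
# Toy stand-in for a Web-scale n-gram frequency table (invented counts).
NGRAM_COUNTS = {
    "einstein was born in ulm": 1200,
    "einstein was born in munich": 40,
    "mozart was born in salzburg": 900,
}

def score_pair(subj, obj, pattern="{} was born in {}"):
    """Return the corpus frequency of the instantiated pattern (0 if unseen)."""
    return NGRAM_COUNTS.get(pattern.format(subj, obj), 0)

def best_object(subj, candidates):
    """Pick the candidate object with the highest n-gram support."""
    return max(candidates, key=lambda obj: score_pair(subj, obj))

print(best_object("einstein", ["ulm", "munich", "paris"]))  # -> ulm
```

In practice the frequency signal would feed a learned classifier rather than a raw argmax, which is exactly the comparison of learning algorithms the abstract describes.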
Interlinking Standardized OpenStreetMap Data and Citizen Science Data in the ... (Werner Leyh)
Abstract. The aim of this work is to explore the opportunities offered by semantic standardization to interlink primary “spatial data” (GI) from “OpenStreetMap” (OSM) with repositories of the “Linked Open Data Cloud” (LOD). Research in the natural sciences can generate vast amounts of spatial data, and Wikidata could be considered the central hub between more detailed natural-science hubs on the spatial semantic web. Wikidata is a world-readable and world-writable community-driven knowledge base. It offers the opportunity to collaboratively construct an open-access knowledge graph that spans biology, medicine, and all other domains of knowledge. In this study, we discuss the opportunities and challenges of exploring Wikidata as a central integration facility by interlinking it with OSM, a popular, community-driven collection of free geographic data. This is empowered by the reuse of terms and properties from commonly understood controlled vocabularies that represent their respective well-identified knowledge domains.
URL: https://www.springerprofessional.de/en/interlinking-standardized-openstreetmap-data-and-citizen-science/13302088
DOI: https://doi.org/10.1007/978-3-319-60366-7_9
Werner Leyh, Homero Fonseca Filho
University of São Paulo (USP), São Paulo, Brazil
WernerLeyh@yahoo.com
In search of the ideal CSV template to map elections
The Minimal Set
Election Results
- Tricky Write-ins
- Don’t forget the Residual Vote
- From RAW to DATA
- Percentages? What Percentages?
- Mapping Issues
Colors & Positioning (Electoral Compass)
- Country Specific
- Global Color Scheme
Visualising the Australian open data and research data landscape (Jonathan Yu)
"Visualising the Australian open data and research data landscape", presented at C3DIS, May 2018, in Melbourne. In this talk, we presented work on visualising a survey of open government and research data in Australia. It features a first attempt at formalising a quantitative approach to measuring the data ecosystem in Australia.
Redundancy analysis on linked data #cold2014 #ISWC2014 (honghan2013)
Wu, Honghan, Boris Villazon-Terrazas, Jeff Z. Pan, and Jose Manuel Gomez-Perez. “How Redundant Is It? – An Empirical Analysis on Linked Datasets.” In ISWC COLD Workshop. 2014.
http://ceur-ws.org/Vol-1264/cold2014_WuVPG.pdf
Context Semantic Analysis: a knowledge-based technique for computing inter-do... (Fabio Benedetti)
Presented at SISAP 2016 (http://sisap.org/2016/index.html)
Paper: https://goo.gl/xAcyTq
Abstract:
We propose a novel knowledge-based technique for inter-document similarity, called Context Semantic Analysis (CSA). Several specialized approaches built on top of a specific knowledge base (e.g., Wikipedia) exist in the literature, but CSA differs from them because it is designed to be portable to any RDF knowledge base. Our technique relies on a generic RDF knowledge base (e.g., DBpedia or Wikidata) to extract from it a vector able to represent the context of a document. We show how such a Semantic Context Vector can be effectively exploited to compute inter-document similarity. Experimental results show that our general technique outperforms baselines built on top of traditional methods, and achieves performance similar to that of specialized methods.
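A minimal sketch of the underlying idea, not the CSA implementation itself: each document is represented as a sparse vector of knowledge-base concepts with weights, and two documents are compared by the cosine of their vectors. The concept identifiers and weights below are invented for illustration.

```python
import math
from collections import Counter

def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse concept-weight vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy "Semantic Context Vectors": weights of KB concepts linked from each document.
doc_a = Counter({"dbpedia:Machine_learning": 3, "dbpedia:Statistics": 1})
doc_b = Counter({"dbpedia:Machine_learning": 2, "dbpedia:Neural_network": 2})
doc_c = Counter({"dbpedia:Cooking": 4})

print(cosine(doc_a, doc_b) > cosine(doc_a, doc_c))  # True
```

The real technique additionally exploits the RDF graph structure when building the vectors; the comparison step, however, reduces to exactly this kind of sparse vector similarity.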
Presentation of QALD 7 challenge at ESWC2017: Question Answering over Linked Data.
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
QALD-7 @ ESWC 2017 Portoroz, Slovenia
An introduction to the basics and benefits of using Open Data. Slides from my presentation of this topic at the JavaZone 2009 conference in Oslo, Norway.
Standing-off Trees and Graphs: on the affordance of technologies for the edi... (Georg Vogeler)
Presentation at Workshop on Scholarly Digital Editions, Graph Data-Models and Semantic Web Technologies, Université de Lausanne, 3-4 June 2019 | #graphSDE2019
(http://wp.unil.ch/graphsde/program/)
Web open standards for linked data and knowledge graphs as enablers of EU dig... (Fabien Gandon)
Web open standards for linked data and knowledge graphs as enablers of EU digital sovereignty
ENDORSE Keynote by Fabien GANDON, 19/03/2021
https://op.europa.eu/en/web/endorse
Slides about "Information and Data Extraction on the Web" for "Information management on the Web" course at DIA (Computer Science Department) of Roma Tre University
Using the Web of Data for Information Extraction (Benjamin Adrian)
Talk at Insiders Technologies, 21.01.2010. It covers publishing RDF data with D2R Server, linking the data to obtain Linked Data, querying the data with SPARQL via SQUIN, and finally annotating text with this data using RDFa in Epiphany.
Efficient Top-k Algorithms for Fuzzy Search in String Collections (rvernica)
An approximate search query on a collection of strings finds those strings in the collection that are similar to a given query string, where similarity is defined using a given similarity function such as Jaccard, cosine, and edit distance. Answering approximate queries efficiently is important in many applications such as search engines, data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. In this paper, we study the problem of efficiently computing the best answers to an approximate string query, where the quality of a string is based on both its importance and its similarity to the query string. We first develop a progressive algorithm that answers a ranking query by using the results of several approximate range queries, leveraging existing search techniques. We then develop efficient algorithms for answering ranking queries using indexing structures of gram-based inverted lists. We answer a ranking query by traversing the inverted lists, pruning and skipping irrelevant string ids, iteratively increasing the pruning and skipping power, and doing early termination. We have conducted extensive experiments on real datasets to evaluate the proposed algorithms and report our findings.
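A much-simplified sketch of the gram-based approach described above, not the paper's actual algorithms: candidates are gathered by merging q-gram inverted lists, filtered by a gram-count threshold, then verified and ranked by a similarity-times-importance score. Here `difflib`'s ratio stands in for a proper edit-distance verifier, and all strings, weights, and thresholds are invented.

```python
from collections import defaultdict
from difflib import SequenceMatcher  # stand-in for a real edit-distance verifier

Q = 2  # gram length

def grams(s):
    s = "#" + s + "#"  # pad so prefixes and suffixes produce grams
    return {s[i:i + Q] for i in range(len(s) - Q + 1)}

def build_index(strings):
    """Gram-based inverted lists: gram -> set of string ids."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in grams(s):
            index[g].add(sid)
    return index

def topk(query, strings, importance, index, k=2):
    # Candidate generation: merge the inverted lists of the query's grams.
    counts = defaultdict(int)
    for g in grams(query):
        for sid in index[g]:
            counts[sid] += 1
    cands = [sid for sid, c in counts.items() if c >= 2]  # count filter (threshold illustrative)
    # Verification + ranking: combine similarity with string importance.
    def score(sid):
        return SequenceMatcher(None, query, strings[sid]).ratio() * importance[sid]
    return sorted(cands, key=score, reverse=True)[:k]

names = ["schwarzenegger", "schwarzeneger", "stallone"]
weights = [1.0, 0.5, 1.0]
idx = build_index(names)
print(topk("schwarzenegger", names, weights, idx))
```

The paper's contribution lies in doing this incrementally, with pruning, list skipping, and early termination so that the full candidate set never needs to be materialized; this sketch only shows the data structures those optimizations operate on.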
[EN] DLM Forum Industry Whitepaper 01 Capture Indexing & Auto-Classification | SER | Christa Holzenkamp | Hamburg 2002
1. Introduction
2. The importance of safe indexing
2.1 Description of the problem
2.2 The challenge of rapidly growing document volumes
2.3 The quality of indexing defines the quality of retrieval
2.4 The role of metadata for indexing and information exchange
2.5 The need for quality standards, costs and legal aspects
3. Methods for indexing and auto-categorization
3.1 Types of indexing and categorization methods
3.2 Auto-categorization methods
3.3 Extraction methods
3.4 Handling different types of information and document representations
4. The Role of Databases
4.1 Database types and related indexing
4.2 Indexing and Search methods
4.3 Indexing and retrieval methods using natural languages
5. Standards for Indexing
5.1 Relevant standards for indexing and ordering methods
5.2 Relevant standardisation bodies and initiatives
6. Best Practice Applications
6.1 Automated distribution of incoming documents (project of the Statistical Office of the Free State of Saxony)
6.2 Knowledge-Enabled Content Management (project of CHIP Online International GmbH)
7. Outlook
7.1 Citizen Portals
7.2 Natural language based portals
Glossary
Abbreviations
Authoring Company
Talk based on: Ricardo Baeza-Yates and Carlos Castillo: “Web Retrieval and Mining”. Entry in “Encyclopedia of Library and Information Sciences”, third edition (to appear in 2009).
English 103 Final Test: Reading Poetry (TanaMaeskm)
English 103 Final Test
Reading Poetry for 10%
Answer all five the questions below. For simplicity, you may type your work directly onto this document. Limit each of your answers (e.g. 1a etc.) to approximately four or five sentences.
Before you begin, a note on citing poetry references in MLA style. Direct quotations should be followed by a parenthetical citation that displays the relevant line number(s): for example, “moonless night rim world” (Lowther 2). Short quotations of three lines or less should be enmeshed in your prose and line breaks should be recorded with a slash: for example, “moonless night rim world, / A long-dead city” (Lowther 2-3). No need to provide a list of works cited for the poems at the end of your test paper, if you are using the digital reading package for this course. However, if you consult any secondary sources you must fully reference each and every one.
1. “Vancity” (2 marks)
a. What is the speaker’s situation? Integrate some quoted evidence in your answer.
b. How does Lowther use metaphor skilfully in this poem?
2. “take a st. and” (2 marks)
a. Why does the speaker in this poem care about what’s happening to the water under a street corner in Vancouver?
b. How does the poet arrange text on the page to reflect a central theme in the poem?
3. “What Jack Shadbolt said” (2 marks)
a. Why are Jack Shadbolt’s words important to the speaker in this poem?
b. How does the poet (Carolan) organize the stanzas and line breaks to guide the reader through the poem?
4. “happy birthday dear house” (2 marks)
a. Which sounds and images in this poem most vividly evoke the speaker’s childhood?
b. How do you interpret the lines, “in the land/of the second chance” (lines 20-21)?
5. “Quayside” (2 marks)
a. Identify a central theme in this poem and comment on how the poet (Lau) explores it. Include a few short direct quotes in your answer.
b. Why are the lines “—and then I saw that for years/you have existed all around me” (12-13) especially significant in this poem?
[Scanned book front matter; OCR residue removed:]
Pang-Ning Tan, Michigan State University
Michael Steinbach, University of Minnesota
Vipin Kumar, University of Minnesota and Army High Performance Computing Research Center
Boston, San Francisco, New York, London, Toronto, Sydney, Tokyo, Singapore, Madrid, Mexico City, Munich, Paris, Cape Town, Hong Kong, Montreal
If you purchased this book within the United States or Canada you should be aware that it has been wrongfully imported without the approval of the Publisher or the Author.
Acquisitions Editor: Matt Goldstein
Project Editor: Katherine Harutunian
Production Supervisor: Marilyn Lloyd
Production Services: Paul C. Anagnostopoulos of Windfall Software
Ma ...
PRTR data has been called on to help EU member states with their decision-making policies, but it also has huge potential for raising citizens' awareness of industrial activity and the environment. Making PRTR data freely available without copyright restrictions would therefore boost interest in re-using it, and even in building commercial applications. Indeed, opening data is today one of the key points of the Digital Agenda: Europe has an Open Data Strategy [1], which is expected to deliver a €40 billion boost to the economy per year.

While opening the PRTR data could be done in a variety of ways, there are some guidelines that could be followed to make the work easier for citizens and developers. This talk introduces the project #adoptaunaplaya [2], which aims to correlate pollutant and waste report data with the quality of bathing waters in Spain. The challenges found in re-using PRTR data, and the lessons learnt, lead us to some recommendations for the PRTR open data strategy.
Speech at IMPEL Conference[3] Session #9 (Capacity Building)
[1] http://europa.eu/rapid/press-release_IP-11-1524_en.htm
[2] http://adoptaunaplaya.es
[3] http://environmentconference.mepa.org.mt/programme.html#session9
The term 'Data Scientist' arose fairly recently to express the specialised recruitment needs of certain well-known data-driven Silicon Valley firms. It signifies a mix of diverse and rare talents, mostly drawing from Computer Science (with emphasis on Big Data), Statistics and Machine Learning. In this talk, we will attempt to briefly survey the state of the art, both in terms of problems and solutions, at the vanguard of Data Science. We will cover both novel developments and centuries-old best practices, in an attempt to demonstrate that Data Science is indeed a Science, in the full sense of the word. This talk is part of a seminar series that the speaker has given across the world, including at Google (Mountain View), Cisco (San Jose) and Aviva Headquarters (London), and represents joint work with Professor David Hand (OBE).
Drowning in information – the need of macroscopes for research funding (Andrea Scharnhorst)
Andrea Scharnhorst (2015) Drowning in information – the need of macroscopes for research funding. Presentation at the international conference: PLANNING, PREDICTION, SCENARIOS - Using Simulations and Maps - 2015 Annual EA Conference - 11–12 May 2015 Bonn
2016 07 12_purdue_bigdatainomics_seandavis (Sean Davis)
Newer, faster, cheaper molecular assays are driving biomedical research. I discuss the history of biomedical data, including concepts of data sharing and hypothesis-driven vs. hypothesis-generating research, and the potential to expand our thinking on biomedical research to be much more integrated through smart, creative, and open use of technologies and more flexible, longitudinal studies.
Guest lecture at the Syracuse University School of Information Studies eScience Librarianship Lecture Series (08 Dec 2011).
Description: It’s your government, is it your data? New approaches to building interlinked catalogs of government-produced data. Dr. John S. Erickson, Director of Web Science Operations for the Tetherless World Constellation at Rensselaer Polytechnic Institute will present technical methods being developed to manage the delivery of large-scale open government data projects based on semantic web and linked data best practices.
I summarize requirements for an "Open Analytics Environment" (aka "the Cauldron"), and some work being performed at the University of Chicago and Argonne National Laboratory towards its realization.
Similar to Information Extraction from Web-Scale N-Gram Data
SEMAC Graph Node Embeddings for Link Prediction (Gerard de Melo)
We present a new graph representation learning approach called SEMAC that jointly exploits fine-grained node features as well as the overall graph topology. In contrast to the SGNS or SVD methods espoused in previous representation-based studies, our model represents nodes in terms of subgraph embeddings acquired via a form of convex matrix completion to iteratively reduce the rank, and thereby, more effectively eliminate noise in the representation. Thus, subgraph embeddings and convex matrix completion are elegantly integrated into a novel link prediction framework.
While traditional scholarship has tended to emphasize thorough reading, reflection, and learning, many researchers nowadays – both in academia and industry – find themselves in a fast-paced and demanding environment. A successful research career crucially depends on management-related skills, and devoting some time to such skills is likely to pay off very quickly. One important example is time and task management, which is critical when there are numerous conflicting demands and opportunities. Another example is being able to cope with challenges and failure. Researchers also need to be creative and bold in defending their ideas. This talk provides an overview of these and other skills that are vital in modern research environments.
Knowlywood: Mining Activity Knowledge from Hollywood Narratives (Gerard de Melo)
Knowlywood is a new knowledge graph mined from movies, TV series, and literature. It provides commonsense knowledge about human activities, e.g. participants, preceding and following activities, and so on.
Big Data is more than just hype. The vast quantities of data now available have led to two important challenges that are fundamentally changing the way we develop data-intensive systems. The first is at the data management level, where we are finally moving beyond vanilla MapReduce towards infrastructure that allows for more flexible data processing pipelines. The second challenge is transitioning from quantity to quality and distilling genuine knowledge from the raw data. For this, we still need innovative algorithms that facilitate data cleaning, unsupervised and semi-supervised learning, knowledge harvesting, and knowledge integration. Examples include data integration, and large-scale knowledge bases such as UWN/MENTA, and collections of commonsense knowledge such as WebChild.
Scalable Learning Technologies for Big Data Mining (Gerard de Melo)
These are slides of a tutorial by Gerard de Melo and Aparna Varde presented at the DASFAA 2015 conference.
As data expands into big data, enhanced or entirely novel data mining algorithms often become necessary. The real value of big data is often only exposed when we can adequately mine and learn from it. We provide an overview of new scalable techniques for knowledge discovery. Our focus is on the areas of cloud data mining and machine learning, semi-supervised processing, and deep learning. We also give practical advice for choosing among different methods and discuss open research problems and concerns.
These are slides of a tutorial at ECIR by Gerard de Melo and Katja Hose.
Search is currently undergoing a major paradigm shift away from the traditional document-centric “10 blue links” towards more explicit and actionable information. Recent advances in this area are Google’s Knowledge Graph, Virtual Personal Assistants such as Siri and Google Now, as well as the now ubiquitous entity-oriented vertical search results for places, products, etc. Apart from novel query understanding methods, these developments are largely driven by structured data that is blended into the Web Search experience. We discuss efficient indexing and query processing techniques to work with large amounts of structured data. Finally, we present query interpretation and understanding methods to map user queries to these structured data sources.
From Linked Data to Tightly Integrated Data (Gerard de Melo)
Invited Talk at the 3rd Workshop on Linked Data in Linguistics: Multilingual Knowledge Resources and Natural Language Processing. Reykjavik, Iceland, 27th May 2014
The ideas behind the Web of Linked Data have great allure. Apart from the prospect of large amounts of freely available data, we are also promised nearly effortless interoperability. Common data formats and protocols have indeed made it easier than ever to obtain and work with information from different sources simultaneously, opening up new opportunities in linguistics, library science, and many other areas.
In this talk, however, I argue that the true potential of Linked Data can only be appreciated when extensive cross-linkage and integration engenders an even higher degree of interconnectedness. This can take the form of shared identifiers, e.g. those based on Wikipedia and WordNet, which can be used to describe numerous forms of linguistic and commonsense knowledge. An alternative is to rely on sameAs and similarity links, which can automatically be discovered using scalable approaches like the LINDA algorithm but need to be interpreted with great care, as we have observed in experimental studies. A closer level of linkage is achieved when resources are also connected at the taxonomic level, as exemplified by the MENTA approach to taxonomic data integration. Such integration means that one can buy into ecosystems already carrying a range of valuable pre-existing assets. Even more tightly integrated resources like Lexvo.org combine triples from multiple sources into unified, coherent knowledge bases. Finally, I also comment on how to address some remaining challenges that are still impeding a more widespread adoption of Linked Data on the Web. In the long run, I believe that such steps will lead us to significantly more tightly integrated Linked Data.
UWN: A Large Multilingual Lexical Knowledge Base (Gerard de Melo)
We present UWN, a large multilingual lexical knowledge base that describes the meanings and relationships of words in over 200 languages. This paper explains how link prediction, information integration and taxonomy induction methods have been used to build UWN based on WordNet and extend it with millions of named entities from Wikipedia. We additionally introduce extensions to cover lexical relationships, frame-semantic knowledge, and language data. An online interface provides human access to the data, while a software API enables applications to look up over 16 million words and names.
Multilingual Text Classification using Ontologies (Gerard de Melo)
In this paper, we investigate strategies for automatically classifying documents in different languages thematically, geographically or according to other criteria. A novel linguistically motivated text representation scheme is presented that can be used with machine learning algorithms in order to learn classifications from pre-classified examples and then automatically classify documents that might be provided in entirely different languages. Our approach makes use of ontologies and lexical resources but goes beyond a simple mapping from terms to concepts by fully exploiting the external knowledge manifested in such resources and mapping to entire regions of concepts. For this, a graph traversal algorithm is used to explore related concepts that might be relevant. Extensive testing has shown that our methods lead to significant improvements compared to existing approaches.
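A minimal sketch of the "mapping to entire regions of concepts" idea from the abstract, assuming a toy concept graph: a bounded breadth-first traversal expands a document's seed concepts to related ones, so that the classifier sees a region of the ontology rather than isolated term-to-concept mappings. The graph, concept names, and depth limit are invented for illustration.

```python
from collections import deque

# Toy concept graph (hypothetical): edges link a concept to related concepts.
GRAPH = {
    "finance": ["banking", "economics"],
    "banking": ["interest_rate"],
    "economics": ["market"],
    "sports": ["football"],
}

def expand(seed_concepts, max_depth=2):
    """Breadth-first traversal mapping seed concepts to a region of
    related concepts, up to a fixed traversal depth."""
    seen = set(seed_concepts)
    frontier = deque((c, 0) for c in seed_concepts)
    while frontier:
        concept, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for nb in GRAPH.get(concept, []):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return seen
```

Because concept identifiers are language-independent, the expanded region computed for a German training document can match a Spanish test document, which is what enables the cross-lingual classification described above.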
Extracting Sense-Disambiguated Example Sentences From Parallel Corpora (Gerard de Melo)
Example sentences provide an intuitive means of grasping the meaning of a word, and are frequently used to complement conventional word definitions. When a word has multiple meanings, it is useful to have example sentences for specific senses (and hence definitions) of that word rather than indiscriminately lumping all of them together. In this paper, we investigate to what extent such sense-specific example sentences can be extracted from parallel corpora using lexical knowledge bases for multiple languages as a sense index. We use word sense disambiguation heuristics and a cross-lingual measure of semantic similarity to link example sentences to specific word senses. From the sentences found for a given sense, an algorithm then selects a smaller subset that can be presented to end users, taking into account both representativeness and diversity. Preliminary results show that a precision of around 80% can be obtained for a reasonable number of word senses, and that the subset selection yields convincing results.
Towards a Universal Wordnet by Learning from Combined Evidence (Gerard de Melo)
Lexical databases are invaluable sources of knowledge about words and their meanings, with numerous applications in areas like NLP, IR, and AI. We propose a methodology for the automatic construction of a large-scale multilingual lexical database where words of many languages are hierarchically organized in terms of their meanings and their semantic relations to other words. This resource is bootstrapped from WordNet, a well-known English-language resource. Our approach extends WordNet with around 1.5 million meaning links for 800,000 words in over 200 languages, drawing on evidence extracted from a variety of resources including existing (monolingual) wordnets, (mostly bilingual) translation dictionaries, and parallel corpora. Graph-based scoring functions and statistical learning techniques are used to iteratively integrate this information and build an output graph. Experiments show that this wordnet has a high level of precision and coverage, and that it can be useful in applied tasks such as cross-lingual text classification.
Not Quite the Same: Identity Constraints for the Web of Linked Data (Gerard de Melo)
Linked Data is based on the idea that information from different sources can flexibly be connected to enable novel applications that individual datasets do not support on their own. This hinges upon the existence of links between datasets that would otherwise be isolated. The most notable form of link, the sameAs link, is intended to express that two identifiers are equivalent in all respects. Unfortunately, many existing links do not reflect such genuine identity. This study provides a novel method to analyse this phenomenon, based on a thorough theoretical analysis, as well as a novel graph-based method to resolve such issues to some extent. Our experiments on a representative Web-scale set of sameAs links from the Web of Data show that our method can identify and remove hundreds of thousands of constraint violations.
Good, Great, Excellent: Global Inference of Semantic Intensities (Gerard de Melo)
Adjectives like good, great, and excellent are similar in meaning, but differ in intensity. Intensity order information is very useful for language learners as well as in several NLP tasks, but is missing in most lexical resources (dictionaries, WordNet, and thesauri). In this paper, we present a primarily unsupervised approach that uses semantics from Web-scale data (e.g., phrases like good but not excellent) to rank words by assigning them positions on a continuous scale. We rely on Mixed Integer Linear Programming to jointly determine the ranks, such that individual decisions benefit from global information. When ranking English adjectives, our global algorithm achieves substantial improvements over previous work on both pairwise and rank correlation metrics (specifically, 70% pairwise accuracy as compared to only 56% by previous work). Moreover, our approach can incorporate external synonymy information (increasing its pairwise accuracy to 78%) and extends easily to new languages.
YAGO-SUMO: Integrating YAGO into the Suggested Upper Merged Ontology (Gerard de Melo)
The YAGO-SUMO integration incorporates millions of entities from YAGO, which is based on Wikipedia and WordNet, into the Suggested Upper Merged Ontology (SUMO), a highly axiomatized formal upper ontology. With the combined force of the two ontologies, an enormous, unprecedented corpus of formalized world knowledge is available for automated processing and reasoning, providing information about millions of entities such as people, cities, organizations, and companies.
Compared to the original YAGO, more advanced reasoning is possible due to the axiomatic knowledge delivered by SUMO. A reasoner can conclude e.g. that a child of a human must also be a human and cannot be born before its parents, or that two people sharing the same parents must be siblings.
Information Extraction from Web-Scale N-Gram Data
Niket Tandon and Gerard de Melo
2010-07-23
Max Planck Institute for Informatics, Saarbrücken, Germany
8. Information Extraction: Introduction
Other applications: query expansion, semantic analysis, faceted search, entity tracking, document enrichment, mobile services, visual object recognition, etc.
Users generally want information, not documents: structured data, direct, instant answers, and more.
12. Information Extraction: Introduction
Corpus concordance lines for "grass ... green":
and the love of friends' [p] Happy as the grass was green' [p] Come live with me, and be my
lawns swoop around the sunken garden. The grass is emerald green and perfect-a tribute to
overlooking the silver river. All round her the grass stretched green, but stunted, browning in the
the ground steadied beneath them, and the grass turned green, swishing high around their
to see the sun shine, the flowers blossom, the grass grow green. I could not bear to hear the
are quite dwarf. M. sinensis. Chinese silver grass. Ample green- and silver-striped foliage but
in either of them." It was summer and the grass was green. Clive Rappaport was a solicitor,
however, each bank is lined with stands of grass that remain green and stand taller than the
groaned and farted and schemed for snatches of grass that showed green at the corners of his bits,
the flowers were blossoming profusely and the grass was richly green. The people of the village
Song. [f] He is dead and gone; At his head a grass-green turf, At his heels a stone." O, ho! [f]
hard thoughts I stand by popple scrub, in tall grass, blown over and harsh, green and dry. From my
Well the sky is blue and er [tc text=pause] the grass is green and [tc text=pause] there's
Yes. Yes. [F01] Dreadful things. Erm so the grass was never quite as green [ZF1] as [ZF0] as
be beautiful on there really beautiful. All the grass lush and green not a car parked on it
Where do we obtain such data?
13. Information Extraction: How do we get Structured Data?
Structured data:
isA(Guggenheim,Museum)
locatedIn(Guggenheim,Manhattan)
partOf(Manhattan,NewYork)
. . .
16. How do we get Structured Data? Pattern-Based Approaches
Use simple textual patterns to extract information (Lyons 1977, Cruse 1986, Hearst 1992)
e.g. "<Y> such as <X>": "cities such as Salem" → isA(Salem,City)
e.g. "<X> and other <Y>": "Lausanne and other cities" → isA(Lausanne,City)
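Pattern-based extraction of this kind is easy to prototype. The sketch below, with invented helper names and deliberately naive single-word patterns, applies the two example patterns from the slide via regular expressions; it is an illustration, not the original system:

```python
import re

# Naive sketch: each pattern maps a regex match to a (relation, x, y) tuple.
# Real patterns would need POS information and multi-word argument handling.
PATTERNS = [
    # "<Y> such as <X>" -> isA(X, Y)
    (re.compile(r"(\w+) such as (\w+)"), lambda m: ("isA", m.group(2), m.group(1))),
    # "<X> and other <Y>" -> isA(X, Y)
    (re.compile(r"(\w+) and other (\w+)"), lambda m: ("isA", m.group(1), m.group(2))),
]

def extract(text):
    """Return all (relation, x, y) tuples matched by the patterns."""
    tuples = []
    for pattern, to_tuple in PATTERNS:
        for m in pattern.finditer(text):
            tuples.append(to_tuple(m))
    return tuples

print(extract("cities such as Salem"))       # [('isA', 'Salem', 'cities')]
print(extract("Lausanne and other cities"))  # [('isA', 'Lausanne', 'cities')]
```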
22. How do we get Structured Data? Problem: Pattern Matches are Rare
Hearst found only 46 facts in a 20-million-word New York Times article collection.
One possibility: sophisticated NLP (1990s): MUC evaluation initiative, CRF-style segmentation methods, etc.
Alternative: use larger corpora
American National Corpus: 22 million words
British National Corpus: 100 million words
English Wikipedia: 1,000 million words
Agichtein (2005), Pantel (2004): scalable IE, but still only a small fraction of the entire Web
25. Web Search Engines
Problems:
Need to know what you're looking for
Can only retrieve top-k results
Very slow: days instead of minutes (Cafarella 2005)
Instead: use n-gram statistics derived from very large parts of the Web!
26. Outline
1 Information Extraction
2 N-Gram Information Extraction
3 Experiments
4 Conclusion
28. N-Gram Data: Web-Scale N-Gram Datasets
Web-scale n-gram statistics derived from around 10^12 words of text are available.
Provides: frequencies / a language model for strings
Example: f("cities such as Geneva") = ...
f("Zürich and other cities") = ...
f("Lausanne and other Swiss cities") = ...
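Conceptually, such a dataset behaves like a giant lookup table from strings to corpus frequencies. A toy stand-in (all counts invented for illustration):

```python
# Toy stand-in for a Web-scale n-gram table: maps an n-gram string to its
# corpus frequency. Real datasets hold billions of entries on disk.
NGRAM_FREQ = {
    "cities such as Geneva": 1200,
    "Zürich and other cities": 450,
    "Lausanne and other Swiss cities": 90,
}

def f(ngram):
    """Frequency of an n-gram (0 if unseen or below the dataset's cut-off)."""
    return NGRAM_FREQ.get(ngram, 0)

print(f("cities such as Geneva"))    # 1200
print(f("cities such as Atlantis"))  # 0
```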
33. N-Gram Information Extraction: Requirements
Usually binary relationships between entities
  ok: if independently extractable, e.g. founding year and location of an organization
  not ok: "<V> imported <W> dollars worth of <X> from <Y> in year <Z>"
Short items of interest
  ok: birthYear(Mozart,1756)
  not: fatherOf(Wolfgang Amadeus Mozart, F. X. Mozart)
  no way: fatherOf(Johannes Chrysostomus Wolfgangus Theophilus Mozart, Franz Xaver Wolfgang Mozart)
Short patterns
  ok: "<X> and other <Y>"
  not: "<X> has an inflation rate of <Y>"
36. N-Gram Information Extraction: Risks
Influence of spam and boilerplate text
Less control over the selection of input documents
Less context information (WSD, POS tagging, parsing)
38. Then why use n-grams?
Much larger input (petabytes of original data)
Better coverage
Higher precision (more evidence, more redundancy)
Pantel (2004): more data allows a rather simple technique to outperform much more sophisticated algorithms
Availability: larger than available document collections; crawling the Web yourself is slow, requires link farm detection, and needs high bandwidth
45. Information Extraction Algorithm
1 Collect patterns
  Input: seed tuples for a relation
    e.g. for isA: (dogs,animals), (gold,metal)
    e.g. for partOf: (finger,hand), (leaves,trees), (windows,houses)
  Find n-grams containing the seeds
    Query the n-gram dataset: "dogs * animals" (and "animals * dogs")
    Alternatively: "dogs ? animals", "dogs ? ? animals", . . .
    Alternatively: fall back to a separate document collection
  Generalize to textual patterns
    (dogs,animals) found in "... dogs and other animals ..." → "<X> and other <Y>"
2 Search for the patterns in the n-gram data → candidate tuples
  "<X> and other <Y>" finds (Zürich,cities) via "Zürich and other cities" and (apples,fruits) via "apples and other fruits"
3 Finally, rank the candidate tuples and choose the output tuples
  Supervised learning based on a labeled set of tuples
  Features for a tuple (x,y): f_i(p(x,y)) for each data source i and pattern p, and Σ_{p∈P} f_i(p(x,y)) for each data source i
  Output: accepted tuples like (Geneva,city)
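The three steps can be sketched end-to-end on a toy in-memory n-gram table. All counts, the frequency-threshold ranking, and the function names are illustrative; the actual system combines several n-gram sources and replaces the threshold in step 3 with a trained classifier:

```python
from collections import defaultdict

# Toy n-gram table: token tuples -> invented counts.
NGRAMS = {
    ("dogs", "and", "other", "animals"): 900,
    ("gold", "and", "other", "metals"): 300,
    ("Zürich", "and", "other", "cities"): 450,
    ("apples", "and", "other", "fruits"): 600,
    ("dogs", "or", "other", "animals"): 120,
}

def collect_patterns(seeds):
    """Step 1: generalize n-grams containing a seed pair (x, y) to patterns."""
    patterns = set()
    for x, y in seeds:
        for ngram in NGRAMS:
            if ngram[0] == x and ngram[-1] == y:
                # Replace the seed words by placeholders.
                patterns.add(("<X>",) + ngram[1:-1] + ("<Y>",))
    return patterns

def match_patterns(patterns):
    """Step 2: scan the n-gram data for pattern matches -> candidate tuples."""
    candidates = defaultdict(int)
    for ngram, freq in NGRAMS.items():
        for p in patterns:
            if len(p) == len(ngram) and p[1:-1] == ngram[1:-1]:
                candidates[(ngram[0], ngram[-1])] += freq
    return candidates

def rank(candidates, threshold=200):
    """Step 3 (simplified): accept tuples with enough supporting frequency.
    The slides instead use supervised learning on labeled tuples."""
    return sorted(t for t, score in candidates.items() if score >= threshold)

seeds = [("dogs", "animals"), ("gold", "metals")]
accepted = rank(match_patterns(collect_patterns(seeds)))
print(accepted)
```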
55. Experiments: Datasets
1 Google Web 1T 5-Gram Corpus
  Contains n-gram statistics for n = 1...5
  Generated from around 10^12 words of text
  Positive: distributed (around 60 GB uncompressed)
  Negative: cut-off frequency 40
2 Microsoft Web N-gram Corpus
  Currently 3- and 4-grams, smoothed language models
  Generated from around 1.4T tokens, the complete English US version of the Bing index
  Also: statistics from titles (12.5G tokens) and anchor texts (357G tokens)
  Accessed via a WSDL-based web service
3 ClueWeb09 5-grams
  500 million web pages, 700M 5-grams
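As a practical note, the distributed Web 1T files store one n-gram per line: space-separated tokens, a tab, then the count (assuming the standard LDC distribution format). A minimal reader, with an invented two-line sample, might look like:

```python
import io

# Two invented sample lines in Web 1T style: "w1 w2 ... wn<TAB>count".
SAMPLE = "cities such as Geneva and\t1234\nZürich and other cities ,\t567\n"

def read_ngrams(stream):
    """Yield (tuple_of_tokens, count) pairs from a Web 1T-style file."""
    for line in stream:
        line = line.rstrip("\n")
        if not line:
            continue
        ngram, count = line.split("\t")
        yield tuple(ngram.split(" ")), int(count)

table = dict(read_ngrams(io.StringIO(SAMPLE)))
print(table[("cities", "such", "as", "Geneva", "and")])  # 1234
```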
56. Experiments: Seeds and Patterns

Relation      Seeds   Patterns discovered
isA           100     2991
partOf        100     3883
hasProperty   100     3175

Seeds from MIT ConceptNet
Even among the highest-ranked: partOf(children,parents) and isA(winning,everything)
57. Experiments: Pattern Examples (isA)

Pattern                              PMI range
<X> and almost any <Y>               high
<X> betting basketball betting <Y>   high
<X> is my favorite <Y>               high
<X> shoes online shoes <Y>           high
<X> is a <Y>                         medium
<X> is the best <Y>                  medium
<X> or any other <Y>                 medium
<X> , and <Y>                        medium
<X> and other smart <Y>              medium
<X> and grammar <Y>                  low
<X> content of the <Y>               low
<X> when it changes <Y>              low
58. Experiments: Pattern Examples (partOf)

Pattern                      PMI range
<X> with the other <Y>       high
<X> of the top <Y>           high
<X> online <Y>               high
<X> shoes online shoes <Y>   high
<X> from the <Y>             medium
<X> or even entire <Y>       medium
<X> of host <Y>              medium
<X> from <Y>                 medium
<X> of a different <Y>       medium
<X> entertainment and <Y>    low
<X> Download for thou <Y>    low
<X> company home in <Y>      low
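The tables above bucket patterns by PMI range. One common way to compute such a score is the pointwise mutual information between "pattern occurs" and "seed tuple occurs"; the formula below is the standard PMI definition, and all counts are invented for illustration:

```python
import math

def pmi(joint, count_p, count_t, total):
    """PMI(p, t) = log( P(p, t) / (P(p) * P(t)) ), estimated from counts."""
    return math.log((joint / total) / ((count_p / total) * (count_t / total)))

# Hypothetical counts: pattern "<X> and other <Y>" co-occurring with the
# seed tuple (dogs, animals) in a corpus of a billion n-grams.
score = pmi(joint=900, count_p=50_000, count_t=4_000, total=1_000_000_000)
print(round(score, 2))  # 8.41
```
A high PMI means the pattern and the seed tuples co-occur far more often than chance, though (as the "high PMI" rows above show) this alone does not guarantee a reliable pattern.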
59. Experiments: Patterns, Microsoft Document Body 3-grams vs. Anchor 3-grams (scatter plot; each point represents the sum of pattern scores for a tuple)
60. Experiments: Patterns, Microsoft Document Body 3-grams vs. Title 3-grams (scatter plot, as above)
61. Experiments: Patterns, Microsoft Document Body 3-grams vs. Google Body 3-grams (scatter plot, as above)
65. Experiments: Overall Results (all data sources simultaneously)
Approach
Learning: RBF-kernel SVMs; also random forests, C4.5, AdaBoost
~500 random labelled examples per relation (matching any of the patterns)
10-fold cross-validation
⇒ Recall is relative to the union of pattern matches
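Before any classifier can be applied, each candidate tuple must be turned into a feature vector. Following the feature definition from the algorithm slide (one score f_i(p(x,y)) per data source i and pattern p, plus a per-source sum over all patterns), a sketch with invented source names, patterns, and scores:

```python
def feature_vector(scores, sources, patterns):
    """Build the feature vector for one candidate tuple (x, y).

    scores maps (source, pattern) -> f_i(p(x, y)); missing entries are 0.
    Layout: per source, one feature per pattern, then the per-source sum.
    """
    vec = []
    for i in sources:
        for p in patterns:
            vec.append(scores.get((i, p), 0.0))
        vec.append(sum(scores.get((i, p), 0.0) for p in patterns))
    return vec

# Hypothetical example with two sources and two patterns:
sources = ["google_body", "msn_anchor"]
patterns = ["<X> and other <Y>", "<Y> such as <X>"]
scores = {("google_body", "<X> and other <Y>"): 3.0,
          ("google_body", "<Y> such as <X>"): 1.0,
          ("msn_anchor", "<X> and other <Y>"): 2.0}
print(feature_vector(scores, sources, patterns))
# [3.0, 1.0, 4.0, 2.0, 0.0, 2.0]
```
Vectors of this shape would then be fed to the RBF-kernel SVM (or the other learners) mentioned above.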
67. Experiments: Overall Results (all data sources simultaneously)

Relation      Precision   Recall   F1      Output per million n-grams¹
isA           88.9%       8.1%     14.8%   983
partOf        80.5%       34.0%    47.8%   7897
hasProperty   75.3%       99.3%    85.6%   26180

¹ the expected number of distinct accepted tuples per million input n-grams (the total number of 5-grams in the Google Web 1T dataset is ~1,176 million)

Linguistic information is implicitly captured via combinations of patterns!
68. Experiments: Detailed Results (partOf relation)

Dataset               Source          Prec.   Recall   F1
Google 3-grams        Document Body   55.9%   38.5%    45.6%
Google 4-grams        Document Body   52.6%   43.3%    47.5%
Google 5-grams        Document Body   48.1%   42.8%    45.3%
ClueWeb 5-grams       Document Body   51.7%   35.6%    42.2%
Google 3-/4-grams     Document Body   53.9%   42.8%    47.7%
Google 3-/4-/5-grams  Document Body   58.7%   43.8%    50.1%

69. Experiments: Detailed Results (partOf relation)

Dataset                Source                                Prec.   Recall   F1
Microsoft 3-grams      Document Body                         58.5%   33.2%    42.3%
Microsoft 3-grams      Document Title                        51.7%   29.8%    37.8%
Microsoft 3-grams      Anchor Text                           57.3%   36.1%    44.2%
Microsoft 3-grams      Body / Title / Anchor                 40.4%   100.0%   57.5%
Google 3-grams         Document Body                         55.9%   38.5%    45.6%
Microsoft 3-/4-grams   Body (3-grams only) / Title / Anchor  40.5%   98.1%    57.3%
Google 3-/4-grams      Document Body                         53.9%   42.8%    47.7%
Google 3-/4-/5-grams   Document Body                         58.7%   43.8%    50.1%
All 3-/4-/5-grams      Body / Title / Anchor                 80.5%   34.0%    47.8%
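The F1 values in these tables follow directly from the precision and recall columns, since F1 is their harmonic mean. A quick sanity check on one row:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# e.g. the "All 3-/4-/5-grams, Body / Title / Anchor" row: P = 80.5%, R = 34.0%
print(round(f1(0.805, 0.340), 3))  # 0.478
```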
75. Conclusion: Lessons Learnt
N-gram datasets allow for information extraction from petabytes of original data
Requirements: short entity names, short patterns
More data helps (even at very large scales)
Diversity of data sources helps