Inaugural lecture at Heinrich-Heine-University Düsseldorf on 28 May 2019.
Abstract:
When searching the Web for information, human knowledge and artificial intelligence are in constant interplay. On the one hand, human online interactions such as click streams, crowd-sourced knowledge graphs, semi-structured web markup or distributional semantic models built from billions of Web documents are informing machine learning and information retrieval models, for instance, as part of the Google search engine. On the other hand, the very same search engines help users in finding relevant documents, facts, or data for particular information needs, thereby helping users to gain knowledge. This talk will give an overview of recent work in both of the aforementioned areas. This includes 1) research on mining structured knowledge graphs of factual knowledge, claims and opinions from heterogeneous Web documents as well as 2) recent work in the field of interactive information retrieval, where supervised models are trained to predict the knowledge (gain) of users during Web search sessions in order to personalise rankings. Both streams of research are converging as part of online platforms and applications to facilitate access to data(sets), information and knowledge.
Data science remains a high-touch activity, especially in life, physical, and social sciences. Data management and manipulation tasks consume too much bandwidth: Specialized tools and technologies are difficult to use together, issues of scale persist despite the Cambrian explosion of big data systems, and public data sources (including the scientific literature itself) suffer curation and quality problems.
Together, these problems motivate a research agenda around “human-data interaction:” understanding and optimizing how people use and share quantitative information.
I’ll describe some of our ongoing work in this area at the University of Washington eScience Institute.
In the context of the Myria project, we're building a big data "polystore" system that can hide the idiosyncrasies of specialized systems behind a common interface without sacrificing performance. In scientific data curation, we are automatically correcting metadata errors in public data repositories with cooperative machine learning approaches. In the Viziometrics project, we are mining patterns of visual information in the scientific literature using machine vision, machine learning, and graph analytics. In the VizDeck and Voyager projects, we are developing automatic visualization recommendation techniques. In graph analytics, we are working on parallelizing best-of-breed graph clustering algorithms to handle multi-billion-edge graphs.
The common thread in these projects is the goal of democratizing data science techniques, especially in the sciences.
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015 (Big Data Spain)
The term 'Data Science' was first described in scientific literature about 15 years ago. It started to become a major trend in industry about 7 years ago.
O'Reilly Media surveys the industry extensively each year. In addition we get a good birds-eye view of industry trends through our conference programs and publications, working closely with some of the best practitioners in Data Science.
By now, the field has evolved far beyond its origins eclipsing an earlier generation of Business Intelligence and Data Warehousing approaches. Data Science is moving up, into the business verticals and government spheres of influence where it has true global impact.
This talk considers Data Science trends from the past three years in particular. What is emerging? Which parts are evolving? Which seem cluttered and poised for consolidation or other change?
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-2.html
Talk given at Los Alamos National Labs in Fall 2015.
As research becomes more data-intensive and platforms become more heterogeneous, we need to shift focus from performance to productivity.
A talk I gave at the MMDS workshop June 2014 on the Myria system as well as some of Seung-Hee Bae's work on scalable graph clustering.
https://mmds-data.org/
Towards Knowledge Graph based Representation, Augmentation and Exploration of... (Sören Auer)
Despite improved digital access to scientific publications in recent decades, the fundamental principles of scholarly communication remain unchanged and continue to be largely document-based. The document-oriented workflows in science have reached the limits of adequacy, as highlighted by recent discussions on the increasing proliferation of scientific literature, the deficiency of peer review and the reproducibility crisis. We need to represent, analyse, augment and exploit scholarly communication in a knowledge-based way by expressing and linking scientific contributions and related artefacts through semantically rich, interlinked knowledge graphs. This should be based on deep semantic representation of scientific contributions, their manual, crowd-sourced and automatic augmentation and finally the intuitive exploration and interaction employing question answering on the resulting scientific knowledge base. We need to synergistically combine automated extraction and augmentation techniques with large-scale collaboration to reach an unprecedented level of knowledge graph breadth and depth. As a result, knowledge-based information flows can facilitate completely new ways of search and exploration. The efficiency and effectiveness of scholarly communication will significantly increase, since ambiguities are reduced, reproducibility is facilitated, redundancy is avoided, provenance and contributions can be better traced and the interconnections of research contributions are made more explicit and transparent. In this talk we will present first steps in this direction in the context of our Open Research Knowledge Graph initiative and the ScienceGRAPH project.
The Challenge of Deeper Knowledge Graphs for Science (Paul Groth)
Over the past 5 years, we have seen multiple successes in the development of knowledge graphs for supporting science in domains ranging from drug discovery to social science. However, in order to really improve scientific productivity, we need to expand and deepen our knowledge graphs. To do so, I believe we need to address two critical challenges: 1) dealing with low resource domains; and 2) improving quality. In this talk, I describe these challenges in detail and discuss some efforts to overcome them through the application of techniques such as unsupervised learning; the use of non-experts in expert domains, and the integration of action-oriented knowledge (i.e. experiments) into knowledge graphs.
From Text to Data to the World: The Future of Knowledge Graphs (Paul Groth)
Keynote Integrative Bioinformatics 2018
https://docs.google.com/document/d/1E7D4_CS0vlldEcEuknXjEnSBZSZCJvbI5w1FdFh-gG4/edit
Can we improve research productivity through providing answers stemming from knowledge graphs? In this presentation, I discuss different ways of building and combining knowledge graphs.
Thoughts on Knowledge Graphs & Deeper Provenance (Paul Groth)
Thinking about the need for deeper provenance for knowledge graphs but also using knowledge graphs to enrich provenance. Presented at https://seminariomirianandres.unirioja.es/sw19/
An invited talk in the Big Data session of the Industrial Research Institute meeting in Seattle Washington.
Some notes on how to train data science talent and exploit the fact that the membrane between academia and industry has become more permeable.
From the webinar presentation "Data Science: Not Just for Big Data", hosted by Kalido and presented by:
David Smith, Data Scientist at Revolution Analytics, and
Gregory Piatetsky, Editor, KDnuggets
These are the slides for David Smith's portion of the presentation.
Watch the full webinar at:
http://www.kalido.com/data-science.htm
Presentation for NEC Lab Europe.
Knowledge graphs are increasingly built using complex, multifaceted machine learning-based systems relying on a wide variety of data sources. To be effective, these must constantly evolve and thus be maintained. I present work on combining knowledge graph construction (e.g. information extraction) and refinement (e.g. link prediction) in end-to-end systems. In particular, I will discuss recent work on using inductive representations for link prediction. I then discuss the challenges of ongoing system maintenance, knowledge graph quality and traceability.
I argue why I think that Computer Science (or better: Informatics) is a "natural science", in the same sense that physics, astronomy, biology, psychology and sociology are natural sciences: they study a part of the world around us. In that same sense, I think Informatics studies a part of the world around us.
For a similar talk (including script), but more aimed at a Semantic Web audience in particular, see http://www.cs.vu.nl/~frankh/spool/ISWC2011Keynote/
(or http://videolectures.net/iswc2011_van_harmelen_universal/ for a video recording)
Bringing Machine Learning and Knowledge Graphs Together
Six Core Aspects of Semantic AI:
- Hybrid Approach
- Data Quality
- Data as a Service
- Structured Data Meets Text
- No Black-box
- Towards Self-optimizing Machines
The IP LodB project (for more details see iplod.io) capitalizes on LOD database thinking to build bridges between patented information and scientific knowledge, whilst focusing on individuals who codify new knowledge and their connected organizations, including those who apply patents in new products and services.
As its main outputs, the IP LodB project produced an intellectual property rights (IPR) linked open data (LOD) map (IP LOD map) and tested the linkability of the European patent (EP) LOD database, whilst increasing the uniqueness of data using different harmonization techniques.
These slides were developed for a NIPO workshop.
The Unreasonable Effectiveness of Metadata (James Hendler)
Invited talk at VIVO 2017 conference - explores the view of the semantic web as enriched metadata, and how that kind of information can be used in new and interesting ways.
AI in between online and offline discourse - and what has ChatGPT to do with ... (Stefan Dietze)
Talk at Bonn University on general AI and NLP challenges in the context of online discourse analysis. Specific focus on challenges arising from the widespread adoption of neural large language models.
Spark Summit Europe: Share and analyse genomic data at scale (Andy Petrella)
Share and analyse genomic data at scale with Spark, Adam, Tachyon & the Spark Notebook
Sharp intro to Genomics data
What are the Challenges
Distributed Machine Learning to the rescue
Projects: Distributed teams
Research: Long process
Towards Maximum Share for efficiency
Understanding Scientific and Societal Adoption and Impact of Science Through ... (Stefan Dietze)
Keynote on analysing scholarly discourse at the Second International Workshop on Semantic Technologies and Deep Learning Models for Scientific, Technical and Legal Data (SemTech4STLD), held on 26 May at ESWC2024.
An interdisciplinary journey with the SAL spaceship – results and challenges ... (Stefan Dietze)
Keynote at the HELMeTO2022 conference, Palermo, Italy, on recent research in Search As Learning (SAL), at the intersection of machine learning and cognitive psychology.
Analysing User Knowledge, Competence and Learning during Online Activities (Stefan Dietze)
Research talk given at the Italian National Research Council (CNR), Institute for Educational Technologies (ITD), on learning analytics in everyday online activities.
Analysing & Improving Learning Resources Markup on the Web (Stefan Dietze)
Talk at WWW2017 on LRMI adoption, quality and usage. Full paper here: http://papers.www2017.com.au.s3-website-ap-southeast-2.amazonaws.com/companion/p283.pdf.
Mining and Understanding Activities and Resources on the Web (Stefan Dietze)
Research seminar at KMRC Tübingen, Germany, on mining and understanding Web activities and resources through knowledge discovery and machine learning approaches.
Semantic Linking & Retrieval for Digital Libraries (Stefan Dietze)
An overview of recent work on entity linking and retrieval in large corpora, specifically bibliographic data. The work addresses both traditional Linked Data and knowledge graphs as well as data extracted from Web markup, such as the Web Data Commons.
Open Education Challenge 2014: exploiting Linked Data in Educational Applicat... (Stefan Dietze)
Presentation from a mentoring event of the Open Education Europa Challenge (http://www.openeducationchallenge.eu/) about using Linked Data in educational applications.
From Web Data to Knowledge: on the Complementarity of Human and Artificial Intelligence
1. Backup
29/05/19 1Stefan Dietze
From (Web) Data to Knowledge: on the Complementarity
of Human and Artificial Intelligence
Prof. Dr. Stefan Dietze
Inaugural Lecture, 28 May 2019
Heinrich-Heine-Universität Düsseldorf
2. Finding “things” on the Web
• Resources
• Facts
• Claims
• Opinions
5. Finding “things” on the Web
• Resources
• Facts
• Claims
• Opinions
We’ll try to use AI to „answer“ that question at the end of the talk.
7. Artificial vs human intelligence: a simplistic Web search perspective
Human/Crowd Intelligence → Artificial Intelligence: „Supervising AI“ with user-generated data & knowledge („making machines smarter“)
Artificial Intelligence → Human/Crowd Intelligence: Facilitating search, retrieval & knowledge gain of users („making humans smarter“)
Artificial Intelligence
• Information retrieval (crawling, indexing, ranking etc.)
• Natural language processing
• (Hyperlink) graph analysis (e.g. PageRank et al.)
• Statistics and (deep) learning from user interactions
o Query interpretation & intent prediction
o Classification of users, documents, queries
o Reranking & personalisation
o ….
8. Overview
Part I
Symbolic & subsymbolic AI on the Web – a brief introduction
Part II
Extracting machine-interpretable knowledge („making machines smarter“)
Part III
Facilitating search, retrieval & knowledge gain of users („making humans smarter“)
9. Symbols, data & knowledge on the Web
Unstructured data: e.g. web pages, user interactions/behavior, clickstreams, sensor data
Machine-interpretable knowledge: e.g. knowledge graphs, Web markup
[Figure: example knowledge graph around dbr:Tim_Berners-Lee. Nodes: dbr:Tim_Berners-Lee, dbo:Person, dbo:Scientist, yago:LegalActor, dbo:Organisation, dbr:MIT, dbr:WWW_Foundation, dbr:Washington_DC, dbr:Jakarta, „Tim Berners-Lee“@en, 1955-06-08^^xsd:date. Edge labels: rdf:type, rdfs:subClassOf, foaf:name, dbo:birthDate, dbo:workplaces, dbo:keyPersonOf, dbo:location]
DBpedia (eng.): 200 million facts
Google KG: 18 billion facts
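To make "machine-interpretable knowledge" concrete, the following minimal sketch stores a few statements as subject-predicate-object triples and looks them up. The pairing of subjects, predicates and objects is an assumption read off the node and edge labels of the example graph, not a verified DBpedia extract, and the helper `objects` is hypothetical rather than part of any RDF library.

```python
# Toy triple store. The concrete triples are ASSUMED from the labels in the
# example graph; they are illustrative and not checked against DBpedia.
TRIPLES = [
    ("dbr:Tim_Berners-Lee", "rdf:type", "dbo:Person"),
    ("dbr:Tim_Berners-Lee", "foaf:name", '"Tim Berners-Lee"@en'),
    ("dbr:Tim_Berners-Lee", "dbo:birthDate", '"1955-06-08"^^xsd:date'),
    ("dbr:Tim_Berners-Lee", "dbo:keyPersonOf", "dbr:WWW_Foundation"),
    ("dbr:WWW_Foundation", "rdf:type", "dbo:Organisation"),
    ("dbo:Scientist", "rdfs:subClassOf", "dbo:Person"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Hypothetical helper: all objects o such that (subject, predicate, o) is stated."""
    return [o for s, p, o in triples if s == subject and p == predicate]


if __name__ == "__main__":
    print(objects("dbr:Tim_Berners-Lee", "foaf:name"))
```

Because every statement has the same three-part shape, a program can answer simple questions by pattern matching alone, which is what separates such data from unstructured page text.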
10. Symbolic vs subsymbolic AI
Symbolic AI
• AI = manipulation and interpretation of symbols (eventually: “knowledge”)
• Top-down: knowledge representation, logics, inference, knowledge graphs
• “strong AI hypothesis” or “Physical Symbol System Hypothesis” (Newell & Simon, 1976), “GOFAI”
Horse ⊓ ¬RockingHorse ⊑ Animal ⊓ ∀(=4)hasLegs
„Intelligence is ten million rules“ (Douglas Lenat, founder of Cyc)
Subsymbolic AI
• AI = emulating/engineering human intelligence, e.g. through cognitive computing (“perceptron”, Frank Rosenblatt 1957)
• Bottom-up: neural networks, machine/deep learning, distributional semantics
• Also called: “weak AI hypothesis” (Russell & Norvig, 1995)
[Figure: layers labelled Knowledge, Information, Data, Symbols]
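As a rough illustration of the top-down, rule-based style, the sketch below applies the horse axiom from the slide by forward chaining over class memberships. It is a simplification under stated assumptions: the axiom is encoded as a plain if-then rule, negation only fires when a class is explicitly known not to hold, the cardinality restriction is reduced to the string `hasLegs=4`, and the second rule (`LivingThing`) is hypothetical, added only to show chaining.

```python
# Each rule: (classes that must hold, classes that must be KNOWN not to hold, conclusions).
RULES = [
    ({"Horse"}, {"RockingHorse"}, {"Animal", "hasLegs=4"}),  # axiom from the slide, simplified
    ({"Animal"}, set(), {"LivingThing"}),                   # hypothetical rule, for chaining only
]


def forward_chain(asserted, refuted, rules=RULES):
    """Derive all classes of one individual by applying rules until nothing changes."""
    known = set(asserted)
    changed = True
    while changed:
        changed = False
        for positive, negative, conclusions in rules:
            if positive <= known and negative <= refuted and not conclusions <= known:
                known |= conclusions
                changed = True
    return known


if __name__ == "__main__":
    print(sorted(forward_chain({"Horse"}, {"RockingHorse"})))
```

Note that an individual known only to be a Horse yields no new conclusions, because nothing states that it is not a rocking horse.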
11. Subsymbolic AI & deep learning for language understanding
Percentage of deep learning papers in major NLP conferences
(Source: Young et al., Recent Trends in Deep Learning Based Natural Language Processing)
• Distributional semantics & embeddings: predicting low-dimensional vector representations of words & text, e.g. Word2Vec [Mikolov et al., 2013]
• Efficient RNN/CNN architectures in encoder/decoder settings (e.g. for machine translation) [Vaswani et al., 2017]
• Pretraining language models for task-specific transfer learning, e.g., BERT - Bidirectional Encoder Representations from Transformers [Devlin et al., 2018]
T. Mikolov et al., Distributed Representations of Words and Phrases and their Compositionality, NIPS (2013)
J. Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
A. Vaswani et al. Attention is all you need, NIPS (2017)
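The idea behind distributional semantics can be sketched in a few lines: words become vectors, and relatedness becomes the cosine of the angle between them. The three-dimensional vectors below are made up for illustration and are not output of Word2Vec or any trained model; `most_similar` is a hypothetical helper.

```python
import math

# Made-up 3-dimensional "embeddings"; real models learn hundreds of dimensions from text.
EMBEDDINGS = {
    "horse": [0.9, 0.1, 0.3],
    "pony": [0.8, 0.2, 0.35],
    "laptop": [0.1, 0.9, 0.2],
}


def cosine(u, v):
    """Cosine similarity of two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings=EMBEDDINGS):
    """Hypothetical helper: the other word whose vector is closest to `word`."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine(embeddings[word], embeddings[w]))


if __name__ == "__main__":
    print(most_similar("horse"))
```

Unlike the symbolic rule for Horse, nothing here states what a horse is; similarity emerges purely from the geometry of the vectors.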
13. Semantics and knowledge: a brief (and incomplete) history
• Deductive reasoning, syllogism & categorisation
(Aristotle, 384 BC – 322 BC)
• Formal logic & calculus ratiocinator (reasoning, symbol manipulation)
(G.W. Leibniz, 1646 – 1716)
• „Begriffsschrift“, technically: predicate logic
(Gottlob Frege, 1848 – 1925)
• Frames for representing stereotyped situations
(Marvin Minsky, 1974)
• Rules & expert systems
• Ontologies
(Leibniz, Kant, Gruber 1994)
• Description Logics
(Baader & Hollunder, 1991 et al.)
• Semantic Web
(Berners-Lee, Hendler, Lassila, 2001)
& Linked Data
& Knowledge Graphs
14. Symbolic & subsymbolic AI: e.g. linking Web documents & KGs
Robust methods for named entity disambiguation (NED), e.g. Ambiverse [Hoffart et al., 2011], Babelfy [Moro et al., 2014], TagMe [Ferragina et al., 2010]
Time- and corpus-specific entity relatedness; prior probabilities and meaning of entities change over time, e.g. “Deutschland” during World Cup [DL4KGS 2018]
Meta-EL: supervised ensemble learner exploiting results of different NED systems [SAC19, CIKM19]
o Considers features of terms, mentions/occurrences, dynamics/temporal drift etc.
o Outperforms individual NED systems across diverse documents/corpora
Problem: “Completeness” & coverage of KGs?
Fafalios, P., Joao, R.S., Dietze, S., Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty, ACM SAC19
Mohapatra, N., Iosifidis, V., Ekbal, A., Dietze, S., Fafalios, P., Time-Aware and Corpus-Specific Entity Relatedness, DL4KGS at ESWC2018.
[Figure label: dbr:Tim_Berners-Lee]
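The ensemble idea can be hinted at with a much simpler stand-in: a weighted vote over the entities proposed by several NED systems for one mention. This is not the Meta-EL method, which is a supervised learner over term, mention and temporal features; here the weights are fixed, hypothetical trust scores, and the system names and the candidate `dbr:Tim_Lee` are placeholders.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Pick the entity with the highest summed weight.

    predictions: system name -> proposed entity URI (or None if the system abstains)
    weights:     system name -> trust score (hypothetical, not learned here)
    """
    scores = defaultdict(float)
    for system, entity in predictions.items():
        if entity is not None:
            scores[entity] += weights.get(system, 1.0)
    if not scores:
        return None
    # Sorting first makes ties resolve deterministically (alphabetical order).
    return max(sorted(scores), key=scores.get)


if __name__ == "__main__":
    proposed = {
        "system_a": "dbr:Tim_Berners-Lee",
        "system_b": "dbr:Tim_Berners-Lee",
        "system_c": "dbr:Tim_Lee",  # placeholder for a wrong candidate
    }
    trust = {"system_a": 0.4, "system_b": 0.3, "system_c": 0.6}
    print(weighted_vote(proposed, trust))
```

A supervised ensemble would replace the fixed trust scores with a model that predicts, per mention, which system is likely to be right.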
15. Overview
Part I
Symbolic & subsymbolic AI on the Web – a brief introduction
Part II
Extracting machine-interpretable knowledge („making machines smarter“)
Part III
Facilitating search, retrieval & knowledge gain of users („making humans smarter“)
16. Knowledge about: facts, claims, stances & opinions on the Web
Facts & claims: <„Tim Berners-Lee“ s:founderOf „Solid“>
Stances, opinions, interactions
17. Mining (long-tail) facts from the Web?
<„Tim Berners-Lee“ s:founderOf „Solid“>
Obtaining verified facts (or knowledge graph) for a given entity?
Application of NLP (e.g. NER, relation extraction) at Web-scale (Google index: 50 trn pages)?
Exploiting entity-centric embedded Web page markup (schema.org), prevalent in roughly 40% of Web pages (44 Bn „facts“ in Common Crawl 2016/3.2 Bn Web pages)
Challenges
o Errors: factual errors, annotation errors (see also [Meusel et al., ESWC2015])
o Ambiguity & coreferences: e.g. 18,000 entity descriptions of “iPhone 6” in Common Crawl 2016 & ambiguous literals (e.g. „Apple“)
o Redundancies & conflicts: vast amounts of equivalent or conflicting statements
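To show what a markup "fact" looks like in practice, the sketch below turns a small schema.org JSON-LD snippet into subject-predicate-object statements using only the standard library. The snippet, the node identifier `node1` and the helper `markup_to_facts` are illustrative assumptions; real markup is often nested, uses Microdata or RDFa as well, and needs a proper parser.

```python
import json

# Hypothetical JSON-LD snippet as it might be embedded in a product page.
MARKUP = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "iPhone 6",
  "brand": "Apple"
}
"""


def markup_to_facts(node, subject="node1"):
    """Flatten one flat JSON-LD object into (subject, predicate, value) statements."""
    facts = []
    for key, value in node.items():
        if key == "@context":
            continue  # vocabulary declaration, not a statement about the entity
        predicate = "rdf:type" if key == "@type" else key
        facts.append((subject, predicate, value))
    return facts


if __name__ == "__main__":
    for fact in markup_to_facts(json.loads(MARKUP)):
        print(fact)
```

The value "Apple" is a plain string rather than a link to an entity, which is exactly the kind of ambiguous literal listed under the challenges above.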
18. KnowMore: data fusion on markup
0. Noise: data cleansing (node URIs, deduplication, etc.)
1.a) Scale: blocking (BM25 entity retrieval) on markup index
1.b) Relevance: supervised coreference resolution
2.) Quality & redundancy: data fusion through supervised fact classification (SVM, kNN, RF, LR, NB) with a diverse
feature set (authority, relevance, etc.), considering source-level (e.g. PageRank), entity-level & fact-level features
Pipeline (from figure): Web crawl (Common Crawl, 44 bn facts) → Web page markup → 1. Blocking & coreference resolution → 2. Fusion / fact selection (supervised)
Query Entity
Brideshead Revisited, type:(Book)
Candidate Facts (approx. 5,000 facts for „Brideshead Revisited“; compare: 125,000 facts for „iPhone6“)
node1 publisher Chapman & Hall
node1 releaseDate 1945
node1 publishDate 1961
node2 country UK
node2 publisher Black Bay Books
node3 country US
node3 copyrightHolder Evelyn Waugh
… …. ….
Entity Description (20 correct/non-redundant facts for „Brideshead Rev.“)
author Evelyn Waugh
priorWork Put Out More Flags
ISBN 978031874803074
copyrightHolder Evelyn Waugh
releaseDate 1945
… …
New Query Entities
BBC Audio, type:(Organization)
Chapman & Hall, type:(Publisher)
Put Out More Flags, type:(Book)
Yu, R., [..], Dietze, S., KnowMore - Knowledge Base
Augmentation with Structured Web Markup, Semantic
Web Journal 2019 (SWJ2019)
Tempelmeier, N., Demidova, E., Dietze, S., Inferring
Missing Categorical Information in Noisy and Sparse
Web Markup, The Web Conf. 2018 (WWW2018)
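The blocking step (1.a) can be illustrated with a minimal sketch. It scores hypothetical markup node descriptions against a query entity with a textbook BM25 formula and keeps only the top-ranked nodes as candidates for the later supervised steps. The node strings, the parameter values `k1` and `b`, and the function names are illustrative assumptions, not the KnowMore implementation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Return (doc_index, score) pairs sorted by descending BM25 score."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many descriptions each term occurs.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def block(query, docs, top_k=3):
    """Keep the top_k nodes with a non-zero score as candidate block."""
    return [i for i, score in bm25_rank(query, docs)[:top_k] if score > 0]


# Hypothetical node descriptions, loosely modelled on the candidate facts above.
markup_nodes = [
    "Brideshead Revisited Book publisher Chapman & Hall",
    "Brideshead Revisited Book publisher Black Bay Books country UK",
    "iPhone 6 Product brand Apple",
    "Put Out More Flags Book author Evelyn Waugh",
]

candidates = block("Brideshead Revisited", markup_nodes)
print(candidates)
```

Blocking of this kind only narrows the search space; whether a retrieved node really describes the query entity is left to the supervised coreference resolution in step 1.b.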
Fusion performance
Baselines: BM25, CBFS [ESWC2015], PreRecCorr [Pochampally
et al., ACM SIGMOD 2014], strong variance across types
Knowledge Graph Augmentation
Experiments on books, movies, products
New facts (wrt DBpedia, Wikidata, Freebase):
On average 60% - 70% of all facts for books & movies new
(across KBs)
100% new facts for long-tail entities (e.g. products)
Additional experiments on learning new categorical features
(e.g. product categories or movie genres) [WWW2018]
19. Beyond facts: claims, opinions and misinformation on the Web
Investigations into misinformation and opinion forming
have received massive attention across a wide range of
disciplines and industries (e.g. [Vosoughi et al. 2018])
Insights, mostly (computational) social sciences, e.g.
o Spreading of claims and misinformation
o Effect of biased and fake news on public opinions
o Reinforcement of biases and echo chambers
Methods, mostly in computer science, e.g. for
o Claim/fact detection and verification („fake news
detection“), e.g. CLEF 2018 Fact Checking Lab
(http://alt.qcri.org/clef2018-factcheck/)
o Stance detection, e.g. Fake News Challenge (FNC)
http://www.fakenewschallenge.org/
Some recent work
o Large-scale public research corpora for
replicating/improving methods/insights
o TweetsKB: 9 Bn annotated tweets
o ClaimsKG: 30 K annotated claims & truth ratings
o ML models for stance detection of Web documents
(towards given claims)
20. Stance detection of Web documents
Motivation
Problem: detecting stance of documents (Web pages)
towards a given claim (unbalanced class distribution)
Motivation: stance of documents (in particular
disagreement) useful (a) as signal for fake news
detection and (b) Website classification
Approach
Cascading binary classifiers: addressing individual
issues (e.g. misclassification costs) per step
Features, e.g. textual similarity (Word2Vec etc),
sentiments, LIWC, etc.
Best-performing models per stage: 1) SVM with class-wise
penalty, 2) CNN, 3) SVM with class-wise penalty
Experiments on FNC-1 dataset (and FNC baselines)
Results
Minor overall performance improvement
Improvement on disagree class by 27%
(but still far from robust)
A. Roy, A. Ekbal, S. Dietze, P. Fafalios, Step-by-Step: A three-
stage Pipeline for Stance Classification of Documents
towards Claims, CIKM19 under review.
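The cascading idea can be sketched as three binary decisions applied in sequence, each of which could be tuned separately (e.g. with its own misclassification costs). The four labels are those of the FNC-1 dataset. The order of the stages (relatedness, then stance vs. discussion, then agree vs. disagree) and the keyword-based stage functions are assumptions for illustration only; in the approach described above each stage would be a trained classifier such as an SVM with class-wise penalty or a CNN.

```python
from typing import Callable

# A stage is any binary decision over (claim, document).
Stage = Callable[[str, str], bool]


def cascade_stance(claim: str, document: str,
                   is_related: Stage, takes_stance: Stage, agrees: Stage) -> str:
    """Run three binary stages in sequence and return an FNC-1 label."""
    if not is_related(claim, document):
        return "unrelated"
    if not takes_stance(claim, document):
        return "discuss"
    return "agree" if agrees(claim, document) else "disagree"


# Hypothetical placeholder stages; real stages would be trained models.
DENIAL_CUES = {"false", "hoax", "debunked", "denies"}
SUPPORT_CUES = {"confirms", "confirmed", "true"}


def toy_is_related(claim: str, document: str) -> bool:
    shared = set(claim.lower().split()) & set(document.lower().split())
    return len(shared) >= 2


def toy_takes_stance(claim: str, document: str) -> bool:
    return bool(set(document.lower().split()) & (DENIAL_CUES | SUPPORT_CUES))


def toy_agrees(claim: str, document: str) -> bool:
    return not (set(document.lower().split()) & DENIAL_CUES)


def classify(claim: str, document: str) -> str:
    return cascade_stance(claim, document,
                          toy_is_related, toy_takes_stance, toy_agrees)


claim = "tim berners-lee founded solid"
for doc in [
    "weather report for tomorrow",
    "article about tim berners-lee and solid",
    "report confirms tim berners-lee founded solid",
    "claim that tim berners-lee founded solid is a hoax",
]:
    print(classify(claim, doc), "<-", doc)
```

Splitting the problem this way lets the rare disagree class be handled in a dedicated final stage instead of competing with the dominant unrelated class in a single four-way classifier.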
21. Mining opinions & interactions (the case of Twitter)
[Figure: RDF annotation example: http://dbpedia.org/resource/Tim_Berners-Lee, wna:positive-emotion, onyx:hasEmotionIntensity "0.75" and "0.0"]
Heterogeneity: multimodal, multilingual, informal,
“noisy” language
Context dependence: interpretation of
tweets/posts (entities, sentiments) requires
consideration of context (e.g. time, linked
content), “Dusseldorf” => City or Football team
Dynamics & scale: e.g. 6000 tweets per second,
plus interactions (retweets etc) and context (e.g.
25% of tweets contain URLs)
Evolution and temporal aspects: evolution of
interactions over time crucial for many social
sciences questions
Representativity and bias: demographic
distributions not known a priori in archived data
collections
[Figure: RDF annotation example: http://dbpedia.org/resource/Solid, wna:negative-emotion]
P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public
and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
22. Mining knowledge about opinions & interactions: TweetsKB
P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public
and Large-Scale RDF Corpus of Annotated Tweets, ESWC'18.
http://l3s.de/tweetsKB
Harvesting & archiving of 9 Bn tweets over 5 years
(permanent collection from Twitter 1% sample since
2013)
Information extraction pipeline (distributed via Hadoop
Map/Reduce)
o Entity linking with knowledge graph/DBpedia
(Yahoo's FEL [Blanco et al. 2015])
(“president”/“potus”/“trump” =>
dbp:DonaldTrump), to disambiguate text and use
background knowledge (e.g. US politicians?
Republicans?), high precision (.85), low recall (.39)
o Sentiment analysis/annotation using SentiStrength
[Thelwall et al., 2012], F1 approx. .80
o Extraction of metadata and lifting into established
schemas (SIOC, schema.org), publication using W3C
standards (RDF/SPARQL)
Use cases
Aggregating sentiments towards topics/entities, e.g. about
CDU vs SPD politicians in particular time period
Temporal analytics: evolution of popularity of entities/topics
over time (e.g. for detecting events or trends, such as rise of
populist parties)
Twitter archives as general corpus for understanding temporal
entity relatedness (e.g. “austerity” & “Greece” 2010-2015)
Limitations
Bias & representativity: demographic distributions of users
(not known a priori and not representative)
Cf. use case at the end of the talk
[Figure: chart comparing Cologne and Düsseldorf; value axis from -0.4 to 0.4]
23. Overview
Part I
Symbolic & subsymbolic AI on the Web – a brief introduction
Part II
Extracting machine-interpretable knowledge („making machines smarter“)
Part III
Facilitating search, retrieval & knowledge gain of users („making humans smarter“)
24. Knowledge (gain) while searching the Web (“Search As Learning”)?
Challenges & results
Detecting coherent search missions?
Detecting learning throughout search?
detecting “informational” search missions (as
opposed to “transactional” or “navigational”
missions [Broder, 2002])
o Search mission classification with average F1
score 75%
How competent is the user? –
Predict/understand knowledge state of users
based on in-session behavior/interactions
How well does a user achieve his/her learning
goal/information need? - Predict knowledge gain
throughout search missions
o Correlation of user behavior (queries,
browsing, mouse traces, etc) & user
knowledge gain/state in search [CHIIR18]
o Prediction of knowledge gain/state through
supervised models [SIGIR18]
25. Understanding knowledge gain/state of user during search?
Data collection
Crowdsourced collection of search session data
10 search topics (e.g. “Altitude sickness”, “Tornados”), incl. pre-
and post-tests
Approx. 1000 distinct crowd workers & 100 sessions per topic
Tracking of user behavior through 76 features in 5 categories
(session, query, SERP – search engine result page, browsing,
mouse traces)
Some results
70% of users exhibited a knowledge gain (KG)
Negative relationship between KG of users and topic popularity
(avg. accuracy of workers in knowledge tests) (R= -.87)
Amount of time users actively spent on web pages explains 7%
of the variance in their KG
Query complexity explains 25% of the variance in the KG of users
Topic-dependent behavior: search behavior correlates more strongly
with search topic than with KG/KS
Gadiraju, U., Yu, R., Dietze, S., Holtz, P., Analyzing
Knowledge Gain of Users in Informational Search
Sessions on the Web. ACM CHIIR 2018.
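The correlation and "variance explained" figures above can be related with a short sketch. For a simple linear relationship between one behavioral feature and knowledge gain, the share of variance explained equals the squared Pearson correlation, so 25% corresponds to |r| = 0.5 and R = -.87 to roughly 76%. Whether the reported percentages were obtained exactly this way is an assumption, and the data below is made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def variance_explained(xs, ys):
    """Squared correlation, i.e. R^2 of a simple linear fit of ys on xs."""
    return pearson_r(xs, ys) ** 2


# Hypothetical per-user values: a behavioral feature and the knowledge gain.
query_complexity = [1, 2, 3, 4]
knowledge_gain = [1, 3, 2, 4]

r = pearson_r(query_complexity, knowledge_gain)
print(round(r, 2), round(variance_explained(query_complexity, knowledge_gain), 2))
```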
26. Predicting knowledge gain/state of user during search?
Stratification into classes: user knowledge state (KS) and
knowledge gain (KG) into {low, moderate, high} using
(low < (mean ± 0.5 SD) < high)
Supervised multiclass classification (Naive Bayes, Logistic
regression, SVM, random forest, multilayer perceptron)
KG prediction performance results (after 10-fold cross-validation)
Feature importance (KG prediction)
Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P., Dietze, S.,
Predicting User Knowledge Gain in Informational Search
Sessions. ACM SIGIR 2018.
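The stratification rule (low < mean ± 0.5 SD < high) can be sketched as follows. Scores below mean - 0.5 SD are labelled low, scores above mean + 0.5 SD high, and everything in between moderate. The use of the population standard deviation and the example scores are assumptions; the resulting labels would then serve as targets for the multiclass classifiers.

```python
from statistics import mean, pstdev


def stratify(scores, width=0.5):
    """Map each score to 'low', 'moderate' or 'high' around mean +/- width * SD."""
    mu = mean(scores)
    sd = pstdev(scores)  # assumption: population SD; a sample SD would also be plausible
    lower = mu - width * sd
    upper = mu + width * sd
    labels = []
    for score in scores:
        if score < lower:
            labels.append("low")
        elif score > upper:
            labels.append("high")
        else:
            labels.append("moderate")
    return labels


# Hypothetical knowledge-gain scores (post-test minus pre-test).
kg_scores = [0, 2, 4, 6, 8]
kg_classes = stratify(kg_scores)
print(list(zip(kg_scores, kg_classes)))
```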
27. Predicting knowledge gain/state of user during search?
Shortcomings & future work
Lab studies to obtain more reliable data (controlled
environment, longer sessions) & additional features (eye-
tracking)
Resource features (complexity, analytic/emotional
language, multimodality etc) as additional signals
[CIKM2019, under review]
Improving ranking/retrieval in Web search or other
archives
(SALIENT project, Leibniz Cooperative Excellence)
28. Applications: social sciences research data on the Web
Improving findability of
(social science) research data
Mining novel (social science)
research data from the Web
http://l3s.de/tweetsKB
https://data.gesis.org/claimskg
29. Finally: can we use AI & the Web to answer THE question?
30. Recap: “Web-mined opinions” in TweetsKB
http://l3s.de/tweetsKB
[Figure: RDF annotation example: http://dbpedia.org/resource/Tim_Berners-Lee, http://dbpedia.org/resource/Solid, wna:positive-emotion, wna:negative-emotion, onyx:hasEmotionIntensity "0.75" and "0.0"]
P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze,
TweetsKB: A Public and Large-Scale RDF Corpus of
Annotated Tweets, ESWC'18.
Total # tweets mentioning Cologne (K) or Düsseldorf (D) in 1.5 bn tweets:
• # dbp:Cologne: 89,564
• # dbp:Dusseldorf: 4,723
• Opinions in terms of expressed sentiments?
• „Happiness (X) = mean of sentiment score
delta (positive - negative) of all Tweets
mentioning X“
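The happiness definition above translates directly into a few lines of code. The sketch assumes that each tweet record carries a set of linked entities plus a positive and a negative sentiment score; the record layout, the field names and the example values are hypothetical and do not reflect the actual TweetsKB schema.

```python
from statistics import mean


def happiness(tweets, entity):
    """Mean of (positive - negative) over all tweets mentioning the entity."""
    deltas = [t["positive"] - t["negative"]
              for t in tweets if entity in t["entities"]]
    # No mentions means the score is undefined rather than zero.
    return mean(deltas) if deltas else None


# Hypothetical annotated tweets.
tweets = [
    {"entities": {"dbp:Cologne"}, "positive": 0.75, "negative": 0.0},
    {"entities": {"dbp:Cologne"}, "positive": 0.25, "negative": 0.5},
    {"entities": {"dbp:Dusseldorf"}, "positive": 0.5, "negative": 0.0},
]

for city in ("dbp:Cologne", "dbp:Dusseldorf"):
    print(city, happiness(tweets, city))
```

Because the mean is taken per entity, a large difference in the number of mentions (as between the two cities above) does not by itself change the score, although a small sample makes it less stable.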
32. Acknowledgements
Co-authors
• Katarina Boland (GESIS, Germany)
• Elena Demidova (L3S, Germany)
• Asif Ekbal (IIT Patna, India)
• Pavlos Fafalios (L3S, Germany)
• Ujwal Gadiraju (L3S, Germany)
• Peter Holtz (IWM, Germany)
• Eirini Ntoutsi (LUH, Germany)
• Vasilis Iosifidis (L3S, Germany)
• Markus Rokicki (L3S, Germany)
• Arjun Roy (IIT Patna, India)
• Renato Stoffalette Joao (L3S, Germany)
• Davide Taibi (CNR, ITD, Italy)
• Nicolas Tempelmeier (L3S, Germany)
• Konstantin Todorov (LIRMM, France)
• Ran Yu (GESIS, Germany)
• Benjamin Zapilko (GESIS, Germany)
33. From (Web) Data to Knowledge: on the Complementarity
of Human and Artificial Intelligence
Prof. Dr. Stefan Dietze
Heinrich-Heine-Universität Düsseldorf
GESIS Leibniz Institute for the Social Sciences