Big Data is more than just hype. The vast quantities of data now available have led to two important challenges that are fundamentally changing the way we develop data-intensive systems. The first is at the data management level, where we are finally moving beyond vanilla MapReduce towards infrastructure that allows for more flexible data processing pipelines. The second challenge is transitioning from quantity to quality and distilling genuine knowledge from the raw data. For this, we still need innovative algorithms that facilitate data cleaning, unsupervised and semi-supervised learning, knowledge harvesting, and knowledge integration. Examples include data integration, large-scale knowledge bases such as UWN/MENTA, and collections of commonsense knowledge such as WebChild.
1. From Big Data to
Valuable Knowledge
Gerard de Melo, Tsinghua University
http://gerard.demelo.org
2. 25 Years of the World Wide Web:
1989−2014
http://geekcom.wordpress.com/2009/03/19/
Tim Berners-Lee
3. Big Data on the Web
Theological Hall, Strahov Monastery Library, Prague
4. Main Challenge So Far: Scale
Matej Kren: Idiom. Prague Municipal Library https://www.flickr.com/photos/ill-padrino/6437837857/
7. Developing for Scalability
Apache Spark, Twitter's Scalding

import com.twitter.scalding._

// Word count as a Scalding job (fields-based API)
class WordCountJob(args : Args) extends Job(args) {
  TextLine(args("input"))
    // split each line on whitespace, emitting one 'word per token
    .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
    // count the occurrences of each word
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
8. Knowledge Organization
Image: http://commons.wikimedia.org/wiki/File:Mundaneum_Tir%C3%A4ng_Karteikaarten.jpg
Universal Bibliographic Repertory
(Repertoire Bibliographique Universel, RBU)
by Paul Otlet and Henri La Fontaine in 1895
index cards with answers to queries
9. Knowledge Organization
Image: Mundaneum
Universal Bibliographic Repertory
(Repertoire Bibliographique Universel, RBU)
by Paul Otlet and Henri La Fontaine in 1895
index cards with answers to queries
Alex Wright: This was a sort of
“analog search engine”
10. Current Challenge:
Knowledge Organization
Alexandre Duret-Lutz
https://www.flickr.com/photos/gadl/110845690/
11. 25 Years of the World Wide Web:
1989−2014
HyperText
(the “HT” in
“HTML”)
Basic Idea:
Connecting Data
http://geekcom.wordpress.com/2009/03/19/
Tim Berners-Lee
12. 25 Years of the World Wide Web:
1989−2014
Source: Ivan Herman. Introduction to Semantic Web Technologies
Data really
needs to be more
connected!
13. The Web of Data:
Linked Data
14. The Web of Data:
Lexvo.org
Semantic Web Journal 2014
Interdisciplinary Work, e.g. in Digital Humanities
18. Entity Integration:
Challenges
One bad link is enough to make a connected component inconsistent
ACL 2010
AAAI 2013
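The effect of a single bad link can be illustrated with a small sketch. It assumes that equivalence links are undirected, that a connected component is read as "all of these pages denote one entity", and that a component counts as inconsistent when it contains two pages known to be distinct. The page names, the links, and the distinctness pairs are invented for illustration and are not taken from the talk or the cited papers.

```python
from collections import defaultdict

def find(parent, x):
    # Path-halving find for a dict-based union-find structure.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def components(nodes, links):
    """Group nodes into the connected components induced by equivalence links."""
    parent = {n: n for n in nodes}
    for a, b in links:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
    groups = defaultdict(set)
    for n in nodes:
        groups[find(parent, n)].add(n)
    return list(groups.values())

def inconsistent(component, distinct_pairs):
    """A component is inconsistent if it contains a pair known to be distinct."""
    return any(a in component and b in component for a, b in distinct_pairs)

# Hypothetical pages from three sources, named "source:title".
nodes = ["en:Paris", "fr:Paris", "de:Paris", "en:Paris_Hilton", "fr:Paris_Hilton"]
good_links = [
    ("en:Paris", "fr:Paris"),
    ("fr:Paris", "de:Paris"),
    ("en:Paris_Hilton", "fr:Paris_Hilton"),
]
# Assumption: two different pages within one source denote different entities.
distinct_pairs = [
    ("en:Paris", "en:Paris_Hilton"),
    ("fr:Paris", "fr:Paris_Hilton"),
]

clean = components(nodes, good_links)
# Adding one wrong cross-source link merges both entities into one component.
bad = components(nodes, good_links + [("de:Paris", "fr:Paris_Hilton")])

clean_flags = [inconsistent(c, distinct_pairs) for c in clean]
bad_flags = [inconsistent(c, distinct_pairs) for c in bad]
```

With only the good links there are two consistent components. After the single wrong link is added, all five pages fall into one component that violates both distinctness assumptions.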
19. Entity Integration
Min. cost solution:
NP-hard
APX-hard
Our Solution:
Use Linear Program and then apply region growing techniques
→ Logarithmic Approximation Guarantee
ACL 2010
AAAI 2013
21. Taxonomic Integration:
MENTA Approach
De Melo & Weikum (2010).
CIKM Best Interdisciplinary Paper Award
22. Taxonomic Integration:
MENTA Approach
De Melo & Weikum (2010).
CIKM Best Interdisciplinary Paper Award
23. Taxonomic Integration:
MENTA Approach
De Melo & Weikum (2010).
CIKM Best Interdisciplinary Paper Award
24. Taxonomic Integration:
MENTA Approach
De Melo & Weikum (2010).
CIKM Best Interdisciplinary Paper Award
25. UWN/MENTA
Multilingual extension of WordNet covering
word senses and taxonomic information in over 200 languages
Gerard de Melo
26. Relation Extraction
Images: Denilson Barbosa, Haixun Wang, Cong Yu. Shallow Information Extraction for the Knowledge Web
Scaling Up:
Tandon, de Melo & Weikum.
AAAI 2011, COLING 2012
27. Relation Integration
Equivalent:
MetaWeb was acquired by Google.
MetaWeb was just recently acquired by Google.
MetaWeb, surprisingly, was acquired by Google.
MetaWeb was bought out by Google.
Google bought MetaWeb.
Google acquired MetaWeb.
MetaWeb was sold to Google.
Google's acquisition of MetaWeb.
Google's MetaWeb acquisition.
and so on...
28. Relation Integration
Underlying frame:
Commercial transfer
● Capture the “who-did-what-to-whom”
● Microsoft bought the patent from Nokia.
  Nokia sold the patent to Microsoft.
  The patent was acquired by Microsoft [from Nokia].
  The patent was sold [by Nokia] to Microsoft.
Buyer: Microsoft
Seller: Nokia
Product: The patent
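How different surface forms can be reduced to one frame is sketched below. This is a toy illustration only: the hand-written regular expressions, the `PATTERNS` list, and the `extract_frame` helper are assumptions made for this example, not the method used in the talk. Real systems would rely on parsing and learned patterns rather than a handful of regexes.

```python
import re

ROLES = ("buyer", "seller", "product")

# Hypothetical surface patterns for a "commercial transfer" frame.
# Named groups correspond to frame roles. Passive patterns come first so that
# "X was sold ..." is not misread by the active "sold" pattern.
PATTERNS = [
    re.compile(r"^(?P<product>.+?) was acquired by (?P<buyer>.+?)"
               r"(?: from (?P<seller>.+?))?\.$"),
    re.compile(r"^(?P<product>.+?) was sold(?: by (?P<seller>.+?))?"
               r" to (?P<buyer>.+?)\.$"),
    re.compile(r"^(?P<buyer>.+?) bought (?P<product>.+?) from (?P<seller>.+?)\.$"),
    re.compile(r"^(?P<seller>.+?) sold (?P<product>.+?) to (?P<buyer>.+?)\.$"),
]

def extract_frame(sentence):
    """Return the frame roles for the first matching pattern, else None."""
    for pattern in PATTERNS:
        match = pattern.match(sentence.strip())
        if match:
            groups = match.groupdict()
            frame = {role: groups.get(role) for role in ROLES}
            # Crude normalisation so "The patent" and "the patent" compare equal.
            frame["product"] = frame["product"].lower()
            return frame
    return None

sentences = [
    "Microsoft bought the patent from Nokia.",
    "Nokia sold the patent to Microsoft.",
    "The patent was acquired by Microsoft from Nokia.",
    "The patent was sold by Nokia to Microsoft.",
]
frames = [extract_frame(s) for s in sentences]
```

All four sentences yield the same role assignment (buyer Microsoft, seller Nokia, product "the patent"). When the optional phrase is left out, as in "The patent was sold to Microsoft.", the seller role is simply left unfilled.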
31. Relation Integration
YAGO: isMarriedTo predicate
Freebase: Marriage Entity
Challenge:
Modelling
Differences
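The modelling difference can be made concrete with a small sketch. It assumes a simplified triple representation. The identifiers `m1`, `Alice`, `Bob` and the property names `type`, `spouse`, `from` are placeholders invented for this example and do not reproduce the actual Freebase or YAGO schemas. Only the predicate name `isMarriedTo` comes from the slide.

```python
# Reified modelling (Freebase-style): the marriage is itself a node that
# carries its own properties. All identifiers here are made up.
reified = [
    ("m1", "type", "Marriage"),
    ("m1", "spouse", "Alice"),
    ("m1", "spouse", "Bob"),
    ("m1", "from", "2001"),
]

def flatten_marriages(triples):
    """Derive symmetric binary isMarriedTo facts from reified Marriage nodes."""
    marriages = {s for s, p, o in triples if p == "type" and o == "Marriage"}
    spouses = {}
    for s, p, o in triples:
        if s in marriages and p == "spouse":
            spouses.setdefault(s, []).append(o)
    facts = set()
    for people in spouses.values():
        for a in people:
            for b in people:
                if a != b:
                    facts.add((a, "isMarriedTo", b))
    return facts

# Binary modelling (YAGO-style): one predicate directly between the two people.
binary = flatten_marriages(reified)
```

Going from the reified form to the binary predicate is mechanical, but the start date attached to the marriage node has no place in the plain binary fact. Losses of this kind are what make integrating the two models non-trivial.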
32. Search Interfaces
“Which companies were created during the
last century in Silicon Valley?”
YAGO2:
WWW 2011
Best Demo Award
Gerard de Melo
33. Real Understanding?
Knowledge Bases keep growing, but
much of the Web is still not truly understood
34. Real Understanding?
Source: CMU NELL Browser 2015-03-17
Over 4000
countries
with >90%
confidence
Noisy
Patterns
35. Future Challenge:
Real Understanding
Voynich Manuscript, early 15th century
36. From Big Data to Knowledge
Image:
Brett Ryder
40. Learning Common-Sense
Gerard de Melo
I'm cold.
Warm coffee and tea are available at
Costa Coffee just around the corner.
But don't forget your meeting with
Linda in half an hour!
42. WebChild: Learning
Common-Sense From Big Data
WebChild
AAAI 2014
WSDM 2014
AAAI 2011
43. Future: Learning Advanced
Common-Sense Knowledge?
Why do you think Mary put on the
ring at the end of the movie?
Yes, that was a powerful scene. The fact
that she put it on after reading the
letter from her mother indicates
that she may have changed
her mind about the value of ...
44. Summary
Big Data is radically changing the world
Main Challenge in the Past: Scale
Main Current Challenge: Organization
1. Entity Integration
2. Taxonomic Integration
3. Relation Extraction and Integration
Main Future Challenge: Real Understanding
by learning from weak signals