From Big Data to Valuable Knowledge

Big Data is more than just hype. The vast quantities of data now available have led to two important challenges that are fundamentally changing the way we develop data-intensive systems. The first is at the data management level, where we are finally moving beyond vanilla MapReduce towards infrastructure that allows for more flexible data processing pipelines. The second challenge is transitioning from quantity to quality and distilling genuine knowledge from the raw data. For this, we still need innovative algorithms that facilitate data cleaning, unsupervised and semi-supervised learning, knowledge harvesting, and knowledge integration. Examples include data integration, large-scale knowledge bases such as UWN/MENTA, and collections of commonsense knowledge such as WebChild.

Published in: Technology

  1. From Big Data to Valuable Knowledge. Gerard de Melo, Tsinghua University. http://gerard.demelo.org
  2. 25 Years of the World Wide Web: 1989−2014. Tim Berners-Lee (http://geekcom.wordpress.com/2009/03/19/)
  3. Big Data on the Web. Theological Hall, Strahov Monastery Library, Prague
  4. Main Challenge So Far: Scale. Matej Kren: Idiom. Prague Municipal Library. https://www.flickr.com/photos/ill-padrino/6437837857/
  5. Developing for Scalability
  6. Developing for Scalability: the official Hadoop WordCount v1.0, excluding imports and the improvements in WordCount v2.0
  7. Developing for Scalability: Apache Spark, Twitter's Scalding. Word count in Scalding:
       import com.twitter.scalding._
       // Counts how often each whitespace-separated word occurs in the input file.
       class WordCountJob(args: Args) extends Job(args) {
         TextLine(args("input"))
           .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
           .groupBy('word) { _.size }
           .write(Tsv(args("output")))
       }
  8. Knowledge Organization. Universal Bibliographic Repertory (Repertoire Bibliographique Universel, RBU) by Paul Otlet and Henri La Fontaine in 1895: index cards with answers to queries. Image: http://commons.wikimedia.org/wiki/File:Mundaneum_Tir%C3%A4ng_Karteikaarten.jpg
  9. Knowledge Organization. Universal Bibliographic Repertory (Repertoire Bibliographique Universel, RBU) by Paul Otlet and Henri La Fontaine in 1895: index cards with answers to queries. Alex Wright: This was a sort of “analog search engine”. Image: Mundaneum
  10. Current Challenge: Knowledge Organization. Alexandre Duret-Lutz, https://www.flickr.com/photos/gadl/110845690/
  11. 25 Years of the World Wide Web: 1989−2014. HyperText (the “HT” in “HTML”). Basic Idea: Connecting Data. Tim Berners-Lee (http://geekcom.wordpress.com/2009/03/19/)
  12. 25 Years of the World Wide Web: 1989−2014. Data really needs to be more connected! Source: Ivan Herman. Introduction to Semantic Web Technologies
  13. The Web of Data: Linked Data
  14. The Web of Data: Lexvo.org. Semantic Web Journal 2014. Interdisciplinary Work, e.g. in Digital Humanities
  15. Entity Integration: Challenges. Source: Peter Mika
  16. Entity Integration: Challenges
  17. Entity Integration: Challenges. ACL 2010, AAAI 2013
  18. Entity Integration: Challenges. One bad link is enough to make a connected component inconsistent. ACL 2010, AAAI 2013
  19. Entity Integration. Min. cost solution: NP-hard, APX-hard. Our Solution: Use Linear Program and then apply region growing techniques → Logarithmic Approximation Guarantee. ACL 2010, AAAI 2013
  20. Taxonomic Links: a user wants a list of “Art Schools in Europe”
  21. Taxonomic Integration: MENTA Approach. De Melo & Weikum (2010). CIKM Best Interdisciplinary Paper Award
  22. Taxonomic Integration: MENTA Approach. De Melo & Weikum (2010). CIKM Best Interdisciplinary Paper Award
  23. Taxonomic Integration: MENTA Approach. De Melo & Weikum (2010). CIKM Best Interdisciplinary Paper Award
  24. Taxonomic Integration: MENTA Approach. De Melo & Weikum (2010). CIKM Best Interdisciplinary Paper Award
  25. UWN/MENTA: multilingual extension of WordNet for word senses and taxonomical information over 200 languages
  26. Relation Extraction. Images: Denilson Barbosa, Haixun Wang, Cong Yu. Shallow Information Extraction for the Knowledge Web. Scaling Up: Tandon, de Melo & Weikum. AAAI 2011, COLING 2012
  27. Relation Integration. Equivalent: MetaWeb was acquired by Google. MetaWeb was just recently acquired by Google. MetaWeb, surprisingly, was acquired by Google. MetaWeb was bought out by Google. Google bought MetaWeb. Google acquired MetaWeb. MetaWeb was sold to Google. Google's acquisition of MetaWeb. Google's MetaWeb acquisition. and so on...
  28. Relation Integration. Underlying frame: Commercial transfer. Capture the “who-did-what-to-whom”: Microsoft bought the patent from Nokia. Nokia sold the patent to Microsoft. The patent was acquired by Microsoft [from Nokia]. The patent was sold [by Nokia] to Microsoft. Buyer: Microsoft; Seller: Nokia; Product: The patent
  29. Relation Integration: FrameBase.org. Bringing knowledge into a standard form based on natural language (FrameNet)
  30. Relation Integration. X isAuthorOf Y; Y writtenBy X; X wrote Y; Y writtenInYear Z
  31. Relation Integration. Challenge: Modelling Differences. YAGO: isMarriedTo predicate. Freebase: Marriage Entity
  32. Search Interfaces. “Which companies were created during the last century in Silicon Valley?” YAGO2: WWW 2011 Best Demo Award
  33. Real Understanding? Knowledge Bases keep growing, but much of the Web is still not truly understood
  34. Real Understanding? Noisy Patterns: over 4000 countries with >90% confidence. Source: CMU NELL Browser 2015-03-17
  35. Future Challenge: Real Understanding. Voynich Manuscript, early 15th century
  36. From Big Data to Knowledge. Image: Brett Ryder
  37. Machine Learning. Diagram labels: Examples; Probably Incorrect!; Learning; Prediction; Classifier Model; Incorrect; Correct
  38. Better Machine Learning. Diagram labels: Examples; Probably Incorrect!; Learning; Prediction; Classifier Model; Incorrect; Correct; Better Model! + Better Labels for Test Data
  39. Conversation. Always there to answer questions
  40. Learning Common-Sense. “I'm cold.” “Warm coffee and tea are available at Costa Coffee just around the corner. But don't forget your meeting with Linda in half an hour!”
  41. Learning Common-Sense: From Big Data?
  42. WebChild: Learning Common-Sense From Big Data. AAAI 2014, WSDM 2014, AAAI 2011
  43. Future: Learning Advanced Common-Sense Knowledge? “Why do you think Mary put on the ring at the end of the movie?” “Yes, that was a powerful scene. The fact that she put it on after reading the letter from her mother indicates that she may have changed her mind about the value of ...”
  44. Summary. Big Data is radically changing the world. Main Challenge in the Past: Scale. Main Current Challenge: Organization (1. Entity Integration, 2. Taxonomic Integration, 3. Relation Extraction and Integration). Main Future Challenge: Real Understanding by learning from weak signals
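
The entity integration slides note that one bad link is enough to make a connected component inconsistent. A minimal sketch of that observation follows. It assumes, purely for illustration, that each entity is a (source, id) pair, that every link asserts equivalence, and that two different entities from the same source are known to be distinct. The function name `find_inconsistent_components` and the sample links are hypothetical and are not taken from the ACL 2010 or AAAI 2013 papers. The sketch only detects inconsistent components; it does not attempt the linear-program and region-growing repair mentioned in the talk.

```python
from collections import defaultdict


def find_inconsistent_components(links):
    """Group entities by equivalence links and return the components that
    contain two different entities from the same source.

    `links` is an iterable of ((source, id), (source, id)) pairs.
    Assumption (not from the slides): entities from the same source are
    distinct, so they should never share a connected component.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components of the two endpoints of every link.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the members of each component.
    components = defaultdict(set)
    for entity in list(parent):
        components[find(entity)].add(entity)

    # A component is inconsistent if some source occurs more than once.
    inconsistent = []
    for members in components.values():
        sources = [source for source, _ in members]
        if len(sources) != len(set(sources)):
            inconsistent.append(members)
    return inconsistent


# Hypothetical sample data: three sources linking the same city.
good_links = [
    (("en", "Rome"), ("de", "Rom")),
    (("de", "Rom"), ("fr", "Rome")),
]
# A single wrong link pulls a second "en" entity into the component.
bad_link = (("fr", "Rome"), ("en", "Rome, Georgia"))

consistent_result = find_inconsistent_components(good_links)
inconsistent_result = find_inconsistent_components(good_links + [bad_link])
```

With only the good links, no component is reported. Adding the single bad link merges all four entities into one component that now holds two distinct "en" entities, so the whole component is flagged.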
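
The relation integration slides describe mapping differently worded sentences onto one underlying frame (Commercial transfer, with Buyer, Seller, and Product). The idea can be sketched as a lookup from verb-specific argument slots to shared frame roles. This is an illustration only, not the actual machinery of FrameBase or FrameNet: the `ROLE_MAPS` table, the `to_frame` helper, and the pre-parsed input format are assumptions made for the example, and passive sentences are assumed to have been converted to active voice beforehand.

```python
# Hypothetical mapping from a verb's grammatical slots to the roles of
# the "Commercial transfer" frame named in the talk.
ROLE_MAPS = {
    "buy": {"subject": "Buyer", "object": "Product", "from": "Seller"},
    "acquire": {"subject": "Buyer", "object": "Product", "from": "Seller"},
    "sell": {"subject": "Seller", "object": "Product", "to": "Buyer"},
}


def to_frame(verb, arguments):
    """Map the pre-parsed arguments of an active-voice clause to frame roles.

    `arguments` maps slot names ("subject", "object", or a preposition)
    to their fillers. Slots missing from the input are left out of the frame.
    """
    role_map = ROLE_MAPS[verb]
    frame = {"frame": "Commercial transfer"}
    for slot, filler in arguments.items():
        frame[role_map[slot]] = filler
    return frame


# "Microsoft bought the patent from Nokia."
bought = to_frame(
    "buy", {"subject": "Microsoft", "object": "the patent", "from": "Nokia"}
)
# "Nokia sold the patent to Microsoft."
sold = to_frame(
    "sell", {"subject": "Nokia", "object": "the patent", "to": "Microsoft"}
)
```

Both calls produce the same frame (Buyer: Microsoft, Seller: Nokia, Product: the patent), which is what lets the two sentences be treated as the same fact.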
