SlideShare a Scribd company logo
1 of 33
Download to read offline
A PowerPoint Presentation
PRESENTED BY Firstname Lastname⎪ August 25, 2013
Yahoo Knowledge Graph
M a k i n g K n o w l e d g e R e u s a b l e a t Y a h o o
PRESENTED BY Nicolas Torzec⎪ August 20, 2014
Background & Context
2
Google Knowledge Graph
3
Introducing the Knowledge Graph:
Things, Not Strings.
May 16th, 2012
Bing Knowledge Graph
4
Understand Your World with Bing.
March 21st, 2013
Yahoo Knowledge Graph
5
Yahoo Entity Search.
Soft launch in Nov. 2013
Other “Knowledge Graphs"
6
Rich, domain-specific, graphs Wolfram Alpha, BBC,
Rovi, TMS, Baseline, Gracenote,
Amazon, Walmart Labs
Interest graphs Gravity
Adchemy
Social graphs Facebook
LinkedIn
Reference knowledge graphs Freebase + Yago, Wikidata, DBpedia,
and other Wikipedia-based projects…
Common-sense knowledge graphs Cyc
Scope
7
Vision
8
§  A unified knowledge graph for Yahoo
›  All entities and topics relevant to Yahoo (users)
›  Rich information about entities: facts, relationships, features
›  Identifiers, interlinking across data sources, and links to relevant services
§  To power knowledge-based services at Yahoo
›  Search: display, and search for, information about entities
›  Discovery: relate entities, interconnect data sources, link to relevant services
›  Understanding: recognize entities in queries and text
§  Managed and served by a central knowledge team / platform
Value Proposition
9
§  Data breadth, depth, and accuracy
›  Combine information from multiple complementary/overlapping data sources
§  Centralized expertise
§  Common technologies
§  Same knowledge graph
§  Speed and agility at launching new and richer experiences
leveraged across the company
In a Nutshell
10
Knowledge
Acquisition
Knowledge
Integration
Knowledge
Consumption
Ongoing information extraction,
from complementary sources.
Reconciliation into a unified
knowledge repository.
Enrichment and serving…
Making knowledge reusable at Yahoo
11
Key Tasks
12
Knowledge Acquisition Knowledge Integration Knowledge Consumption
Data Quality Monitoring
Knowledge Repository
(common ontology)
Data Acquisition
Information Extraction
Schema Mapping
Blending
Entity Reconciliation
Enrichment
Editorial Curation
ExportServing
Data Acquisition
13
Knowledge Acquisition Knowledge Integration Knowledge Consumption
Data Quality Monitoring
Knowledge Repository
(common ontology)
Data Acquisition
Information Extraction
Schema Mapping
Blending
Entity Reconciliation
Enrichment
Editorial Curation
ExportServing
Data Acquisition
14
§  Multiple complementary data sources
›  Combine and cross-validate data from authoritative* sources
›  Reference data sources such as Wikipedia and Freebase form our backbone
›  Specialized data sources such as TMS and Music Brainz adds breadth/depth
›  Optimize for relevance, comprehensiveness, correctness, freshness, consistency
§  Ongoing acquisition of raw data
›  Feed acquisition from open data sources and paid providers
›  Web/Targeted crawling, online fetching, ad hoc acquisition (e.g. Wikipedia monitoring)
›  Deal w/ operational complexity: data size, bandwidth, update frequency, license, ©
Information Extraction
15
Knowledge Acquisition Knowledge Integration Knowledge Consumption
Data Quality Monitoring
Knowledge Repository
(common ontology)
Data Acquisition
Information Extraction
Schema Mapping
Blending
Entity Reconciliation
Enrichment
Editorial Curation
ExportServing
Information Extraction
16
§  Extraction of entities, attributes, relationships, features
›  Deal w/ scale, volatility, heterogeneity, inconsistency, schema complexity, breakage
›  Expensive to build and maintain (i.e. declarative rules, expert’s knowledge, ML…)
›  Being able to measure and monitor data quality is key
§  Mixed approach
1.  Parsing of large data feeds and online data APIs
2.  Structured data extraction on the Web: markup, Web scraping, Wrapper induction,
3.  Wikipedia mining, Web mining, News mining, open information extraction
Schema Mapping
17
Knowledge Acquisition Knowledge Integration Knowledge Consumption
Data Quality Monitoring
Knowledge Repository
(common ontology)
Data Acquisition
Information Extraction
Schema Mapping
Blending
Entity Reconciliation
Enrichment
Editorial Curation
ExportServing
Schema Mapping
18
§  Normalization to common ontology, schemas, and data types/units
›  Upfront normalization: uniform data facilitate downstream usage
›  Validation against the ontology to ensure well-formedness, validity, and consistency
§  Challenges
›  Noisy information extraction: e.g. strong types vs. inferred types
›  Discrepancies between source/target ontologies: e.g. can Pal_(dog) be an actor?
›  Schema complexity and schema evolutions…
Ontology alignment <Mad_Men, isA, TVSeries> Classifiers: heuristics + ML
Schema mapping <Jon_Hamm, birthplace, St._Louis> Template-driven ; mostly declarative
Data normalization <Jon_Hamm, birthdate, “1971-03-10”> Common plugins
Knowledge Representation
19
Knowledge Acquisition Knowledge Integration Knowledge Consumption
Data Quality Monitoring
Knowledge Repository
(common ontology)
Data Acquisition
Information Extraction
Schema Mapping
Blending
Entity Reconciliation
Enrichment
Editorial Curation
ExportServing
Knowledge Representation
20
§  Property Graph data model
›  JSON-LD serialization when needed
§  Common ontology
›  OWL ontology. Focuses on representation & validation, not reasoning
›  Covers domains relevant to Yahoo: 300 classes, 500 object properties, 800 data prop.
›  Modeling/managing temporality, provenance, license, localization
›  Soundness, expressiveness and comprehensiveness … vs. practicality
›  Collaborative development, conflicting modeling, schema evolution over time
Challenges
Knowledge Repository
21
Knowledge Acquisition Knowledge Integration Knowledge Consumption
Data Quality Monitoring
Knowledge Repository
(common ontology)
Data Acquisition
Information Extraction
Schema Mapping
Blending
Entity Reconciliation
Enrichment
Editorial Curation
ExportServing
Knowledge Repository
22
§  Present knowledge repository backed by a column-oriented store :-(
›  Store de-normalized graph persistently and provide some random access via 2ndary indices
›  Scale out nicely and smooth integration with Hadoop workflows
›  But simplistic data model and limited API make working with graph data tedious
§  Moving to a graph-oriented repository and workflow engine :-)
›  Scale to 100s of millions of nodes and billions of facts? (processing, storage, retrieval)
›  Mix large record-oriented ETL workflows and distributed graph processing?
›  Efficient graph traversal and query? Built-in inference mechanism?
›  Schema-less? Data versioning?
Challenges
Entity Reconciliation & Blending
23
Knowledge Acquisition Knowledge Integration Knowledge Consumption
Data Quality Monitoring
Knowledge Repository
(common ontology)
Data Acquisition
Information Extraction
Schema Mapping
Blending
Entity Reconciliation
Enrichment
Editorial Curation
ExportServing
24
raw data raw data Brad Pitt
according to!
Wikipedia
Brad Pitt
according to!
YK
normalized
graph
Source 4 Source 3
normalized
graph
raw data
normalized
graph
Source 2
raw data
normalized
graph
Wikipedia
Unified Graph
Entity Reconciliation & Blending
25
§  Disambiguate and merge entities across/within data sources
Blocking Select candidates most likely to refer to the same real world entity Fast approximate similarity search
Hashing techniques
Scoring Compute similarity score between all pair of candidates ML classifier or heuristics
Clustering Decide which candidates refer to the same entity and interlink them ML clustering or heuristics
Merging Build a unified object for each cluster. Populate with best properties ML selection or heuristics
§  Challenges
›  Hard Science and Tech problems !
›  Scale and adapt to new entity types, data sources, data sizes, update frequencies…
›  Ongoing reconciliation/blending/evaluation. Need for consistent entity IDs. Provenance.
Enrichment
26
Knowledge Acquisition Knowledge Integration Knowledge Consumption
Data Quality Monitoring
Knowledge Repository
(common ontology)
Data Acquisition
Information Extraction
Schema Mapping
Blending
Entity Reconciliation
Enrichment
Editorial Curation
ExportServing
Enrichment
27
§  Enrich the graph with complementary and/or inferred information
›  Generic enrichments vs. context-specific and application-specific enrichments
§  Examples:
›  Entity description cleanup and summarization
›  Ranking of related entities
›  Entity categorization
§  Challenges
›  Integrating, managing, and running a large number of, possibly conflicting, enrichers.
Editorial Curation
28
Knowledge Acquisition Knowledge Integration Knowledge Consumption
Data Quality Monitoring
Knowledge Repository
(common ontology)
Data Acquisition
Information Extraction
Schema Mapping
Blending
Entity Reconciliation
Enrichment
Editorial Curation
ExportServing
Editorial Curation
29
§  Enable editors to perform hot fixes
›  Interactive (and safe) GUI for updating entities and associated information
§  Internal Wall of Shame
›  Typical issues: incorrect/outdated facts, images, categorization (examples below)
›  Occasionally some reconciliation issues: Frankenstein objects!
§  Challenges
›  Instantly reflect editorial updates in knowledge graph and consuming systems
›  Re-evaluate and manage editorial updates over time since they typically blindly overwrite
›  Manage multiple concurrent, and possibly conflicting, editorial updates.
Serving & Publishing
30
Knowledge Acquisition Knowledge Integration Knowledge Consumption
Data Quality Monitoring
Knowledge Repository
(common ontology)
Data Acquisition
Information Extraction
Schema Mapping
Blending
Entity Reconciliation
Enrichment
Editorial Curation
ExportServing
Serving & Publishing
31
§  Online serving
›  Dedicated serving infrastructure powering various online data APIs
›  Search layer provides efficient random access to the graph (and limited traversal)
›  Federation layer integrates transient info from connected services at query time
›  Customization layer provides attribute-level filtering, transformation, formatting
§  Datapack generation
›  Regular datapack generation for offline batch consumption
›  Typically one single generic datapack with all the data
Knowledge-based services at Yahoo
32 Yahoo Confidential & Proprietary
§  Search:
›  display, and search for, information about entities
§  Discovery:
›  relate entities, interconnect data sources, link to relevant services
§  Understanding:
›  recognize entities in queries and text
Yahoo Knowledge Graph
M a k i n g K n o w l e d g e R e u s a b l e
Thank you.
t o r z e c n @ y a h o o - i n c . c o m
T w i t t e r : n i c o l a s t o r z e c
33

More Related Content

What's hot

Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining Phi Jack
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introductionbutest
 
Göteborg university(condensed)
Göteborg university(condensed)Göteborg university(condensed)
Göteborg university(condensed)Zenodia Charpy
 
Data Mining Concepts and Techniques
Data Mining Concepts and TechniquesData Mining Concepts and Techniques
Data Mining Concepts and TechniquesPratik Tambekar
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial Salah Amean
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationDr. Abdul Ahad Abro
 
data mining
data miningdata mining
data mininguoitc
 
What is New in W3C land?
What is New in W3C land?What is New in W3C land?
What is New in W3C land?Ivan Herman
 
Open Data and News Analytics Demo
Open Data and News Analytics DemoOpen Data and News Analytics Demo
Open Data and News Analytics DemoOntotext
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data miningHadi Fadlallah
 
Data Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesData Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesDeepaR42
 
Management of bibliographic metadata - Metadata management at the Leibniz Inf...
Management of bibliographic metadata - Metadata management at the Leibniz Inf...Management of bibliographic metadata - Metadata management at the Leibniz Inf...
Management of bibliographic metadata - Metadata management at the Leibniz Inf...suvanni
 
Empowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentEmpowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentThe Digital Group
 
Choosing the Right Graph Database to Succeed in Your Project
Choosing the Right Graph Database to Succeed in Your ProjectChoosing the Right Graph Database to Succeed in Your Project
Choosing the Right Graph Database to Succeed in Your ProjectOntotext
 

What's hot (20)

Introduction To Data Mining
Introduction To Data Mining   Introduction To Data Mining
Introduction To Data Mining
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
 
Göteborg university(condensed)
Göteborg university(condensed)Göteborg university(condensed)
Göteborg university(condensed)
 
Data Mining Concepts and Techniques
Data Mining Concepts and TechniquesData Mining Concepts and Techniques
Data Mining Concepts and Techniques
 
introduction to data mining tutorial
introduction to data mining tutorial introduction to data mining tutorial
introduction to data mining tutorial
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
 
Data Mining Overview
Data Mining OverviewData Mining Overview
Data Mining Overview
 
Chapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data MiningChapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data Mining
 
data mining
data miningdata mining
data mining
 
What is New in W3C land?
What is New in W3C land?What is New in W3C land?
What is New in W3C land?
 
web mining
web miningweb mining
web mining
 
Open Data and News Analytics Demo
Open Data and News Analytics DemoOpen Data and News Analytics Demo
Open Data and News Analytics Demo
 
Introduction to Data mining
Introduction to Data miningIntroduction to Data mining
Introduction to Data mining
 
Data mining
Data miningData mining
Data mining
 
Data Mining : Concepts and Techniques
Data Mining : Concepts and TechniquesData Mining : Concepts and Techniques
Data Mining : Concepts and Techniques
 
Management of bibliographic metadata - Metadata management at the Leibniz Inf...
Management of bibliographic metadata - Metadata management at the Leibniz Inf...Management of bibliographic metadata - Metadata management at the Leibniz Inf...
Management of bibliographic metadata - Metadata management at the Leibniz Inf...
 
Data Mining
Data MiningData Mining
Data Mining
 
Kdd process
Kdd processKdd process
Kdd process
 
Empowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentEmpowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic Enrichment
 
Choosing the Right Graph Database to Succeed in Your Project
Choosing the Right Graph Database to Succeed in Your ProjectChoosing the Right Graph Database to Succeed in Your Project
Choosing the Right Graph Database to Succeed in Your Project
 

Similar to The Yahoo Knowledge Graph (SemTech 2014)

Yahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesYahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesKarthik Murugesan
 
Web-scale semantic search
Web-scale semantic searchWeb-scale semantic search
Web-scale semantic searchEdgar Meij
 
Prov-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance VisualizationProv-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance VisualizationRinke Hoekstra
 
Applications of Semantic Technology in the Real World Today
Applications of Semantic Technology in the Real World TodayApplications of Semantic Technology in the Real World Today
Applications of Semantic Technology in the Real World TodayAmit Sheth
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph IntroductionSören Auer
 
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.Laurent Alquier
 
AstraZeneca at Neo4j GraphSummit London 14Nov23.pptx
AstraZeneca at Neo4j GraphSummit London 14Nov23.pptxAstraZeneca at Neo4j GraphSummit London 14Nov23.pptx
AstraZeneca at Neo4j GraphSummit London 14Nov23.pptxNeo4j
 
Mending the Gap between Library's Electronic and Print Collections in ILS and...
Mending the Gap between Library's Electronic and Print Collections in ILS and...Mending the Gap between Library's Electronic and Print Collections in ILS and...
Mending the Gap between Library's Electronic and Print Collections in ILS and...New York University
 
Visualization for Data Analysis: A New Way to Look at Content
Visualization for Data Analysis: A New Way to Look at Content Visualization for Data Analysis: A New Way to Look at Content
Visualization for Data Analysis: A New Way to Look at Content Access Innovations, Inc.
 
PowerPoint Template
PowerPoint TemplatePowerPoint Template
PowerPoint Templatebutest
 
Linked data for Enterprise Data Integration
Linked data for Enterprise Data IntegrationLinked data for Enterprise Data Integration
Linked data for Enterprise Data IntegrationSören Auer
 
Falling in and out and in love with Information Architecture
Falling in and out and in love with Information ArchitectureFalling in and out and in love with Information Architecture
Falling in and out and in love with Information ArchitectureLouis Rosenfeld
 
Inteligent Catalogue Final
Inteligent Catalogue FinalInteligent Catalogue Final
Inteligent Catalogue Finalguestcaef1d
 
Session 0.0 poster minutes madness
Session 0.0   poster minutes madnessSession 0.0   poster minutes madness
Session 0.0 poster minutes madnesssemanticsconference
 
Introduction to Big Data: Smart Factory
Introduction to Big Data: Smart FactoryIntroduction to Big Data: Smart Factory
Introduction to Big Data: Smart FactoryJongwook Woo
 
Semantic technologies for the Internet of Things
Semantic technologies for the Internet of Things Semantic technologies for the Internet of Things
Semantic technologies for the Internet of Things PayamBarnaghi
 
Vital AI: Big Data Modeling
Vital AI: Big Data ModelingVital AI: Big Data Modeling
Vital AI: Big Data ModelingVital.AI
 

Similar to The Yahoo Knowledge Graph (SemTech 2014) (20)

Yahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slidesYahoo's Knowledge Graph - 2014 slides
Yahoo's Knowledge Graph - 2014 slides
 
Web-scale semantic search
Web-scale semantic searchWeb-scale semantic search
Web-scale semantic search
 
PhD thesis defense of Christopher Thomas
PhD thesis defense of Christopher ThomasPhD thesis defense of Christopher Thomas
PhD thesis defense of Christopher Thomas
 
Prov-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance VisualizationProv-O-Viz: Interactive Provenance Visualization
Prov-O-Viz: Interactive Provenance Visualization
 
Applications of Semantic Technology in the Real World Today
Applications of Semantic Technology in the Real World TodayApplications of Semantic Technology in the Real World Today
Applications of Semantic Technology in the Real World Today
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
 
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
 
AstraZeneca at Neo4j GraphSummit London 14Nov23.pptx
AstraZeneca at Neo4j GraphSummit London 14Nov23.pptxAstraZeneca at Neo4j GraphSummit London 14Nov23.pptx
AstraZeneca at Neo4j GraphSummit London 14Nov23.pptx
 
Mending the Gap between Library's Electronic and Print Collections in ILS and...
Mending the Gap between Library's Electronic and Print Collections in ILS and...Mending the Gap between Library's Electronic and Print Collections in ILS and...
Mending the Gap between Library's Electronic and Print Collections in ILS and...
 
Visualization for Data Analysis: A New Way to Look at Content
Visualization for Data Analysis: A New Way to Look at Content Visualization for Data Analysis: A New Way to Look at Content
Visualization for Data Analysis: A New Way to Look at Content
 
PowerPoint Template
PowerPoint TemplatePowerPoint Template
PowerPoint Template
 
Role of Semantic Web in Health Informatics
Role of Semantic Web in Health InformaticsRole of Semantic Web in Health Informatics
Role of Semantic Web in Health Informatics
 
Linked data for Enterprise Data Integration
Linked data for Enterprise Data IntegrationLinked data for Enterprise Data Integration
Linked data for Enterprise Data Integration
 
BrightTALK - Semantic AI
BrightTALK - Semantic AI BrightTALK - Semantic AI
BrightTALK - Semantic AI
 
Falling in and out and in love with Information Architecture
Falling in and out and in love with Information ArchitectureFalling in and out and in love with Information Architecture
Falling in and out and in love with Information Architecture
 
Inteligent Catalogue Final
Inteligent Catalogue FinalInteligent Catalogue Final
Inteligent Catalogue Final
 
Session 0.0 poster minutes madness
Session 0.0   poster minutes madnessSession 0.0   poster minutes madness
Session 0.0 poster minutes madness
 
Introduction to Big Data: Smart Factory
Introduction to Big Data: Smart FactoryIntroduction to Big Data: Smart Factory
Introduction to Big Data: Smart Factory
 
Semantic technologies for the Internet of Things
Semantic technologies for the Internet of Things Semantic technologies for the Internet of Things
Semantic technologies for the Internet of Things
 
Vital AI: Big Data Modeling
Vital AI: Big Data ModelingVital AI: Big Data Modeling
Vital AI: Big Data Modeling
 

Recently uploaded

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxgindu3009
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSarthak Sekhar Mondal
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionPriyansha Singh
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real timeSatoshi NAKAHIRA
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Lokesh Kothari
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptxkhadijarafiq2012
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 

Recently uploaded (20)

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatidSpermiogenesis or Spermateleosis or metamorphosis of spermatid
Spermiogenesis or Spermateleosis or metamorphosis of spermatid
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Caco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorptionCaco-2 cell permeability assay for drug absorption
Caco-2 cell permeability assay for drug absorption
 
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
9953056974 Young Call Girls In Mahavir enclave Indian Quality Escort service
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Grafana in space: Monitoring Japan's SLIM moon lander in real time
Grafana in space: Monitoring Japan's SLIM moon lander  in real timeGrafana in space: Monitoring Japan's SLIM moon lander  in real time
Grafana in space: Monitoring Japan's SLIM moon lander in real time
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Types of different blotting techniques.pptx
Types of different blotting techniques.pptxTypes of different blotting techniques.pptx
Types of different blotting techniques.pptx
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 

The Yahoo Knowledge Graph (SemTech 2014)

  • 1. A PowerPoint Presentation PRESENTED BY Firstname Lastname⎪ August 25, 2013 Yahoo Knowledge Graph M a k i n g K n o w l e d g e R e u s a b l e a t Y a h o o PRESENTED BY Nicolas Torzec⎪ August 20, 2014
  • 3. Google Knowledge Graph 3 Introducing the Knowledge Graph: Things, Not Strings. May 16th, 2012
  • 4. Bing Knowledge Graph 4 Understand Your World with Bing. March 21st, 2013
  • 5. Yahoo Knowledge Graph 5 Yahoo Entity Search. Soft launch in Nov. 2013
  • 6. Other “Knowledge Graphs" 6 Rich, domain-specific, graphs Wolfram Alpha, BBC, Rovi, TMS, Baseline, Gracenote, Amazon, Walmart Labs Interest graphs Gravity Adchemy Social graphs Facebook LinkedIn Reference knowledge graphs Freebase + Yago, Wikidata, DBpedia, and other Wikipedia-based projects… Common-sense knowledge graphs Cyc
  • 8. Vision 8 §  A unified knowledge graph for Yahoo ›  All entities and topics relevant to Yahoo (users) ›  Rich information about entities: facts, relationships, features ›  Identifiers, interlinking across data sources, and links to relevant services §  To power knowledge-based services at Yahoo ›  Search: display, and search for, information about entities ›  Discovery: relate entities, interconnect data sources, link to relevant services ›  Understanding: recognize entities in queries and text §  Managed and served by a central knowledge team / platform
  • 9. Value Proposition 9 §  Data breadth, depth, and accuracy ›  Combine information from multiple complementary/overlapping data sources §  Centralized expertise §  Common technologies §  Same knowledge graph §  Speed and agility at launching new and richer experiences leveraged across the company
  • 10. In a Nutshell 10 Knowledge Acquisition Knowledge Integration Knowledge Consumption Ongoing information extraction, from complementary sources. Reconciliation into a unified knowledge repository. Enrichment and serving…
  • 12. Key Tasks 12 Knowledge Acquisition Knowledge Integration Knowledge Consumption Data Quality Monitoring Knowledge Repository (common ontology) Data Acquisition Information Extraction Schema Mapping Blending Entity Reconciliation Enrichment Editorial Curation ExportServing
  • 13. Data Acquisition 13 Knowledge Acquisition Knowledge Integration Knowledge Consumption Data Quality Monitoring Knowledge Repository (common ontology) Data Acquisition Information Extraction Schema Mapping Blending Entity Reconciliation Enrichment Editorial Curation ExportServing
  • 14. Data Acquisition 14 §  Multiple complementary data sources ›  Combine and cross-validate data from authoritative* sources ›  Reference data sources such as Wikipedia and Freebase form our backbone ›  Specialized data sources such as TMS and Music Brainz adds breadth/depth ›  Optimize for relevance, comprehensiveness, correctness, freshness, consistency §  Ongoing acquisition of raw data ›  Feed acquisition from open data sources and paid providers ›  Web/Targeted crawling, online fetching, ad hoc acquisition (e.g. Wikipedia monitoring) ›  Deal w/ operational complexity: data size, bandwidth, update frequency, license, ©
  • 15. Information Extraction 15 Knowledge Acquisition Knowledge Integration Knowledge Consumption Data Quality Monitoring Knowledge Repository (common ontology) Data Acquisition Information Extraction Schema Mapping Blending Entity Reconciliation Enrichment Editorial Curation ExportServing
  • 16. Information Extraction 16 §  Extraction of entities, attributes, relationships, features ›  Deal w/ scale, volatility, heterogeneity, inconsistency, schema complexity, breakage ›  Expensive to build and maintain (i.e. declarative rules, expert’s knowledge, ML…) ›  Being able to measure and monitor data quality is key §  Mixed approach 1.  Parsing of large data feeds and online data APIs 2.  Structured data extraction on the Web: markup, Web scraping, Wrapper induction, 3.  Wikipedia mining, Web mining, News mining, open information extraction
  • 17. Schema Mapping 17 Knowledge Acquisition Knowledge Integration Knowledge Consumption Data Quality Monitoring Knowledge Repository (common ontology) Data Acquisition Information Extraction Schema Mapping Blending Entity Reconciliation Enrichment Editorial Curation ExportServing
  • 18. Schema Mapping 18 §  Normalization to common ontology, schemas, and data types/units ›  Upfront normalization: uniform data facilitate downstream usage ›  Validation against the ontology to ensure well-formedness, validity, and consistency §  Challenges ›  Noisy information extraction: e.g. strong types vs. inferred types ›  Discrepancies between source/target ontologies: e.g. can Pal_(dog) be an actor? ›  Schema complexity and schema evolutions… Ontology alignment <Mad_Men, isA, TVSeries> Classifiers: heuristics + ML Schema mapping <Jon_Hamm, birthplace, St._Louis> Template-driven ; mostly declarative Data normalization <Jon_Hamm, birthdate, “1971-03-10”> Common plugins
  • 19. Knowledge Representation 19 Knowledge Acquisition Knowledge Integration Knowledge Consumption Data Quality Monitoring Knowledge Repository (common ontology) Data Acquisition Information Extraction Schema Mapping Blending Entity Reconciliation Enrichment Editorial Curation ExportServing
  • 20. Knowledge Representation 20 §  Property Graph data model ›  JSON-LD serialization when needed §  Common ontology ›  OWL ontology. Focuses on representation & validation, not reasoning ›  Covers domains relevant to Yahoo: 300 classes, 500 object properties, 800 data prop. ›  Modeling/managing temporality, provenance, license, localization ›  Soundness, expressiveness and comprehensiveness … vs. practicality ›  Collaborative development, conflicting modeling, schema evolution over time Challenges
  • 21. Knowledge Repository 21 Knowledge Acquisition Knowledge Integration Knowledge Consumption Data Quality Monitoring Knowledge Repository (common ontology) Data Acquisition Information Extraction Schema Mapping Blending Entity Reconciliation Enrichment Editorial Curation ExportServing
  • 22. Knowledge Repository 22 §  Present knowledge repository backed by a column-oriented store :-( ›  Store de-normalized graph persistently and provide some random access via 2ndary indices ›  Scale out nicely and smooth integration with Hadoop workflows ›  But simplistic data model and limited API make working with graph data tedious §  Moving to a graph-oriented repository and workflow engine :-) ›  Scale to 100s of millions of nodes and billions of facts? (processing, storage, retrieval) ›  Mix large record-oriented ETL workflows and distributed graph processing? ›  Efficient graph traversal and query? Built-in inference mechanism? ›  Schema-less? Data versioning? Challenges
  • 23. Entity Reconciliation & Blending 23 Knowledge Acquisition Knowledge Integration Knowledge Consumption Data Quality Monitoring Knowledge Repository (common ontology) Data Acquisition Information Extraction Schema Mapping Blending Entity Reconciliation Enrichment Editorial Curation ExportServing
  • 24. 24 raw data raw data Brad Pitt according to! Wikipedia Brad Pitt according to! YK normalized graph Source 4 Source 3 normalized graph raw data normalized graph Source 2 raw data normalized graph Wikipedia Unified Graph
  • 25. Entity Reconciliation & Blending 25 §  Disambiguate and merge entities across/within data sources Blocking Select candidates most likely to refer to the same real world entity Fast approximate similarity search Hashing techniques Scoring Compute similarity score between all pair of candidates ML classifier or heuristics Clustering Decide which candidates refer to the same entity and interlink them ML clustering or heuristics Merging Build a unified object for each cluster. Populate with best properties ML selection or heuristics §  Challenges ›  Hard Science and Tech problems ! ›  Scale and adapt to new entity types, data sources, data sizes, update frequencies… ›  Ongoing reconciliation/blending/evaluation. Need for consistent entity IDs. Provenance.
  • 26. Enrichment 26 Knowledge Acquisition Knowledge Integration Knowledge Consumption Data Quality Monitoring Knowledge Repository (common ontology) Data Acquisition Information Extraction Schema Mapping Blending Entity Reconciliation Enrichment Editorial Curation ExportServing
  • 27. Enrichment 27 §  Enrich the graph with complementary and/or inferred information ›  Generic enrichments vs. context-specific and application-specific enrichments §  Examples: ›  Entity description cleanup and summarization ›  Ranking of related entities ›  Entity categorization §  Challenges ›  Integrating, managing, and running a large number of, possibly conflicting, enrichers.
  • 28. Editorial Curation 28 Knowledge Acquisition Knowledge Integration Knowledge Consumption Data Quality Monitoring Knowledge Repository (common ontology) Data Acquisition Information Extraction Schema Mapping Blending Entity Reconciliation Enrichment Editorial Curation ExportServing
  • 29. Editorial Curation 29 §  Enable editors to perform hot fixes ›  Interactive (and safe) GUI for updating entities and associated information §  Internal Wall of Shame ›  Typical issues: incorrect/outdated facts, images, categorization (examples below) ›  Occasionally some reconciliation issues: Frankenstein objects! §  Challenges ›  Instantly reflect editorial updates in knowledge graph and consuming systems ›  Re-evaluate and manage editorial updates over time since they typically blindly overwrite ›  Manage multiple concurrent, and possibly conflicting, editorial updates.
  • 30. Serving & Publishing 30 Knowledge Acquisition Knowledge Integration Knowledge Consumption Data Quality Monitoring Knowledge Repository (common ontology) Data Acquisition Information Extraction Schema Mapping Blending Entity Reconciliation Enrichment Editorial Curation ExportServing
  • 31. Serving & Publishing 31 §  Online serving ›  Dedicated serving infrastructure powering various online data APIs ›  Search layer provides efficient random access to the graph (and limited traversal) ›  Federation layer integrates transient info from connected services at query time ›  Customization layer provides attribute-level filtering, transformation, formatting §  Datapack generation ›  Regular datapack generation for offline batch consumption ›  Typically one single generic datapack with all the data
  • 32. Knowledge-based services at Yahoo 32 Yahoo Confidential & Proprietary §  Search: ›  display, and search for, information about entities §  Discovery: ›  relate entities, interconnect data sources, link to relevant services §  Understanding: ›  recognize entities in queries and text
  • 33. Yahoo Knowledge Graph M a k i n g K n o w l e d g e R e u s a b l e Thank you. t o r z e c n @ y a h o o - i n c . c o m T w i t t e r : n i c o l a s t o r z e c 33