1
© Searchmetrics. All rights reserved. Do not distribute without permission.
Enriching content with Knowledge Base
by Search Keywords and Wikidata
Fang Xu
f.xu@searchmetrics.com
@allxufang
2
Data Science@Searchmetrics
Data driven search and content optimization marketing
• Learning from keywords
• Content optimization
• Data visualization
3
Looooots of Data
• 120 Million Domains
• 600 Million Keywords
• 120 Billion Links
• 25,000 Billion Social Signals
• 25 PB raw data
4
Authors submit content
✓ Rate the content’s effectiveness
✓ Feedback to optimize and enrich it
Content Production in Real-time
5
Beyond keywords
• Keyword
• Typos
• Ambiguous
• Sparse
• Entity
• Augmented with metadata
• Relations among entities
6
Q64
Entity
7
8
http://brendangriffen.com/blog/gow-programming-languages
Knowledge Base (KB)
9
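KB Timeline
[Figure: timeline of knowledge bases with entries dated 2001, 2005, 2008, 2012, 2012, and 2014; the only label that survives extraction is "Knowledge vaults".]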
10
• Free collaborative KB
• Continuous evolution
• Open multilingual data
• Mapping to other KBs
Why Wikidata
11
Link content to KB
• Entity Linking -- free text to entities
• Blog posts
• Tweets
• Keywords
• User-generated content
• Entities from a knowledge base
• Wikipedia
• Wikidata
• Domain-specific KBs
12
Image from Milne and Witten (2008b). Learning to Link with Wikipedia. In CIKM 2008
Entity Linking
13
• Identify important keywords to link in the text
• Link to the right entity
Main Problems
14
Dictionary of keywords to KB entities
Search keyword mentions in text
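A dictionary lookup of this kind can be sketched as follows. This is a minimal illustration, not the system's actual matcher: the function name `find_mentions`, the longest-match-first strategy, the `max_len` limit, and the dictionary entries are all assumptions made for the example.

```python
def find_mentions(text, keyword_to_entity, max_len=3):
    """Return (token_index, keyword, entity) for dictionary keywords found in text.

    At each position the longest matching n-gram (up to max_len tokens) wins.
    """
    tokens = text.lower().split()
    mentions = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest n-gram first so "european union" beats "european".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in keyword_to_entity:
                mentions.append((i, candidate, keyword_to_entity[candidate]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return mentions


# Hypothetical dictionary entries, for illustration only.
KEYWORD_TO_ENTITY = {
    "berlin": "Berlin",
    "germany": "Germany",
    "european union": "European_Union",
}

mentions = find_mentions(
    "Berlin is the capital of Germany in the European Union", KEYWORD_TO_ENTITY
)
print(mentions)
```

Matching on whitespace tokens keeps the sketch short; a real matcher would also need to handle punctuation and overlapping candidates.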
15
Keyword to Wikipedia URIs in the top SERP results
16
Not all keywords are useful
Keyword Cleaning:
• Navigational or factual words
• Non-frequent words
• Non-Latin letters
17
Keyword Filtering:
• Starting or ending tokens
• Stopwords
• Part-of-speech tags
• Wikipedia popularity:
• popular wiki uris for one keyword
• Search popularity:
• popular keywords for one wiki uri
Not all keywords are useful
18
Search Popularity Filtering
Keyword                 Search Popularity (Volume)
germany                 268583
germany facts           4291
germany article         24
german encyclopedia     23
germany encyclopedia    19
germany t               18
ger many                16
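One way to turn these volumes into a filter is to keep only keywords whose volume is a reasonable fraction of the most popular keyword for the same wiki URI. The sketch below uses the volumes from the slide. The relative-share rule and the `min_share` threshold of 1% are assumptions for illustration; the talk does not state the actual cutoff.

```python
def filter_by_search_popularity(volumes, min_share=0.01):
    """Keep keywords whose volume is at least min_share of the top keyword's volume."""
    top = max(volumes.values())
    return [kw for kw, vol in volumes.items() if vol / top >= min_share]


# Search volumes for keywords assumed to map to the same wiki URI (from the slide).
germany_keywords = {
    "germany": 268583,
    "germany facts": 4291,
    "germany article": 24,
    "german encyclopedia": 23,
    "germany encyclopedia": 19,
    "germany t": 18,
    "ger many": 16,
}

kept = filter_by_search_popularity(germany_keywords)
print(kept)
```

With this hypothetical threshold only the two high-volume keywords survive, and low-volume variants such as typos are dropped.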
19
Entity data
Parse the Wikidata dump and extract entities as JSON:
{
  "entity": "Berlin",
  "Freebase Id": "/m/0156q",
  "OpenStreetMap Relation identifier": 62422,
  "alias": ["Berlin, Germany"],
  "capital of":
    [ "Germany", "Kingdom of Prussia", "Weimar Republic",
      "Brandenburg-Prussia", "Free State of Prussia", ... ],
  "contains administrative territorial entity":
    [ "Mitte", "Friedrichshain-Kreuzberg", "Pankow",
      "Charlottenburg-Wilmersdorf", "Spandau", "Steglitz-Zehlendorf",
      "Tempelhof-Schöneberg", "Neukölln", "Treptow-Köpenick", ... ],
  "coordinate location":
    [ {
      "altitude": null,
      "latitude": 52.516666666667,
      "longitude": 13.383333333333,
      "precision": 0.016666666666667
    } ],
  "country": "Germany",
  ... ... }
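The extraction step might look roughly like the sketch below. It assumes the layout of the Wikidata JSON dumps, where the file is one large JSON array with a single entity per line, and it reads only the English label and aliases. The helper names `iter_entities` and `summarize` and the tiny in-memory sample are hypothetical; resolving property IDs into readable names such as "capital of" is not shown.

```python
import io
import json


def iter_entities(lines):
    """Yield one entity dict per line of a Wikidata-style JSON dump."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("[", "]", ""):
            continue  # skip the array brackets that wrap the dump
        yield json.loads(line)


def summarize(entity, lang="en"):
    """Keep only the ID, label, and aliases in one language."""
    label = entity.get("labels", {}).get(lang, {}).get("value")
    aliases = [a["value"] for a in entity.get("aliases", {}).get(lang, [])]
    return {"id": entity["id"], "entity": label, "alias": aliases}


# Tiny hypothetical stand-in for a dump file; a real dump is far larger.
sample_dump = io.StringIO(
    '[\n'
    '{"id": "Q64", "labels": {"en": {"language": "en", "value": "Berlin"}}, '
    '"aliases": {"en": [{"language": "en", "value": "Berlin, Germany"}]}},\n'
    ']\n'
)

entities = [summarize(e) for e in iter_entities(sample_dump)]
print(entities)
```

Reading line by line avoids loading the whole dump into memory, which matters because the full file does not fit comfortably in RAM on most machines.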
20
Link to the right Wikipedia entity
Word Sense Disambiguation
21
Link to Most Common Entities
Surface text: "tree"

Entity                   Wikipedia Commonness
Tree                     92.82%
Tree (graph theory)      2.94%
Tree (data structure)    2.57%
Tree (set theory)        0.15%
Phylogenetic tree        0.07%
Christmas tree           0.07%
Binary tree              0.04%
Family tree              0.04%
…                        ...

Commonness (Milne and Witten 2008b):

\[ \mathrm{Commonness}(e, w) = \frac{|L_{w,e}|}{\sum_{i} |L_{w,e_i}|} \]

where |L_{w,e}| is the number of links with surface text w to entity e.
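As a rough sketch of the commonness score, the share of links with a given surface text that point to each entity can be computed from link counts. The counts below are hypothetical round numbers chosen for the example, not the Wikipedia figures behind the slide.

```python
def commonness(link_counts):
    """Map each candidate entity to its share of the links for one surface text."""
    total = sum(link_counts.values())
    return {entity: count / total for entity, count in link_counts.items()}


# Hypothetical link counts for the surface text "tree".
tree_link_counts = {
    "Tree": 900,
    "Tree_(graph_theory)": 60,
    "Tree_(data_structure)": 40,
}

scores = commonness(tree_link_counts)
most_common = max(scores, key=scores.get)
print(most_common, scores[most_common])
```

Always linking to the most common entity is a strong baseline, but it ignores the surrounding text.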
22
https://en.wikipedia.org/wiki/Tree_data_structure
https://en.wikipedia.org/wiki/Tree
Disambiguation
23
Disambiguation using context
24
• Build a Word2Vec model for Wikipedia entities
• Calculate Word2Vec similarity to contextual entities
similarity(Tree, context) vs. similarity(Tree_data_structure, context)
Entity Disambiguation
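The comparison can be sketched with plain cosine similarity. Here the candidate with the highest average similarity to the context entities is chosen. The two-dimensional vectors are toy values invented for the example (real embeddings would come from the trained Word2Vec model), and averaging over context entities is an assumption about how the scores are combined.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(candidates, context, vectors):
    """Pick the candidate entity most similar, on average, to the context entities."""
    def score(candidate):
        sims = [cosine(vectors[candidate], vectors[c]) for c in context]
        return sum(sims) / len(sims)
    return max(candidates, key=score)


# Toy entity vectors, hypothetical values for illustration only.
VECTORS = {
    "Tree": [0.9, 0.1],
    "Tree_data_structure": [0.1, 0.9],
    "Binary_search": [0.2, 0.8],
    "Algorithm": [0.0, 1.0],
}

chosen = disambiguate(
    ["Tree", "Tree_data_structure"], ["Binary_search", "Algorithm"], VECTORS
)
print(chosen)
```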
25
Relatedness between Entities
26
Image from Milne and Witten (2008a). An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links
Entity Relatedness
27
• Jaccard similarity
• Word2Vec similarity of entity to context
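\[ \mathrm{Jaccard}(e, e') = \frac{|L_{e} \cap L_{e'}|}{|L_{e} \cup L_{e'}|} \]

where L_{e} is the set of links to entity e: the intersection of links to entities e and e' divided by the union of links to entities e and e'.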
Relatedness Score
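A minimal sketch of the Jaccard part of the score, treating each entity as the set of pages that link to it. The link sets below are hypothetical examples, and how the Jaccard score is combined with the Word2Vec similarity is not shown because the slides do not specify it.

```python
def jaccard(links_a, links_b):
    """Jaccard similarity of two link collections: |intersection| / |union|."""
    a, b = set(links_a), set(links_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical sets of pages linking to each entity.
links_to_first = {"Germany", "Prussia", "Berlin_Wall"}
links_to_second = {"Germany", "Prussia", "Bavaria", "Oktoberfest"}

relatedness = jaccard(links_to_first, links_to_second)
print(relatedness)
```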
28
Wikipedia Data Parsing
29
Wikipedia Dump
'''Berlin''' is the [[Capital city|capital]] of [[Germany]] and one of its 16
[[states of Germany|states]]. With a population of approximately 3.5
million people,<ref name="Population" /> Berlin is the second [[Largest
cities of the European Union by population within city limits|most
populous city proper]] and the seventh [[Largest urban areas of the
European Union|most populous urban area]] in the [[European Union]].
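Wiki links in this markup have the form `[[Target]]` or `[[Target|surface text]]`, so the annotations can be pulled out with a regular expression. The sketch below is a simplified stand-in for a real parser such as the Annotated-WikiExtractor listed under Resources: the pattern and the function name `extract_links` are assumptions, and templates, nested links, and `<ref>` tags are ignored.

```python
import re

# [[Target]] or [[Target|surface text]]; nested brackets are not handled.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_links(wikitext):
    """Return (surface_text, target_article) pairs for each wiki link."""
    pairs = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        surface = (match.group(2) or target).strip()
        pairs.append((surface, target))
    return pairs


sample = (
    "'''Berlin''' is the [[Capital city|capital]] of [[Germany]] "
    "and one of its 16 [[states of Germany|states]]."
)
links = extract_links(sample)
print(links)
```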
30
Wikipedia Article as Json
31
Word2Vec Training
• Collection of plain article text
... ...
can4linux ||open_source|| ||controller_area_network|| ||linux_kernel||
||device_driver||
development started 1990s philips 82c200 controller stand chip
1995 version created bus linux laboratory automation project linux lab project
||freie_universität_berlin||
nxp sja1000 successor supported controller philips 82c200 intel 82527
development powerful ||microcontroller||s integrated controllers capable
... ...
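A corpus in this shape could be produced roughly as sketched below: each wiki link is replaced by a single `||entity||` token, and the remaining words are lowercased and stripped of stopwords. The tiny stopword list, the regular expression, and the function name `to_training_tokens` are assumptions for illustration. The resulting token lists could then be passed to a Word2Vec implementation such as gensim.

```python
import re

LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]+)?\]\]")
STOPWORDS = {"a", "and", "in", "is", "of", "the"}  # hypothetical minimal list


def to_training_tokens(wikitext):
    """Turn wikitext into tokens, with each link replaced by an ||entity|| token."""
    def entity_token(match):
        name = match.group(1).strip().lower().replace(" ", "_")
        return " ||" + name + "|| "

    text = LINK_RE.sub(entity_token, wikitext)
    text = text.replace("'''", "").replace("''", "")  # drop bold/italic markup
    tokens = []
    for tok in text.split():
        if tok.startswith("||") and tok.endswith("||"):
            tokens.append(tok)  # keep entity tokens untouched
        else:
            word = re.sub(r"[^\w]", "", tok.lower())
            if word and word not in STOPWORDS:
                tokens.append(word)
    return tokens


tokens = to_training_tokens(
    "'''Berlin''' is the [[Capital city|capital]] of [[Germany]]"
)
print(tokens)
```

Keeping each entity as one token means the model learns a vector per Wikipedia entity rather than per word of its title.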
32
Linking vectors
• Pairs of uri, annotations
outlink vector [Capital_City, Germany , States_of_Germany, European_Union,
Spree, Havel, Berlin-Brandenburg_Metropolitan_Region, ... ... ]
inlink vector [Germany, Prussia, Berlin_Wall, Albert_Einstein, Kosmos_(Berlin),
Berlin_International_Film_Festival, .. .. ]
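Once the outlinks of every article are known, the inlink vectors follow by inverting the mapping. A minimal sketch, using a small hypothetical link graph rather than real dump data:

```python
from collections import defaultdict


def build_inlinks(outlinks):
    """Invert article -> outlinked articles into article -> set of inlinking articles."""
    inlinks = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            inlinks[target].add(source)
    return dict(inlinks)


# Hypothetical outlink lists for three articles.
OUTLINKS = {
    "Berlin": ["Germany", "European_Union"],
    "Germany": ["Berlin", "European_Union"],
    "Berlin_Wall": ["Berlin"],
}

inlinks = build_inlinks(OUTLINKS)
print(sorted(inlinks["Berlin"]))
```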
33
Wikipedia Popularity
• Aggregation of annotations
Surface text     Wiki entity      Popularity
United States    United_States    174338
World War II     World_War_II     106483
India            India            95966
France           France           94666
American         United_States    85976
Iran             Iran             83249
Australia        Australia        76655
Germany          Germany          76384
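The aggregation amounts to counting how often each (surface text, wiki entity) pair occurs across all annotations. A minimal sketch with a handful of hypothetical annotations standing in for the full set extracted from Wikipedia:

```python
from collections import Counter


def aggregate_popularity(annotations):
    """Count occurrences of each (surface_text, wiki_entity) annotation pair."""
    return Counter(annotations)


# Hypothetical annotations, for illustration only.
annotations = [
    ("American", "United_States"),
    ("United States", "United_States"),
    ("American", "United_States"),
    ("Germany", "Germany"),
]

popularity = aggregate_popularity(annotations)
for (surface, entity), count in popularity.most_common():
    print(surface, entity, count)
```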
34
Overall System
[Diagram: system architecture. Labeled components: Keyword Database, Keyword Processing, Keyword to KB entities, Wiki Parser, Wikipedia Popularity, W2V Model, Wiki Links, and the Entity Linking API (User Content, Parser, Keyword Matching, Disambiguation, Relatedness calculation, Result).]
35
• https://github.com/piskvorky/gensim
• https://github.com/jodaiber/Annotated-WikiExtractor
• https://dumps.wikimedia.org/
• https://dumps.wikimedia.org/wikidatawiki/entities/
Resources
36
Thank you
37
Questions?
f.xu@searchmetrics.com
We are hiring

Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 

Fang Xu - Enriching content with Knowledge Base by Search Keywords and Wikidata

  • 1. Enriching content with Knowledge Base by Search Keywords and Wikidata
       Fang Xu, f.xu@searchmetrics.com, @allxufang
       © Searchmetrics. All rights reserved. Do not distribute without permission.
  • 2. Data Science@Searchmetrics
       Data-driven search and content optimization marketing
       • Learning from keywords
       • Content optimization
       • Data visualization
  • 3. Looooots of Data
       • 120 Million Domains
       • 600 Million Keywords
       • 120 Billion Links
       • 25,000 Billion Social Signals
       • 25 PB raw data
  • 4. Content Production in Real-time
       Authors submit content
       ✓ Rate the content's effectiveness
       ✓ Feedback to optimize and enrich it
  • 5. Beyond keywords
       • Keyword
           • Typos
           • Ambiguous
           • Sparse
       • Entity
           • Augmented with metadata
           • Relations among entities
  • 6. Entity
       Q64
  • 8. Knowledge Base (KB)
       http://brendangriffen.com/blog/gow-programming-languages
  • 9. KB Timeline
       [Figure: timeline of knowledge bases. Years shown: 2001, 2005, 2008, 2012, 2012, 2014. One entry is labeled "Knowledge vaults".]
  • 10. Why Wikidata
       • Free collaborative KB
       • Continuous evolution
       • Open multilingual data
       • Mapping to other KBs
  • 11. Link content to KB
       • Entity Linking: free text to entities
           • Blog posts
           • Tweets
           • Keywords
           • User-generated content
       • Entities from a knowledge base
           • Wikipedia
           • Wikidata
           • Domain-specific KBs
  • 12. Entity Linking
       Image from Milne and Witten (2008b). Learning to Link with Wikipedia. In CIKM 2008.
  • 13. Main Problems
       • Identify important keywords to link in the text
       • Link to the right entity
  • 14. Search keyword mentions in text
       • Dictionary of keywords to KB entities
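
The dictionary lookup on this slide could be sketched roughly as follows. This is a minimal illustration, not the system described in the talk: `KEYWORD_TO_ENTITY`, `find_mentions`, the sample entries, and the greedy longest-match-first scan are all assumptions made for the example.

```python
import re

# Hypothetical keyword -> KB entity dictionary (illustrative entries only).
KEYWORD_TO_ENTITY = {
    "berlin": "Berlin",
    "germany": "Germany",
    "european union": "European_Union",
}


def find_mentions(text, dictionary, max_len=3):
    """Return (start_token, end_token, keyword, entity) for every dictionary hit.

    Scans word n-grams of up to max_len tokens, preferring the longest match
    at each position (an assumed strategy, not taken from the slides).
    """
    tokens = re.findall(r"\w+", text.lower())
    mentions = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                mentions.append((i, i + n, candidate, dictionary[candidate]))
                i += n  # skip past the matched span
                break
        else:
            i += 1  # no keyword starts at this token
    return mentions


mentions = find_mentions(
    "Berlin is the capital of Germany, in the European Union.",
    KEYWORD_TO_ENTITY,
)
```

Preferring the longest match keeps "european union" from being split into shorter keywords; a production dictionary with hundreds of millions of keywords would likely need a trie or similar index instead of repeated string joins.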
  • 15. Keyword to wiki URIs in top SERP
  • 16. Not all keywords are useful
       Keyword Cleaning:
       • Navigational or factual words
       • Infrequent words
       • Non-Latin letters
  • 17. Not all keywords are useful
       Keyword Filtering:
       • Starting or ending tokens
       • Stopwords
       • Part-of-speech tags
       • Wikipedia popularity: popular wiki URIs for one keyword
       • Search popularity: popular keywords for one wiki URI
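
One possible reading of the token-based filters above, as a small sketch: reject a keyword when its first or last token is a stopword. The stopword list and the function name `keep_keyword` are hypothetical, and the part-of-speech and popularity filters are not covered here.

```python
# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def keep_keyword(keyword, stopwords=STOPWORDS):
    """Keep a keyword only if it neither starts nor ends with a stopword."""
    tokens = keyword.lower().split()
    if not tokens:
        return False
    return tokens[0] not in stopwords and tokens[-1] not in stopwords


candidates = ["germany", "the germany", "capital of", "capital of germany"]
kept = [kw for kw in candidates if keep_keyword(kw)]
```

Stopwords in the middle of a keyword ("capital of germany") are left alone in this sketch, since only the boundary tokens are checked.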
  • 18. Search Popularity Filtering
       Keyword                 Search Popularity (Volume)
       germany                 268583
       germany facts           4291
       germany article         24
       german encyclopedia     23
       germany encyclopedia    19
       germany t               18
       ger many                16
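
A minimal sketch of how such a filter might work, assuming the table lists keywords that all map to one wiki URI. The volumes are the ones on the slide; the relative threshold `min_share` and the function name `popular_keywords` are assumptions, since the slides do not state the actual cut-off.

```python
# Search volumes from the slide above.
VOLUMES = {
    "germany": 268583,
    "germany facts": 4291,
    "germany article": 24,
    "german encyclopedia": 23,
    "germany encyclopedia": 19,
    "germany t": 18,
    "ger many": 16,
}


def popular_keywords(volumes, min_share=0.01):
    """Keep keywords whose volume is at least min_share of the top volume."""
    if not volumes:
        return []
    top = max(volumes.values())
    ranked = sorted(volumes.items(), key=lambda kv: kv[1], reverse=True)
    return [kw for kw, volume in ranked if volume >= min_share * top]


kept = popular_keywords(VOLUMES)
```

With the assumed 1% threshold, only "germany" and "germany facts" survive, which drops the low-volume variants such as "germany t" and "ger many".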
  • 19. Entity data: parse the Wikidata dump and extract entities as JSON
       {
         entity: "Berlin",
         Freebase Id: "/m/0156q",
         OpenStreetMap Relation identifier: 62422,
         alias: ["Berlin, Germany"],
         capital of: [
           "Germany", "Kingdom of Prussia", "Weimar Republic",
           "Brandenburg-Prussia", "Free State of Prussia", ...
         ],
         contains administrative territorial entity: [
           "Mitte", "Friedrichshain-Kreuzberg", "Pankow",
           "Charlottenburg-Wilmersdorf", "Spandau", "Steglitz-Zehlendorf",
           "Tempelhof-Schöneberg", "Neukölln", "Treptow-Köpenick", ...
         ],
         coordinate location: [
           {
             altitude: null,
             latitude: 52.516666666667,
             longitude: 13.383333333333,
             precision: 0.016666666666667
           }
         ],
         country: "Germany",
         ... ...
       }
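
A rough sketch of this extraction step, assuming the standard Wikidata JSON dump layout (a JSON array with one entity per line). The sample records are heavily trimmed, the helper names `iter_entities` and `simplify` are hypothetical, and unlike the slide's output this sketch leaves property and item IDs (for example P17, Q183) unresolved rather than mapping them to labels.

```python
import json

# Two trimmed, hand-written records in the shape of Wikidata dump entities.
berlin = {
    "id": "Q64",
    "labels": {"en": {"language": "en", "value": "Berlin"}},
    "aliases": {"en": [{"language": "en", "value": "Berlin, Germany"}]},
    "claims": {
        "P17": [{"mainsnak": {"snaktype": "value", "property": "P17",
                              "datavalue": {"type": "wikibase-entityid",
                                            "value": {"id": "Q183"}}}}]
    },
}
germany = {"id": "Q183",
           "labels": {"en": {"language": "en", "value": "Germany"}}}

# The dump is one JSON array, one entity per line, with trailing commas.
sample_lines = ["[", json.dumps(berlin) + ",", json.dumps(germany), "]"]


def iter_entities(lines):
    """Yield one entity dict per dump line, skipping the array brackets."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        yield json.loads(line)


def simplify(entity, lang="en"):
    """Flatten an entity into label, aliases, and property -> values."""
    claims = {}
    for prop, statements in entity.get("claims", {}).items():
        values = []
        for statement in statements:
            snak = statement.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue  # "novalue" / "somevalue" snaks carry no datavalue
            value = snak["datavalue"]["value"]
            is_item = isinstance(value, dict) and "id" in value
            values.append(value["id"] if is_item else value)
        claims[prop] = values
    return {
        "id": entity["id"],
        "entity": entity.get("labels", {}).get(lang, {}).get("value"),
        "alias": [a["value"] for a in entity.get("aliases", {}).get(lang, [])],
        "claims": claims,
    }


entities = [simplify(e) for e in iter_entities(sample_lines)]
```

Reading line by line avoids loading the whole dump into memory; a real run would stream the compressed file instead of an in-memory list.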
  • 20. Word Sense Disambiguation
       Link to the right Wikipedia entity
  • 21. Link to Most Common Entities
       Commonness (Milne and Witten 2008b):
       $\mathrm{Commonness}(w, e) = \frac{L_{w,e}}{\sum_{i} L_{w,e_i}}$, where $L_{w,e}$ is the number of links with surface text $w$ to entity $e$.
       Surface text "tree":
       Wikipedia Entity         Commonness
       Tree                     92.82%
       Tree (graph theory)      2.94%
       Tree (data structure)    2.57%
       Tree (set theory)        0.15%
       Phylogenetic tree        0.07%
       Christmas tree           0.07%
       Binary tree              0.04%
       Family tree              0.04%
       …                        ...
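
The commonness score can be illustrated with a few lines of code. This is only a sketch: the link counts below are made up for the example (they are not the counts behind the percentages on the slide), and `commonness` is a hypothetical helper name.

```python
# Hypothetical link counts: surface text -> {entity: number of links}.
LINK_COUNTS = {
    "tree": {
        "Tree": 90,
        "Tree_(graph_theory)": 6,
        "Tree_(data_structure)": 4,
    }
}


def commonness(link_counts, surface):
    """Share of links with this surface text that point to each entity."""
    counts = link_counts.get(surface, {})
    total = sum(counts.values())
    if total == 0:
        return {}
    return {entity: n / total for entity, n in counts.items()}


scores = commonness(LINK_COUNTS, "tree")
most_common = max(scores, key=scores.get)
```

Always picking `most_common` is the baseline this slide describes; it ignores context entirely, which is why a text about algorithms would still be linked to the plant.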
  • 22. Disambiguation
       https://en.wikipedia.org/wiki/Tree_data_structure
       https://en.wikipedia.org/wiki/Tree
  • 23. Disambiguation using context
  • 24. Entity Disambiguation
       • Build a Word2Vec model for Wikipedia entities
       • Calculate Word2Vec similarity to contextual entities
       $\sum_{\text{context}} \mathrm{similarity}(\text{Tree\_data\_structure})$ vs. $\sum_{\text{context}} \mathrm{similarity}(\text{Tree})$
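
A minimal sketch of context-based disambiguation, assuming each entity already has an embedding vector. The three-dimensional vectors below are invented for illustration (a trained Word2Vec model would supply real ones), and the helper names are hypothetical.

```python
import math

# Toy entity vectors, made up for this example.
VECTORS = {
    "Tree": (1.0, 0.1, 0.0),
    "Tree_data_structure": (0.1, 1.0, 0.2),
    "Algorithm": (0.0, 0.9, 0.3),
    "Linked_list": (0.1, 0.8, 0.4),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def disambiguate(candidates, context, vectors):
    """Pick the candidate with the highest summed similarity to the context."""
    def score(candidate):
        return sum(cosine(vectors[candidate], vectors[c]) for c in context)
    return max(candidates, key=score)


best = disambiguate(
    candidates=["Tree", "Tree_data_structure"],
    context=["Algorithm", "Linked_list"],
    vectors=VECTORS,
)
```

With computing-related context entities, the data-structure sense wins even though the plant sense has far higher commonness.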
  • 25. Relatedness between Entities
  • 26. Entity Relatedness
       Image from Milne and Witten (2008a). An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links.
  • 27. Relatedness Score
       • Jaccard similarity: $\frac{\text{Intersection of links to entity } e_1 \text{ and } e_2}{\text{Union of links to entity } e_1 \text{ and } e_2}$
       • Word2Vec similarity of entity to context
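
The Jaccard part of the relatedness score might look like the sketch below. The link sets are illustrative only (the Berlin entries are taken from the inlink example later in the deck, the Potsdam set is invented), and `jaccard` is a hypothetical helper name.

```python
def jaccard(links_a, links_b):
    """Size of the intersection of two link sets divided by their union."""
    a, b = set(links_a), set(links_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Illustrative sets of pages linking to each entity.
inlinks_berlin = {"Germany", "Prussia", "Berlin_Wall", "Albert_Einstein"}
inlinks_potsdam = {"Germany", "Prussia", "Brandenburg"}

relatedness = jaccard(inlinks_berlin, inlinks_potsdam)
```

Here two of the five distinct linking pages are shared, giving a score of 0.4; entities with no links at all are treated as unrelated rather than raising a division error.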
  • 28. Wikipedia Data Parsing
  • 29. Wikipedia Dump
       '''Berlin''' is the [[Capital city|capital]] of [[Germany]] and one of its 16 [[states of Germany|states]]. With a population of approximately 3.5 million people,<ref name="Population" /> Berlin is the second [[Largest cities of the European Union by population within city limits|most populous city proper]] and the seventh [[Largest urban areas of the European Union|most populous urban area]] in the [[European Union]].
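
Internal links in wikitext like the sample above can be pulled out with a regular expression. This is a simplified sketch rather than the parser used in the talk: `LINK_RE` and `extract_links` are hypothetical names, and namespaces (File:, Category:), section anchors, and nested templates are not handled.

```python
import re

# Matches [[Target]] and [[Target|anchor text]].
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_links(wikitext):
    """Return (uri, anchor_text) pairs for each internal link."""
    links = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        if not target:
            continue
        anchor = (match.group(2) or target).strip()
        # Wikipedia titles capitalize the first letter and use underscores.
        uri = (target[0].upper() + target[1:]).replace(" ", "_")
        links.append((uri, anchor))
    return links


sample = ("'''Berlin''' is the [[Capital city|capital]] of [[Germany]] "
          "and one of its 16 [[states of Germany|states]].")
links = extract_links(sample)
```

Each pair records which surface text was used to point at which article, which is the raw material for both the link vectors and the popularity counts.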
  • 30. Wikipedia Article as JSON
  • 31. Word2Vec Training
       • Collection of plain article text
       ... ... can4linux ||open_source|| ||controller_area_network|| ||linux_kernel|| ||device_driver|| development started 1990s philips 82c200 controller stand chip 1995 version created bus linux laboratory automation project linux lab project ||freie_universität_berlin|| nxp sja1000 successor supported controller philips 82c200 intel 82527 development powerful ||microcontroller||s integrated controllers capable ... ...
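
A sketch of how wikitext might be turned into a token stream like the one above, with each link replaced by a single `||entity||` token. The function name and regular expression are assumptions, and the stopword removal visible in the slide's sample is not reproduced here.

```python
import re

# Matches [[Target]] and [[Target|anchor text]].
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def to_training_tokens(wikitext):
    """Lowercase the text and replace each link with one ||entity|| token."""
    def entity_token(match):
        name = match.group(1).strip().lower().replace(" ", "_")
        return " ||" + name + "|| "

    text = LINK_RE.sub(entity_token, wikitext)
    tokens = []
    for raw in text.split():
        if raw.startswith("||") and raw.endswith("||"):
            tokens.append(raw)  # keep entity tokens intact
        else:
            tokens.extend(re.findall(r"\w+", raw.lower()))
    return tokens


tokens = to_training_tokens(
    "[[Berlin]] is the [[Capital city|capital]] of [[Germany]]."
)
```

Token lists of this shape, one per article or sentence, are the kind of input that the gensim Word2Vec implementation listed in the resources accepts, so entities and ordinary words end up in the same vector space.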
  • 32. Linking vectors
       • Pairs of URI, annotations
       outlink vector: [Capital_City, Germany, States_of_Germany, European_Union, Spree, Havel, Berlin-Brandenburg_Metropolitan_Region, ...]
       inlink vector: [Germany, Prussia, Berlin_Wall, Albert_Einstein, Kosmos_(Berlin), Berlin_International_Film_Festival, ...]
  • 33. Wikipedia Popularity
       • Aggregation of annotations
       Surface text     Wiki entity      Popularity
       United States    United_States    174338
       World War II     World_War_II     106483
       India            India            95966
       France           France           94666
       American         United_States    85976
       Iran             Iran             83249
       Australia        Australia        76655
       Germany          Germany          76384
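
The aggregation behind such a table could be as simple as counting (surface text, entity) pairs over all annotations. A minimal sketch with a handful of invented annotations; the variable names are hypothetical and the counts bear no relation to the figures on the slide.

```python
from collections import Counter

# Invented (surface text, wiki entity) annotations from parsed articles.
annotations = [
    ("United States", "United_States"),
    ("American", "United_States"),
    ("United States", "United_States"),
    ("Germany", "Germany"),
    ("United States", "United_States"),
    ("American", "United_States"),
]

# Popularity of each (surface text, entity) pair.
pair_popularity = Counter(annotations)

# Total popularity per entity, across all of its surface texts.
entity_popularity = Counter(entity for _, entity in annotations)

top_pair, top_count = pair_popularity.most_common(1)[0]
```

Keeping the pair-level counts, rather than only entity totals, preserves the information that "American" and "United States" both resolve to the same entity with different frequencies.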
  • 34. Overall System
       [Diagram components: Keyword Database, Keyword Processing, Parser, User Content, Keyword Matching, Disambiguation, Relatedness calculation, Result, Wikipedia Popularity, Entity Linking API, Wiki Parser, W2V Model, Wiki Links, Keyword to KB entities]
  • 35. Resources
       • https://github.com/piskvorky/gensim
       • https://github.com/jodaiber/Annotated-WikiExtractor
       • https://dumps.wikimedia.org/
       • https://dumps.wikimedia.org/wikidatawiki/entities/
  • 36. Thank you
  • 37. Questions?
       f.xu@searchmetrics.com
       We are hiring