SlideShare a Scribd company logo
Build a Searchable
Knowledge Base
Jimmy Lai
Yahoo! Search Engineer
r97922028 [at] ntu.edu.tw
2014/05/18
http://www.slideshare.net/jimmy_lai/build-a-searchable-knowledge-base
Outline
• Introduction to Knowledge Base
• Construct a Knowledge Base
• Search the Knowledge Base
• string match
• synonym search
• full text search
• geo search
• put all together
• More Applications
2
Knowledge
• Knowledge is power. - Francis Bacon, 1597
• Knowledge is boundless and connected. So, an
efficient interface to search and browse the
knowledge base is essential.
• Let’s try to build a searchable knowledge base.
3
Application of Knowledge
Base
Personal assistant: Siri, Google now
!
!
Search engine: Google’s knowledge graph
4
Construct a Knowledge
Base
1. Find good data sources.
2. Aggregate data as knowledge entity.
3. Construct structured data of knowledge entity.
4. Search the knowledge base.
5. Navigate the knowledge base.
5
Wikipedia
• A collaborated encyclopedia with more than 30M
articles over 287 languages.
!
!
!
• A good source of knowledge base. However the
data of Wikipedia is not well-structured.
6
http://www.theguardian.com/technology/blog/2009/aug/13/wikipedia-edits
DBpedia
• http://wiki.dbpedia.org/About
• Structured data from Wikipedia.
• A good data source for a knowledge base.
7
8
Knowledge
Entity
9
Identifier
Abstract
Relations
What can Python do for us
• Data Wrangling
• Process the raw text data
• Aggregate the data from different sources
• Output data as json format
• Connecting the Data flow between systems
• Automation script for starting services and
feeding data
• REST API implementing search strategy
10
Example code
git clone git@github.com:jimmylai/knowledge.git!
https://github.com/jimmylai/knowledge!
• required python packages:
1. fabric
2. pysolr
3. django
11
Data Preparation
1. Download data from DBpedia 

http://downloads.dbpedia.org/current/en/
2. Filter out some specific knowledge entity
zcat instance_types_en.nt.bz2 | get_id_list.py

3. Parse and aggregate data entity from files.
12
data file script data field
short_abstracts_en.nt.bz2 get_abstract.py abstract
raw_infobox_properties_en.nt.bz2 get_relation.py relations
geo_coordinates_en.nt.bz2 get_geo.py latlon
redirects_en.nt.bz2 get_redirect.py redirects
Aggregated Data Format
"http://dbpedia.org/resource/Lake_Yosemite": {
"latlon": "37.376389,-120.428889",
"redirects": [
"Lake_yosemite"
],
"abstract": "Lake Yosemite is an artificial freshwater lake located approximately
five miles (8 km) east of Merced, California in the rolling Sierra Foothills. UC Merced
is situated approximately half a mile (0.8 km) south of Lake Yosemite. The university
is bounded by the lake on one side and two canals (Fairfield Canal and Le Grand
Canal) run through the campus. In 2007, a myth featured in the Mythbusters' James
Bond Special 1 episode was filmed and tested at Lake Yosemite.",
"relations": {
"type": "http://dbpedia.org/resource/Reservoir",
"location": "http://dbpedia.org/resource/California"
}
}
13
Search by
• Solr is a full-text, real-time search engine based on Apache
lucene.
• Provides REST-like API.
• pysolr make the use of Solr easily.
• Download the latest version 4.8.0 from
http://www.apache.org/dyn/closer.cgi/lucene/solr/4.8.0
and extract to solr/solr-4.8.0 dir
• Start Solr server and then check the web UI
fab start_solr

http://localhost:8983/solr/
14
Search - String Match
• To be able to search by entity name
python feed_data.py string_match

• config: solr/conf/string_match/schema.xml
<field name="name" type="string" indexed="true" stored="true"
multiValued="false"/>
<field name="abstract" type="string" indexed="false" stored="true"
multiValued="false"/>
• Feed the entities to Solr. Each entity with name and
abstract fields.
15
Search - String Match
16
http://localhost:8983/solr/string_match/select?q=name%3A%22San+Francisco
%22&wt=json&indent=true
Search by entity name.
Search - Synonym
• To be able to search by synonym of entity name
python feed_data.py synonym_string_match

• config: solr/conf/synonym_string_match/schema.xml
<field name="name" type=“name_text" indexed="true" stored="true" multiValued="false"/>
!
<fieldType name="name_text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
…
• Restart Solr server and the synonym file will be reloaded.
17
Synonym handling at index
time
18
Synonym handling at query
time
19
Search - Synonym
20
Search by synonym.
Search - Full Text Search
• To be able to search by entity name
python feed_data.py full_text_search

• config: solr/conf/full_text_search/schema.xml
<copyField source="name" dest="text"/>
<copyField source="abstract" dest=“text"/>
!
• Feed the entities to Solr. Each name and abstract
field will be copied to the text field. After that we
can do full text search without specify field to
search.
21
Search - Full Text Search
22
Search - Geo Search
• To be able to search by distance given a location
python feed_data.py geo_search

• config: solr/conf/geo_search/schema.xml
<field name="location" type="location" indexed="true" stored="true"
required="false" multiValued="false" />
• Feed the entities to Solr. Each entity contains a location
field and the format is like "51.670100,-3.230100".
23
24
Given condition on distance
Search - Put All Together
• Search Strategy
1. Input a query
2. Search by synonym match
3. Search by full text
1. If input a location, filter the result by geo
search
• Implement the search strategy as an API
25
Implement the search
strategy in a Django view
26
27
Review
• A Knowledge Base with synonym, full-text and geo
search API.
• The knowledge entities are connected by relation.
28
More Applications
• Question answering system:
1.Query analysis: identify the intension (e.g. looking
for specific type of entity)
2.Search in the knowledge base
3.Return the knowledge entity
29
The modern search engine don’t just provide web page urls. They provide the
direct answer to users.
30
More Data Sources and
Knowledge Entities
• Open Data
!
!
!
• Open APIs
31
My Life in
• Build online services for billions of users.
• Big data mining on cloud infrastructures.
• Open and Innovative working environment.
• International teamwork and English communication.
• Business trips to Silicon Valley.
• Send me your resume if you need a referral.
r97922028 [at] ntu.edu.tw
32

More Related Content

What's hot

Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
vinay arora
 
Comparing Search Engines
Comparing Search EnginesComparing Search Engines
Comparing Search Engines
Melissa Brisbin
 
Search engine ppt
Search engine pptSearch engine ppt
Search engine ppt
Sarvesh Meena
 
Working of search engine
Working of search engineWorking of search engine
Working of search engine
Nikhil Deswal
 
Search engine ppt
Search engine pptSearch engine ppt
Search engine ppt
Polara Mayur
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
ishmecse13
 
How Internet Search Engines Work
How Internet Search Engines WorkHow Internet Search Engines Work
How Internet Search Engines Works1140008
 
Search Engine
Search EngineSearch Engine
Search Engine
Coky Fauzi Alfi
 
Smart Searching
Smart SearchingSmart Searching
Search engines powerpoint
Search engines powerpointSearch engines powerpoint
Search engines powerpoint
vbaker2210
 
Training Project Report on Search Engines
Training Project Report on Search EnginesTraining Project Report on Search Engines
Training Project Report on Search Engines
Shivam Saxena
 
Surfing the internet
Surfing the internetSurfing the internet
Surfing the internet
Eveferro
 
Working Of Search Engine
Working Of Search EngineWorking Of Search Engine
Working Of Search EngineNIKHIL NAIR
 
Search Engines
Search EnginesSearch Engines
Search Engines
Chidanand Byahatti
 
Search Engine
Search EngineSearch Engine
Search Engine
Ankush Srivastava
 
Searching the Internet
Searching the Internet Searching the Internet
Searching the Internet guest32ae6
 

What's hot (20)

Search Engines
Search EnginesSearch Engines
Search Engines
 
Web Search Engine
Web Search EngineWeb Search Engine
Web Search Engine
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
Comparing Search Engines
Comparing Search EnginesComparing Search Engines
Comparing Search Engines
 
Search engine ppt
Search engine pptSearch engine ppt
Search engine ppt
 
Working of search engine
Working of search engineWorking of search engine
Working of search engine
 
Search engine ppt
Search engine pptSearch engine ppt
Search engine ppt
 
Search engine and web crawler
Search engine and web crawlerSearch engine and web crawler
Search engine and web crawler
 
How Internet Search Engines Work
How Internet Search Engines WorkHow Internet Search Engines Work
How Internet Search Engines Work
 
Search Engine
Search EngineSearch Engine
Search Engine
 
Smart Searching
Smart SearchingSmart Searching
Smart Searching
 
Search engines
Search enginesSearch engines
Search engines
 
Search engines powerpoint
Search engines powerpointSearch engines powerpoint
Search engines powerpoint
 
Search engine
Search engineSearch engine
Search engine
 
Training Project Report on Search Engines
Training Project Report on Search EnginesTraining Project Report on Search Engines
Training Project Report on Search Engines
 
Surfing the internet
Surfing the internetSurfing the internet
Surfing the internet
 
Working Of Search Engine
Working Of Search EngineWorking Of Search Engine
Working Of Search Engine
 
Search Engines
Search EnginesSearch Engines
Search Engines
 
Search Engine
Search EngineSearch Engine
Search Engine
 
Searching the Internet
Searching the Internet Searching the Internet
Searching the Internet
 

Similar to Build a Searchable Knowledge Base

Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Trey Grainger
 
Deep Web
Deep WebDeep Web
Deep WebSt John
 
WTF is Semantic Web?
WTF is Semantic Web?WTF is Semantic Web?
WTF is Semantic Web?
milesw
 
(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages
Michael Nelson
 
Google Dorks
Google DorksGoogle Dorks
Google Dorks
Adhoura Academy
 
05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx
Gambari Amosa Isiaka
 
Semantic Web: introduction & overview
Semantic Web: introduction & overviewSemantic Web: introduction & overview
Semantic Web: introduction & overview
Amit Sheth
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Anja Jentzsch
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Lesson 2 network and the internet
Lesson 2 network and the internetLesson 2 network and the internet
Lesson 2 network and the internet
Maria Theresa
 
Synchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web PagesSynchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web Pages
Michael Nelson
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
Rahul Jain
 
20130310 solr tuorial
20130310 solr tuorial20130310 solr tuorial
20130310 solr tuorialChris Huang
 
Exploring the Semantic Web
Exploring the Semantic WebExploring the Semantic Web
Exploring the Semantic Web
Roberto García
 
Where's the Data?
Where's the Data?Where's the Data?
Where's the Data?
Andrea Payant
 
Linked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve MeyerLinked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve Meyer
WiLS
 
Importing life science at a into Neo4j
Importing life science at a into Neo4jImporting life science at a into Neo4j
Importing life science at a into Neo4j
Simon Jupp
 
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar   Intro to Linked Data and SemanticsINSPIRE Hackathon Webinar   Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
plan4all
 

Similar to Build a Searchable Knowledge Base (20)

Scaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solrScaling Recommendations, Semantic Search, & Data Analytics with solr
Scaling Recommendations, Semantic Search, & Data Analytics with solr
 
Deep Web
Deep WebDeep Web
Deep Web
 
WTF is Semantic Web?
WTF is Semantic Web?WTF is Semantic Web?
WTF is Semantic Web?
 
(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages(Re-)Discovering Lost Web Pages
(Re-)Discovering Lost Web Pages
 
Google Dorks
Google DorksGoogle Dorks
Google Dorks
 
05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx
 
Deep web
Deep webDeep web
Deep web
 
Semantic Web: introduction & overview
Semantic Web: introduction & overviewSemantic Web: introduction & overview
Semantic Web: introduction & overview
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Lesson 2 network and the internet
Lesson 2 network and the internetLesson 2 network and the internet
Lesson 2 network and the internet
 
Synchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web PagesSynchronicity: Just-In-Time Discovery of Lost Web Pages
Synchronicity: Just-In-Time Discovery of Lost Web Pages
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
20130310 solr tuorial
20130310 solr tuorial20130310 solr tuorial
20130310 solr tuorial
 
Exploring the Semantic Web
Exploring the Semantic WebExploring the Semantic Web
Exploring the Semantic Web
 
Where's the Data?
Where's the Data?Where's the Data?
Where's the Data?
 
Linked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve MeyerLinked Data and Discovery with Steve Meyer
Linked Data and Discovery with Steve Meyer
 
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
 NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti... NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
 
Importing life science at a into Neo4j
Importing life science at a into Neo4jImporting life science at a into Neo4j
Importing life science at a into Neo4j
 
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar   Intro to Linked Data and SemanticsINSPIRE Hackathon Webinar   Intro to Linked Data and Semantics
INSPIRE Hackathon Webinar Intro to Linked Data and Semantics
 

More from Jimmy Lai

Python Linters at Scale.pdf
Python Linters at Scale.pdfPython Linters at Scale.pdf
Python Linters at Scale.pdf
Jimmy Lai
 
EuroPython 2022 - Automated Refactoring Large Python Codebases
EuroPython 2022 - Automated Refactoring Large Python CodebasesEuroPython 2022 - Automated Refactoring Large Python Codebases
EuroPython 2022 - Automated Refactoring Large Python Codebases
Jimmy Lai
 
Annotate types in large codebase with automated refactoring
Annotate types in large codebase with automated refactoringAnnotate types in large codebase with automated refactoring
Annotate types in large codebase with automated refactoring
Jimmy Lai
 
The journey of asyncio adoption in instagram
The journey of asyncio adoption in instagramThe journey of asyncio adoption in instagram
The journey of asyncio adoption in instagram
Jimmy Lai
 
Data Analyst Nanodegree
Data Analyst NanodegreeData Analyst Nanodegree
Data Analyst NanodegreeJimmy Lai
 
Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...
Jimmy Lai
 
Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...
Jimmy Lai
 
[LDSP] Solr Usage
[LDSP] Solr Usage[LDSP] Solr Usage
[LDSP] Solr Usage
Jimmy Lai
 
[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping
Jimmy Lai
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learn
Jimmy Lai
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
Jimmy Lai
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Jimmy Lai
 
Software development practices in python
Software development practices in pythonSoftware development practices in python
Software development practices in python
Jimmy Lai
 
Fast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookFast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython Notebook
Jimmy Lai
 
Documentation with sphinx @ PyHug
Documentation with sphinx @ PyHugDocumentation with sphinx @ PyHug
Documentation with sphinx @ PyHug
Jimmy Lai
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languages
Jimmy Lai
 
NetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugNetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugJimmy Lai
 
When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012
Jimmy Lai
 
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Nltk  natural language toolkit overview and application @ PyCon.tw 2012Nltk  natural language toolkit overview and application @ PyCon.tw 2012
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Jimmy Lai
 
Nltk natural language toolkit overview and application @ PyHug
Nltk  natural language toolkit overview and application @ PyHugNltk  natural language toolkit overview and application @ PyHug
Nltk natural language toolkit overview and application @ PyHug
Jimmy Lai
 

More from Jimmy Lai (20)

Python Linters at Scale.pdf
Python Linters at Scale.pdfPython Linters at Scale.pdf
Python Linters at Scale.pdf
 
EuroPython 2022 - Automated Refactoring Large Python Codebases
EuroPython 2022 - Automated Refactoring Large Python CodebasesEuroPython 2022 - Automated Refactoring Large Python Codebases
EuroPython 2022 - Automated Refactoring Large Python Codebases
 
Annotate types in large codebase with automated refactoring
Annotate types in large codebase with automated refactoringAnnotate types in large codebase with automated refactoring
Annotate types in large codebase with automated refactoring
 
The journey of asyncio adoption in instagram
The journey of asyncio adoption in instagramThe journey of asyncio adoption in instagram
The journey of asyncio adoption in instagram
 
Data Analyst Nanodegree
Data Analyst NanodegreeData Analyst Nanodegree
Data Analyst Nanodegree
 
Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...
 
Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...Continuous Delivery: automated testing, continuous integration and continuous...
Continuous Delivery: automated testing, continuous integration and continuous...
 
[LDSP] Solr Usage
[LDSP] Solr Usage[LDSP] Solr Usage
[LDSP] Solr Usage
 
[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learn
 
Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013Big data analysis in python @ PyCon.tw 2013
Big data analysis in python @ PyCon.tw 2013
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
 
Software development practices in python
Software development practices in pythonSoftware development practices in python
Software development practices in python
 
Fast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookFast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython Notebook
 
Documentation with sphinx @ PyHug
Documentation with sphinx @ PyHugDocumentation with sphinx @ PyHug
Documentation with sphinx @ PyHug
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languages
 
NetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugNetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHug
 
When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012
 
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Nltk  natural language toolkit overview and application @ PyCon.tw 2012Nltk  natural language toolkit overview and application @ PyCon.tw 2012
Nltk natural language toolkit overview and application @ PyCon.tw 2012
 
Nltk natural language toolkit overview and application @ PyHug
Nltk  natural language toolkit overview and application @ PyHugNltk  natural language toolkit overview and application @ PyHug
Nltk natural language toolkit overview and application @ PyHug
 

Recently uploaded

guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...
Rogerio Filho
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
Gal Baras
 
Comptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guideComptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guide
GTProductions1
 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
laozhuseo02
 
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
ufdana
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Sanjeev Rampal
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Brad Spiegel Macon GA
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
3ipehhoa
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
laozhuseo02
 
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
keoku
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
Arif0071
 
This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!
nirahealhty
 
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdfJAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
Javier Lasa
 
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
eutxy
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
natyesu
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
JeyaPerumal1
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
3ipehhoa
 
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
3ipehhoa
 
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptxInternet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
VivekSinghShekhawat2
 

Recently uploaded (20)

guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...guildmasters guide to ravnica Dungeons & Dragons 5...
guildmasters guide to ravnica Dungeons & Dragons 5...
 
How to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptxHow to Use Contact Form 7 Like a Pro.pptx
How to Use Contact Form 7 Like a Pro.pptx
 
Comptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guideComptia N+ Standard Networking lesson guide
Comptia N+ Standard Networking lesson guide
 
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shopHistory+of+E-commerce+Development+in+China-www.cfye-commerce.shop
History+of+E-commerce+Development+in+China-www.cfye-commerce.shop
 
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
一比一原版(CSU毕业证)加利福尼亚州立大学毕业证成绩单专业办理
 
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesMulti-cluster Kubernetes Networking- Patterns, Projects and Guidelines
Multi-cluster Kubernetes Networking- Patterns, Projects and Guidelines
 
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx
 
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
1比1复刻(bath毕业证书)英国巴斯大学毕业证学位证原版一模一样
 
The+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptxThe+Prospects+of+E-Commerce+in+China.pptx
The+Prospects+of+E-Commerce+in+China.pptx
 
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
一比一原版(SLU毕业证)圣路易斯大学毕业证成绩单专业办理
 
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024
 
test test test test testtest test testtest test testtest test testtest test ...
test test  test test testtest test testtest test testtest test testtest test ...test test  test test testtest test testtest test testtest test testtest test ...
test test test test testtest test testtest test testtest test testtest test ...
 
This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!This 7-second Brain Wave Ritual Attracts Money To You.!
This 7-second Brain Wave Ritual Attracts Money To You.!
 
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdfJAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
JAVIER LASA-EXPERIENCIA digital 1986-2024.pdf
 
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
一比一原版(LBS毕业证)伦敦商学院毕业证成绩单专业办理
 
BASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptxBASIC C++ lecture NOTE C++ lecture 3.pptx
BASIC C++ lecture NOTE C++ lecture 3.pptx
 
1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...1.Wireless Communication System_Wireless communication is a broad term that i...
1.Wireless Communication System_Wireless communication is a broad term that i...
 
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
原版仿制(uob毕业证书)英国伯明翰大学毕业证本科学历证书原版一模一样
 
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
急速办(bedfordhire毕业证书)英国贝德福特大学毕业证成绩单原版一模一样
 
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptxInternet-Security-Safeguarding-Your-Digital-World (1).pptx
Internet-Security-Safeguarding-Your-Digital-World (1).pptx
 

Build a Searchable Knowledge Base

  • 1. Build a Searchable Knowledge Base Jimmy Lai Yahoo! Search Engineer r97922028 [at] ntu.edu.tw 2014/05/18 http://www.slideshare.net/jimmy_lai/build-a-searchable-knowledge-base
  • 2. Outline • Introduction to Knowledge Base • Construct a Knowledge Base • Search the Knowledge Base • string match • synonym search • full text search • geo search • put all together • More Applications 2
  • 3. Knowledge • Knowledge is power. - Francis Bacon, 1597 • Knowledge is boundless and connected. So, an efficient interface to search and browse the knowledge base is essential. • Let’s try to build a searchable knowledge base. 3
  • 4. Application of Knowledge Base Personal assistant: Siri, Google now ! ! Search engine: Google’s knowledge graph 4
  • 5. Construct a Knowledge Base 1. Find good data sources. 2. Aggregate data as knowledge entity. 3. Construct structured data of knowledge entity. 4. Search the knowledge base. 5. Navigate the knowledge base. 5
  • 6. Wikipedia • A collaborated encyclopedia with more than 30M articles over 287 languages. ! ! ! • A good source of knowledge base. However the data of Wikipedia is not well-structured. 6 http://www.theguardian.com/technology/blog/2009/aug/13/wikipedia-edits
  • 7. DBpedia • http://wiki.dbpedia.org/About • Structured data from Wikipedia. • A good data source for a knowledge base. 7
  • 8. 8
  • 10. What can Python do for us • Data Wrangling • Process the raw text data • Aggregate the data from different sources • Output data as json format • Connecting the Data flow between systems • Automation script for starting services and feeding data • REST API implementing search strategy 10
  • 11. Example code git clone git@github.com:jimmylai/knowledge.git! https://github.com/jimmylai/knowledge! • required python packages: 1. fabric 2. pysolr 3. django 11
  • 12. Data Preparation 1. Download data from DBpedia 
 http://downloads.dbpedia.org/current/en/ 2. Filter out some specific knowledge entity zcat instance_types_en.nt.bz2 | get_id_list.py 3. Parse and aggregate data entity from files. 12 data file script data field short_abstracts_en.nt.bz2 get_abstract.py abstract raw_infobox_properties_en.nt.bz2 get_relation.py relations geo_coordinates_en.nt.bz2 get_geo.py latlon redirects_en.nt.bz2 get_redirect.py redirects
  • 13. Aggregated Data Format "http://dbpedia.org/resource/Lake_Yosemite": { "latlon": "37.376389,-120.428889", "redirects": [ "Lake_yosemite" ], "abstract": "Lake Yosemite is an artificial freshwater lake located approximately five miles (8 km) east of Merced, California in the rolling Sierra Foothills. UC Merced is situated approximately half a mile (0.8 km) south of Lake Yosemite. The university is bounded by the lake on one side and two canals (Fairfield Canal and Le Grand Canal) run through the campus. In 2007, a myth featured in the Mythbusters' James Bond Special 1 episode was filmed and tested at Lake Yosemite.", "relations": { "type": "http://dbpedia.org/resource/Reservoir", "location": "http://dbpedia.org/resource/California" } } 13
  • 14. Search by • Solr is a full-text, real-time search engine based on Apache lucene. • Provides REST-like API. • pysolr make the use of Solr easily. • Download the latest version 4.8.0 from http://www.apache.org/dyn/closer.cgi/lucene/solr/4.8.0 and extract to solr/solr-4.8.0 dir • Start Solr server and then check the web UI fab start_solr http://localhost:8983/solr/ 14
  • 15. Search - String Match • To be able to search by entity name python feed_data.py string_match • config: solr/conf/string_match/schema.xml <field name="name" type="string" indexed="true" stored="true" multiValued="false"/> <field name="abstract" type="string" indexed="false" stored="true" multiValued="false"/> • Feed the entities to Solr. Each entity with name and abstract fields. 15
  • 16. Search - String Match 16 http://localhost:8983/solr/string_match/select?q=name%3A%22San+Francisco %22&wt=json&indent=true Search by entity name.
  • 17. Search - Synonym • To be able to search by synonym of entity name python feed_data.py synonym_string_match • config: solr/conf/synonym_string_match/schema.xml <field name="name" type=“name_text" indexed="true" stored="true" multiValued="false"/> ! <fieldType name="name_text" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> … • Restart Solr server and the synonym file will be reloaded. 17
  • 18. Synonym handling at index time 18
  • 19. Synonym handling at query time 19
  • 21. Search - Full Text Search • To be able to search by entity name python feed_data.py full_text_search • config: solr/conf/full_text_search/schema.xml <copyField source="name" dest="text"/> <copyField source="abstract" dest=“text"/> ! • Feed the entities to Solr. Each name and abstract field will be copied to the text field. After that we can do full text search without specify field to search. 21
  • 22. Search - Full Text Search 22
  • 23. Search - Geo Search • To be able to search by distance given a location python feed_data.py geo_search • config: solr/conf/geo_search/schema.xml <field name="location" type="location" indexed="true" stored="true" required="false" multiValued="false" /> • Feed the entities to Solr. Each entity contains a location field and the format is like "51.670100,-3.230100". 23
  • 25. Search - Put All Together • Search Strategy 1. Input a query 2. Search by synonym match 3. Search by full text 1. If input a location, filter the result by geo search • Implement the search strategy as an API 25
  • 26. Implement the search strategy in a Django view 26
  • 27. 27
  • 28. Review • A Knowledge Base with synonym, full-text and geo search API. • The knowledge entities are connected by relation. 28
  • 29. More Applications • Question answering system: 1.Query analysis: identify the intension (e.g. looking for specific type of entity) 2.Search in the knowledge base 3.Return the knowledge entity 29
  • 30. The modern search engine don’t just provide web page urls. They provide the direct answer to users. 30
  • 31. More Data Sources and Knowledge Entities • Open Data ! ! ! • Open APIs 31
  • 32. My Life in • Build online services for billions of users. • Big data mining on cloud infrastructures. • Open and Innovative working environment. • International teamwork and English communication. • Business trips to Silicon Valley. • Send me your resume if you need a referral. r97922028 [at] ntu.edu.tw 32