Build a Searchable Knowledge Base

Jimmy Lai
Yahoo! Search Engineer
r97922028 [at] ntu.edu.tw
2014/05/18
http://www.slideshare.net/jimmy_lai/build-a-searchable-knowledge-base

Outline
• Introduction to Knowledge Base
• Construct a Knowledge Base
• Search the Knowledge Base
  • string match
  • synonym search
  • full text search
  • geo search
  • put all together
• More Applications

Knowledge
• Knowledge is power. - Francis Bacon, 1597
• Knowledge is boundless and connected. So, an
efficient interface to search and browse the
knowledge base is essential.
• Let’s try to build a searchable knowledge base.

Application of Knowledge Base
Personal assistant: Siri, Google Now
Search engine: Google’s Knowledge Graph

Construct a Knowledge Base
1. Find good data sources.
2. Aggregate the data into knowledge entities.
3. Construct structured data for each knowledge entity.
4. Search the knowledge base.
5. Navigate the knowledge base.

Wikipedia
• A collaboratively edited encyclopedia with more than 30M
articles in 287 languages.
• A good source for a knowledge base. However, the
data in Wikipedia is not well structured.
http://www.theguardian.com/technology/blog/2009/aug/13/wikipedia-edits

DBpedia
• http://wiki.dbpedia.org/About
• Structured data from Wikipedia.
• A good data source for a knowledge base.

Knowledge Entity
• Identifier
• Abstract
• Relations

What can Python do for us?
• Data wrangling
  • Process the raw text data
  • Aggregate the data from different sources
  • Output the data in JSON format
• Connecting the data flow between systems
  • Automation scripts for starting services and
  feeding data
  • REST API implementing the search strategy

Example code
git clone git@github.com:jimmylai/knowledge.git
https://github.com/jimmylai/knowledge
• Required Python packages:
1. fabric
2. pysolr
3. django

Data Preparation
1. Download data from DBpedia:
http://downloads.dbpedia.org/current/en/
2. Filter for the specific knowledge entities of interest:
bzcat instance_types_en.nt.bz2 | get_id_list.py

3. Parse and aggregate the entity data from these files:

data file                          script            data field
short_abstracts_en.nt.bz2          get_abstract.py   abstract
raw_infobox_properties_en.nt.bz2   get_relation.py   relations
geo_coordinates_en.nt.bz2          get_geo.py        latlon
redirects_en.nt.bz2                get_redirect.py   redirects
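
The parsing scripts listed above are not shown in the deck. As a rough sketch of what such a script might do, assuming the DBpedia dumps are bzip2-compressed N-Triples files with one `<subject> <predicate> object .` statement per line, the following reads triples using only the standard library. The names `parse_triple` and `iter_triples` are hypothetical and are not taken from the repository.

```python
import bz2
import re

# One N-Triples statement: <subject> <predicate> object .
# The object is either another <IRI> or a quoted literal with an optional
# language tag (@en) or datatype (^^<...>). Escapes inside literals are
# kept as-is; this sketch does not unescape them.
TRIPLE_RE = re.compile(
    r'^<(?P<s>[^>]*)>\s+<(?P<p>[^>]*)>\s+'
    r'(?:<(?P<iri>[^>]*)>|"(?P<lit>(?:[^"\\]|\\.)*)"(?:@[\w-]+|\^\^<[^>]*>)?)'
    r'\s*\.\s*$'
)


def parse_triple(line):
    """Return (subject, predicate, object), or None for comments and bad lines."""
    match = TRIPLE_RE.match(line.strip())
    if match is None:
        return None
    obj = match.group("iri")
    if obj is None:
        obj = match.group("lit")
    return match.group("s"), match.group("p"), obj


def iter_triples(path):
    """Yield parsed triples from a .nt.bz2 dump, skipping unparsable lines."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            triple = parse_triple(line)
            if triple is not None:
                yield triple
```

A script such as get_abstract.py could then loop over `iter_triples("short_abstracts_en.nt.bz2")` and keep only the subjects selected in step 2.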

Aggregated Data Format
"http://dbpedia.org/resource/Lake_Yosemite": {
  "latlon": "37.376389,-120.428889",
  "redirects": [
    "Lake_yosemite"
  ],
  "abstract": "Lake Yosemite is an artificial freshwater lake located approximately five miles (8 km) east of Merced, California in the rolling Sierra Foothills. UC Merced is situated approximately half a mile (0.8 km) south of Lake Yosemite. The university is bounded by the lake on one side and two canals (Fairfield Canal and Le Grand Canal) run through the campus. In 2007, a myth featured in the Mythbusters' James Bond Special 1 episode was filmed and tested at Lake Yosemite.",
  "relations": {
    "type": "http://dbpedia.org/resource/Reservoir",
    "location": "http://dbpedia.org/resource/California"
  }
}
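
To illustrate how the per-file outputs could be merged into the format above, here is a minimal sketch. It assumes each parsing step yields `(entity_id, field, value)` records; that record shape and the function name `aggregate` are assumptions made for illustration, not the repository's actual interface.

```python
import json
from collections import defaultdict


def aggregate(records):
    """Merge (entity_id, field, value) records into one dict per entity.

    "redirects" accumulates a list, "relations" accumulates a dict built
    from (relation_name, target) pairs, and other fields keep the last value.
    """
    entities = defaultdict(dict)
    for entity_id, field, value in records:
        entity = entities[entity_id]
        if field == "redirects":
            entity.setdefault("redirects", []).append(value)
        elif field == "relations":
            name, target = value
            entity.setdefault("relations", {})[name] = target
        else:
            entity[field] = value
    return dict(entities)


LAKE = "http://dbpedia.org/resource/Lake_Yosemite"
records = [
    (LAKE, "latlon", "37.376389,-120.428889"),
    (LAKE, "redirects", "Lake_yosemite"),
    (LAKE, "relations", ("type", "http://dbpedia.org/resource/Reservoir")),
    (LAKE, "relations", ("location", "http://dbpedia.org/resource/California")),
]
entities = aggregate(records)
print(json.dumps(entities, indent=2, sort_keys=True))
```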

Search by Solr
• Solr is a full-text, real-time search engine based on Apache
Lucene.
• Provides a REST-like API.
• pysolr makes Solr easy to use from Python.
• Download version 4.8.0 (the latest at the time of the talk) from
http://www.apache.org/dyn/closer.cgi/lucene/solr/4.8.0
and extract it to the solr/solr-4.8.0 directory.
• Start the Solr server and then check the web UI
fab start_solr

http://localhost:8983/solr/

Search - String Match
• To be able to search by entity name
python feed_data.py string_match

• config: solr/conf/string_match/schema.xml
<field name="name" type="string" indexed="true" stored="true"
multiValued="false"/>
<field name="abstract" type="string" indexed="false" stored="true"
multiValued="false"/>
• Feed the entities to Solr. Each entity has name and
abstract fields.
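
The deck does not show feed_data.py, so the following is only a sketch of the feeding step. It assumes the core lives at http://localhost:8983/solr/string_match, that the entity name can be derived from the last segment of the DBpedia identifier, and that the core accepts JSON documents on its /update handler. The helper names are hypothetical. If the schema declares a unique key field, each document would need that field as well.

```python
import json
import urllib.request
from urllib.parse import unquote


def entity_name(entity_id):
    """Derive a display name from a DBpedia resource IRI (an assumption)."""
    return unquote(entity_id.rsplit("/", 1)[-1]).replace("_", " ")


def to_solr_docs(entities):
    """Build one Solr document per entity with name and abstract fields."""
    return [
        {"name": entity_name(entity_id), "abstract": data.get("abstract", "")}
        for entity_id, data in entities.items()
    ]


def post_docs(core_url, docs):
    """POST documents to the core's JSON update handler and commit."""
    request = urllib.request.Request(
        core_url + "/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


docs = to_solr_docs({
    "http://dbpedia.org/resource/Lake_Yosemite": {
        "abstract": "Lake Yosemite is an artificial freshwater lake.",
    },
})
# Requires a running Solr server, so it is left commented out here:
# post_docs("http://localhost:8983/solr/string_match", docs)
```

With pysolr, which the deck lists as a dependency, `pysolr.Solr(core_url).add(docs)` would play the role of `post_docs`.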

Search - String Match
http://localhost:8983/solr/string_match/select?q=name%3A%22San+Francisco%22&wt=json&indent=true
Search by entity name.
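
The query URL above can also be built programmatically rather than typed by hand. A small sketch using the standard library follows; `select_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def select_url(core_url, query, **params):
    """Build a Solr /select URL; extra keyword arguments become parameters."""
    all_params = {"q": query, "wt": "json", "indent": "true"}
    all_params.update(params)
    return core_url + "/select?" + urlencode(all_params)


url = select_url("http://localhost:8983/solr/string_match", 'name:"San Francisco"')
print(url)
```

`urlencode` takes care of escaping the colon, quotes and space in the phrase query.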

Search - Synonym
• To be able to search by synonym of entity name
python feed_data.py synonym_string_match

• config: solr/conf/synonym_string_match/schema.xml
<field name="name" type="name_text" indexed="true" stored="true" multiValued="false"/>
<fieldType name="name_text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
…
• Restart the Solr server so that the synonym file is reloaded.
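
The contents of synonyms.txt are not shown in the deck. One plausible source is the redirects field collected during data preparation, but that link is an assumption. Solr's synonym file accepts one comma-separated group of equivalent terms per line, and the sketch below generates such lines. The function name and the sample data are made up for illustration.

```python
def synonym_lines(entities):
    """Yield one comma-separated synonym group per entity that has redirects."""
    for entity_id, data in sorted(entities.items()):
        name = entity_id.rsplit("/", 1)[-1].replace("_", " ")
        seen = {name.lower()}
        unique = []
        for alias in data.get("redirects", []):
            alias = alias.replace("_", " ")
            # Skip aliases that differ only by case, since the filter is
            # configured with ignoreCase="true".
            if alias.lower() not in seen:
                seen.add(alias.lower())
                unique.append(alias)
        if unique:
            yield ", ".join([name] + unique)


# Made-up sample data for illustration only.
sample = {
    "http://dbpedia.org/resource/San_Francisco": {
        "redirects": ["SF", "San_francisco", "Frisco"],
    },
}
lines = list(synonym_lines(sample))
print("\n".join(lines))
```

Names that themselves contain commas would need escaping before being written to the file; this sketch does not handle that case.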

Synonym handling at index time

Synonym handling at query time

Search - Synonym
Search by synonym.

Search - Full Text Search
• To be able to search across all entity text (name and abstract)
python feed_data.py full_text_search

• config: solr/conf/full_text_search/schema.xml
<copyField source="name" dest="text"/>
<copyField source="abstract" dest="text"/>
• Feed the entities to Solr. Each name and abstract
field will be copied to the text field. After that we
can do full text search without specifying a field to
search.
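
With the copyField rules in place, a query such as q=lake (no field prefix) is matched against the default search field, assumed here to be text. As a sketch of the client side, the snippet below pulls the hit count and documents out of a wt=json response. The sample response is hand-written in the shape Solr returns, and `extract_docs` is a hypothetical helper.

```python
import json


def extract_docs(raw_response):
    """Return (numFound, docs) from a Solr wt=json select response."""
    body = json.loads(raw_response)["response"]
    return body["numFound"], body["docs"]


# A trimmed, hand-written response in the shape Solr returns for wt=json.
raw = json.dumps({
    "responseHeader": {"status": 0, "QTime": 1},
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{"name": "Lake Yosemite", "abstract": "An artificial lake."}],
    },
})
found, docs = extract_docs(raw)
print(found, [doc["name"] for doc in docs])
```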

Search - Geo Search
• To be able to search by distance given a location
python feed_data.py geo_search

• config: solr/conf/geo_search/schema.xml
<field name="location" type="location" indexed="true" stored="true"
required="false" multiValued="false" />
• Feed the entities to Solr. Each entity contains a location
field in the format "51.670100,-3.230100" (latitude,longitude).
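
A distance-limited query can be expressed with Solr's geofilt filter and the sfield, pt and d parameters. The sketch below builds those parameters. It assumes the location field type is a lat/lon point type for which d is given in kilometers; `geo_filter_params` is a hypothetical helper name.

```python
def geo_filter_params(lat, lon, distance_km, field="location"):
    """Build Solr parameters that keep documents within distance_km of a point."""
    return {
        "fq": "{!geofilt}",            # filter by distance from pt
        "sfield": field,               # spatial field to filter on
        "pt": "{:.6f},{:.6f}".format(lat, lon),
        "d": str(distance_km),         # radius, assumed to be in km
        "sort": "geodist() asc",       # nearest results first
    }


params = geo_filter_params(51.6701, -3.2301, 10)
print(params["pt"])
```

These parameters would be sent together with a normal q parameter, for example q=*:* to list everything within the radius.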

Given condition on distance

Search - Put All Together
• Search strategy
1. Input a query.
2. Search by synonym match.
3. Search by full text.
   1. If a location is given, filter the results by geo
   search.
• Implement the search strategy as an API.
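
The strategy can be sketched as a plain function, independent of any web framework. This version reads the steps as a fallback: try the synonym core first and use full text only when nothing matches. That reading is an assumption, as is the idea that both cores also index a location field. The actual Solr call is injected as `search(core, params)`, so the logic can be exercised without a server.

```python
def search_knowledge(query, search, location=None, distance_km=10):
    """Run the search strategy using an injected search(core, params) callable.

    Assumed reading of the strategy: try the synonym core first and fall
    back to full text only when the synonym match finds nothing.
    """
    extra = {}
    if location is not None:
        lat, lon = location
        extra = {
            "fq": "{!geofilt}",
            "sfield": "location",
            "pt": "{},{}".format(lat, lon),
            "d": str(distance_km),
        }
    # Phrase query on the name field; embedded quotes are escaped.
    phrase = 'name:"{}"'.format(query.replace('"', r'\"'))
    docs = search("synonym_string_match", dict(extra, q=phrase))
    if docs:
        return docs
    return search("full_text_search", dict(extra, q=query))


def fake_search(core, params):
    """Stand-in for a real Solr call: only the full-text core has a hit."""
    if core == "full_text_search":
        return [{"name": "Lake Yosemite"}]
    return []


result = search_knowledge("artificial lake", fake_search)
print(result)
```

In a real deployment, `search` would wrap an HTTP request or a pysolr client for the named core.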

Implement the search strategy in a Django view

Review
• A Knowledge Base with synonym, full-text and geo
search API.
• The knowledge entities are connected by relations.

More Applications
• Question answering system:
1. Query analysis: identify the intent (e.g. looking
for a specific type of entity)
2. Search in the knowledge base
3. Return the knowledge entity
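
As a toy illustration of these three steps, the sketch below detects the requested entity type with a keyword table and filters aggregated entities by their relations.type value. The keyword table, the function names and the in-memory filtering (instead of a Solr query) are all assumptions for illustration. A real query analysis step would be far more involved.

```python
# Hypothetical keyword-to-type table; a real system would need far more.
TYPE_KEYWORDS = {
    "lake": "http://dbpedia.org/resource/Lake",
    "reservoir": "http://dbpedia.org/resource/Reservoir",
}


def detect_type(question):
    """Step 1: return the entity type IRI hinted at by the question, or None."""
    for word in question.lower().replace("?", " ").split():
        if word in TYPE_KEYWORDS:
            return TYPE_KEYWORDS[word]
    return None


def answer(question, entities):
    """Steps 2 and 3: return ids of entities whose type matches the question."""
    wanted = detect_type(question)
    if wanted is None:
        return []
    return sorted(
        entity_id
        for entity_id, data in entities.items()
        if data.get("relations", {}).get("type") == wanted
    )


knowledge = {
    "http://dbpedia.org/resource/Lake_Yosemite": {
        "relations": {"type": "http://dbpedia.org/resource/Reservoir"},
    },
}
matches = answer("Which reservoir is near Merced?", knowledge)
print(matches)
```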

Modern search engines don’t just provide web page URLs. They provide
direct answers to users.

More Data Sources and Knowledge Entities
• Open Data
• Open APIs

My Life at Yahoo!
• Build online services for billions of users.
• Big data mining on cloud infrastructures.
• Open and Innovative working environment.
• International teamwork and English communication.
• Business trips to Silicon Valley.
• Send me your resume if you need a referral.
r97922028 [at] ntu.edu.tw
