Build a Searchable Knowledge Base

In this talk, the speaker will demonstrate how to build a searchable knowledge base from scratch. The process includes data wrangling, entity indexing and full text search.

  • Build a Searchable Knowledge Base
    Jimmy Lai, Yahoo! Search Engineer
    r97922028 [at] ntu.edu.tw
    2014/05/18
    http://www.slideshare.net/jimmy_lai/build-a-searchable-knowledge-base
  • Outline
    • Introduction to Knowledge Base
    • Construct a Knowledge Base
    • Search the Knowledge Base
      • string match
      • synonym search
      • full text search
      • geo search
      • put all together
    • More Applications
  • Knowledge
    • Knowledge is power. - Francis Bacon, 1597
    • Knowledge is boundless and connected, so an efficient interface to search and browse the knowledge base is essential.
    • Let’s try to build a searchable knowledge base.
  • Applications of a Knowledge Base
    • Personal assistants: Siri, Google Now
    • Search engines: Google’s Knowledge Graph
  • Construct a Knowledge Base
    1. Find good data sources.
    2. Aggregate the data into knowledge entities.
    3. Construct structured data for each knowledge entity.
    4. Search the knowledge base.
    5. Navigate the knowledge base.
  • Wikipedia
    • A collaboratively edited encyclopedia with more than 30M articles in 287 languages.
    • A good source for a knowledge base. However, Wikipedia's data is not well structured.
    http://www.theguardian.com/technology/blog/2009/aug/13/wikipedia-edits
  • DBpedia
    • http://wiki.dbpedia.org/About
    • Structured data from Wikipedia.
    • A good data source for a knowledge base.
  • Knowledge Entity
    • Identifier
    • Abstract
    • Relations
  • What can Python do for us
    • Data wrangling
      • Process the raw text data
      • Aggregate the data from different sources
      • Output the data in JSON format
    • Connecting the data flow between systems
      • Automation scripts for starting services and feeding data
      • REST API implementing the search strategy
  • Example code
    git clone git@github.com:jimmylai/knowledge.git
    https://github.com/jimmylai/knowledge
    • Required Python packages:
      1. fabric
      2. pysolr
      3. django
  • Data Preparation
    1. Download data from DBpedia: http://downloads.dbpedia.org/current/en/
    2. Filter for the specific knowledge entities of interest:
       bzcat instance_types_en.nt.bz2 | get_id_list.py
    3. Parse and aggregate the entity data from the files:

       data file                          script            data field
       short_abstracts_en.nt.bz2          get_abstract.py   abstract
       raw_infobox_properties_en.nt.bz2   get_relation.py   relations
       geo_coordinates_en.nt.bz2          get_geo.py        latlon
       redirects_en.nt.bz2                get_redirect.py   redirects
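The DBpedia dumps are N-Triples files, one `<subject> <predicate> object .` statement per line. The following is a minimal sketch of how such a file could be streamed and filtered by type. It is not the code of `get_id_list.py`; the function names and the ontology type used in the usage note are assumptions made for illustration.

```python
import bz2
import re

# Matches "<subject> <predicate> object ." where object is a URI or a literal.
TRIPLE_RE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')


def parse_triple(line):
    """Return (subject, predicate, object), or None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    subject, predicate, obj = match.groups()
    if obj.startswith('<') and obj.endswith('>'):
        obj = obj[1:-1]  # URI object: drop the angle brackets
    return subject, predicate, obj


def iter_triples(path):
    """Stream triples from a .nt.bz2 dump without unpacking it to disk."""
    with bz2.open(path, mode='rt', encoding='utf-8') as handle:
        for line in handle:
            triple = parse_triple(line)
            if triple is not None:
                yield triple


def ids_of_type(triples, wanted_type):
    """Collect the subjects whose object equals wanted_type."""
    return {subject for subject, _pred, obj in triples if obj == wanted_type}
```

For example, `ids_of_type(iter_triples('instance_types_en.nt.bz2'), 'http://dbpedia.org/ontology/Lake')` would collect the identifiers of lake entities, assuming that type URI is the one of interest.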
  • Aggregated Data Format
    "http://dbpedia.org/resource/Lake_Yosemite": {
      "latlon": "37.376389,-120.428889",
      "redirects": [ "Lake_yosemite" ],
      "abstract": "Lake Yosemite is an artificial freshwater lake located approximately five miles (8 km) east of Merced, California in the rolling Sierra Foothills. UC Merced is situated approximately half a mile (0.8 km) south of Lake Yosemite. The university is bounded by the lake on one side and two canals (Fairfield Canal and Le Grand Canal) run through the campus. In 2007, a myth featured in the Mythbusters' James Bond Special 1 episode was filmed and tested at Lake Yosemite.",
      "relations": {
        "type": "http://dbpedia.org/resource/Reservoir",
        "location": "http://dbpedia.org/resource/California"
      }
    }
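A record in this format can be assembled by merging the per-field outputs keyed by entity URI. The sketch below assumes each script's output has already been loaded into a dictionary. The function name and its arguments are hypothetical, not taken from the example repository.

```python
import json
from collections import defaultdict


def aggregate(abstracts, relations, latlons, redirects):
    """Merge per-field dictionaries keyed by entity URI into one record each."""
    entities = defaultdict(dict)
    for uri, abstract in abstracts.items():
        entities[uri]['abstract'] = abstract
    for uri, relation in relations.items():
        entities[uri]['relations'] = dict(relation)
    for uri, latlon in latlons.items():
        entities[uri]['latlon'] = latlon
    for uri, names in redirects.items():
        entities[uri]['redirects'] = sorted(names)
    return dict(entities)


if __name__ == '__main__':
    uri = 'http://dbpedia.org/resource/Lake_Yosemite'
    merged = aggregate(
        abstracts={uri: 'Lake Yosemite is an artificial freshwater lake.'},
        relations={uri: {'location': 'http://dbpedia.org/resource/California'}},
        latlons={uri: '37.376389,-120.428889'},
        redirects={uri: ['Lake_yosemite']},
    )
    print(json.dumps(merged, indent=2))
```

An entity that is missing from one of the inputs simply lacks that key, so downstream code should read fields with `dict.get`.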
  • Search by Solr
    • Solr is a full-text, real-time search engine based on Apache Lucene.
    • It provides a REST-like API.
    • pysolr makes Solr easy to use from Python.
    • Download version 4.8.0 (the latest at the time of the talk) from http://www.apache.org/dyn/closer.cgi/lucene/solr/4.8.0 and extract it to the solr/solr-4.8.0 directory.
    • Start the Solr server, then check the web UI:
      fab start_solr
      http://localhost:8983/solr/
  • Search - String Match
    • To be able to search by entity name:
      python feed_data.py string_match
    • config: solr/conf/string_match/schema.xml
      <field name="name" type="string" indexed="true" stored="true" multiValued="false"/>
      <field name="abstract" type="string" indexed="false" stored="true" multiValued="false"/>
    • Feed the entities to Solr. Each entity has name and abstract fields.
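Feeding could look roughly like the sketch below. It is not the contents of `feed_data.py`. Deriving the name from the last segment of the DBpedia URI is an assumption, as are the helper names. The `feed` function relies on the third-party pysolr package and a running Solr core.

```python
def entity_name(uri):
    """Derive a display name from a DBpedia resource URI (an assumption)."""
    return uri.rsplit('/', 1)[-1].replace('_', ' ')


def to_solr_docs(entities):
    """Turn aggregated entities into documents with name and abstract fields."""
    docs = []
    for uri, entity in sorted(entities.items()):
        docs.append({
            'name': entity_name(uri),
            'abstract': entity.get('abstract', ''),
        })
    return docs


def feed(docs, core_url='http://localhost:8983/solr/string_match'):
    """Send the documents to Solr; needs pysolr and a running server."""
    import pysolr  # imported lazily so the helpers above work without it
    solr = pysolr.Solr(core_url, timeout=10)
    solr.add(docs, commit=True)
```

If the schema declares a unique key field, each document would need to carry it as well. The slide only shows the name and abstract fields, so the sketch stops there.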
  • Search - String Match
    Search by entity name:
    http://localhost:8983/solr/string_match/select?q=name%3A%22San+Francisco%22&wt=json&indent=true
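The query URL is an ordinary URL-encoded `select` request, so it can be built with the standard library. This is a small sketch in which the helper name `select_url` is an assumption; the parameters `q`, `wt` and `indent` are the ones in the URL on the slide.

```python
from urllib.parse import urlencode


def select_url(core, field, value, base='http://localhost:8983/solr'):
    """Build a select URL that matches `value` as a quoted phrase in `field`."""
    # Escape backslashes and double quotes so the phrase stays well formed.
    escaped = value.replace('\\', '\\\\').replace('"', '\\"')
    params = [
        ('q', '{}:"{}"'.format(field, escaped)),
        ('wt', 'json'),
        ('indent', 'true'),
    ]
    return '{}/{}/select?{}'.format(base, core, urlencode(params))


if __name__ == '__main__':
    print(select_url('string_match', 'name', 'San Francisco'))
```

Calling it with the core `string_match`, the field `name` and the value `San Francisco` reproduces the URL above.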
  • Search - Synonym
    • To be able to search by a synonym of the entity name:
      python feed_data.py synonym_string_match
    • config: solr/conf/synonym_string_match/schema.xml
      <field name="name" type="name_text" indexed="true" stored="true" multiValued="false"/>
      <fieldType name="name_text" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        …
    • Restart the Solr server and the synonym file will be reloaded.
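One plausible source for `synonyms.txt` is the redirects field of the aggregated data, written as comma-separated groups of equivalent terms. The slides do not say how the file is produced, so the sketch below is an assumption throughout, including the function name and the sample redirects in the usage note.

```python
def synonym_lines(entities):
    """Yield one comma-separated synonym group per entity that has redirects."""
    for uri, entity in sorted(entities.items()):
        names = [uri.rsplit('/', 1)[-1]] + list(entity.get('redirects', []))
        seen, group = set(), []
        for name in names:
            # Underscores become spaces; literal commas are backslash-escaped.
            text = name.replace('_', ' ').replace(',', '\\,')
            # ignoreCase="true" makes case-only variants redundant, so skip them.
            if text.lower() not in seen:
                seen.add(text.lower())
                group.append(text)
        if len(group) > 1:
            yield ', '.join(group)


def write_synonyms(entities, path='synonyms.txt'):
    """Write the synonym groups to a file, one group per line."""
    with open(path, 'w', encoding='utf-8') as handle:
        for line in synonym_lines(entities):
            handle.write(line + '\n')
```

With a hypothetical entity `San_Francisco` whose redirects are `SF` and `San_Fran`, the generated line is `San Francisco, SF, San Fran`. An entity whose only redirect differs by case, such as `Lake_yosemite`, produces no line.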
  • Synonym handling at index time
  • Synonym handling at query time
  • Search - Synonym
    Search by synonym.
  • Search - Full Text Search
    • To be able to search the full text of each entity (name and abstract):
      python feed_data.py full_text_search
    • config: solr/conf/full_text_search/schema.xml
      <copyField source="name" dest="text"/>
      <copyField source="abstract" dest="text"/>
    • Feed the entities to Solr. The name and abstract fields of each entity are copied to the text field. After that we can do full text search without specifying a field to search.
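To make the effect of the two copyField rules concrete, here is a toy in-memory version of the idea: name and abstract are combined into a single text field, which is tokenized into an inverted index. It only illustrates the concept and is not how Solr is implemented. Requiring every query token to match is a simplification chosen for this sketch.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r'\w+')


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def build_index(docs):
    """Map each token of the combined name + abstract text to document ids."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        # Equivalent of copying both fields into one searchable "text" field.
        text = ' '.join([doc.get('name', ''), doc.get('abstract', '')])
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

A query such as `freshwater lake` then matches an entity whose name contains "Lake" and whose abstract contains "freshwater", without naming either field.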
  • Search - Full Text Search
  • Search - Geo Search
    • To be able to search by distance from a given location:
      python feed_data.py geo_search
    • config: solr/conf/geo_search/schema.xml
      <field name="location" type="location" indexed="true" stored="true" required="false" multiValued="false" />
    • Feed the entities to Solr. Each entity contains a location field in a format like "51.670100,-3.230100".
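A distance condition on such a field is commonly expressed with Solr's `geofilt` filter query. The sketch below builds those parameters and adds a haversine helper for checking a "lat,lon" string locally. The function names and the 6371 km mean Earth radius are choices made for this sketch, and the talk's own code may filter differently.

```python
import math


def geo_params(query, lat, lon, distance_km, field='location'):
    """Build Solr parameters that keep results within distance_km of a point."""
    fq = '{{!geofilt sfield={} pt={},{} d={}}}'.format(field, lat, lon, distance_km)
    return {'q': query, 'fq': fq, 'wt': 'json'}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def within(latlon, lat, lon, distance_km):
    """Check a 'lat,lon' string, as stored on the entity, against a radius."""
    entity_lat, entity_lon = (float(part) for part in latlon.split(','))
    return haversine_km(entity_lat, entity_lon, lat, lon) <= distance_km
```

The dictionary returned by `geo_params` can be URL-encoded into a `select` request or passed as keyword arguments to a client library.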
  • Given a condition on distance
  • Search - Put All Together
    • Search strategy
      1. Input a query.
      2. Search by synonym match.
      3. Search by full text.
         1. If the input includes a location, filter the results by geo search.
    • Implement the search strategy as an API.
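The strategy can be sketched independently of Solr and Django by passing the individual searches in as functions. How the two result lists are combined is not stated on the slide. This sketch assumes synonym matches rank first, full text matches follow, duplicates are dropped by name, and the geo filter applies to the merged list. All names here are hypothetical.

```python
def search_entities(query, synonym_search, full_text_search, geo_filter=None):
    """Combine synonym and full text results, optionally filtered by location.

    synonym_search and full_text_search take the query string and return a
    list of result dicts with a 'name' key. geo_filter, when given, takes one
    result dict and returns True if it lies within the requested distance.
    """
    results = []
    seen = set()
    # Synonym matches come first so exact entity hits rank above text hits.
    for doc in list(synonym_search(query)) + list(full_text_search(query)):
        if doc['name'] not in seen:
            seen.add(doc['name'])
            results.append(doc)
    if geo_filter is not None:
        results = [doc for doc in results if geo_filter(doc)]
    return results
```

Keeping the search functions as parameters means the same strategy can be exercised with stubs in tests and wired to real Solr queries inside a web view.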
  • Implement the search strategy in a Django view
  • Review
    • A knowledge base with synonym, full-text and geo search APIs.
    • The knowledge entities are connected by relations.
  • More Applications
    • Question answering system:
      1. Query analysis: identify the intent (e.g. looking for a specific type of entity).
      2. Search the knowledge base.
      3. Return the knowledge entity.
  • Modern search engines don’t just provide web page URLs. They provide direct answers to users.
  • More Data Sources and Knowledge Entities
    • Open Data
    • Open APIs
  • My Life at Yahoo!
    • Build online services for billions of users.
    • Big data mining on cloud infrastructures.
    • Open and innovative working environment.
    • International teamwork and English communication.
    • Business trips to Silicon Valley.
    • Send me your resume if you need a referral: r97922028 [at] ntu.edu.tw