Build a Searchable Knowledge Base

In this talk, the speaker will demonstrate how to build a searchable knowledge base from scratch. The process includes data wrangling, entity indexing and full text search.

Published in: Internet, Technology, Education

  1. Build a Searchable Knowledge Base. Jimmy Lai, Yahoo! Search Engineer, r97922028 [at] ntu.edu.tw. 2014/05/18. http://www.slideshare.net/jimmy_lai/build-a-searchable-knowledge-base
  2. Outline • Introduction to Knowledge Base • Construct a Knowledge Base • Search the Knowledge Base: string match, synonym search, full text search, geo search, put all together • More Applications
  3. Knowledge • "Knowledge is power." - Francis Bacon, 1597 • Knowledge is boundless and connected, so an efficient interface to search and browse the knowledge base is essential. • Let's try to build a searchable knowledge base.
  4. Applications of a Knowledge Base • Personal assistant: Siri, Google Now • Search engine: Google's Knowledge Graph
  5. Construct a Knowledge Base 1. Find good data sources. 2. Aggregate the data into knowledge entities. 3. Construct structured data for each knowledge entity. 4. Search the knowledge base. 5. Navigate the knowledge base.
  6. Wikipedia • A collaborative encyclopedia with more than 30M articles in 287 languages. • A good source for a knowledge base. However, the data in Wikipedia is not well structured. http://www.theguardian.com/technology/blog/2009/aug/13/wikipedia-edits
  7. DBpedia • http://wiki.dbpedia.org/About • Structured data from Wikipedia. • A good data source for a knowledge base.
  8. [image slide, no text]
  9. Knowledge Entity • Identifier • Abstract • Relations
  10. What can Python do for us • Data wrangling: process the raw text data; aggregate the data from different sources; output the data in JSON format • Connecting the data flow between systems: automation scripts for starting services and feeding data; a REST API implementing the search strategy
  11. Example code • git clone git@github.com:jimmylai/knowledge.git • https://github.com/jimmylai/knowledge • Required Python packages: 1. fabric 2. pysolr 3. django
  12. Data Preparation 1. Download data from DBpedia: http://downloads.dbpedia.org/current/en/ 2. Filter out some specific knowledge entities: bzcat instance_types_en.nt.bz2 | get_id_list.py 3. Parse and aggregate the entity data from the files, listed as data file / script / data field: • short_abstracts_en.nt.bz2 / get_abstract.py / abstract • raw_infobox_properties_en.nt.bz2 / get_relation.py / relations • geo_coordinates_en.nt.bz2 / get_geo.py / latlon • redirects_en.nt.bz2 / get_redirect.py / redirects
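
The parsing step above can be sketched as follows. This is a minimal illustration, not the code of get_abstract.py: it assumes each line of the dump is one N-Triples statement of the form `<subject> <predicate> "literal"@en .`, it does not decode N-Triples escape sequences, and the sample data and function names are hypothetical.

```python
import bz2
import io
import re

# Matches one N-Triples line: <subject> <predicate> object .
TRIPLE_RE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+(.+?)\s*\.\s*$')
# Matches a quoted literal with an optional language tag or datatype.
LITERAL_RE = re.compile(r'^"(.*)"(?:@[\w-]+|\^\^<[^>]+>)?$')


def parse_triples(lines):
    """Yield (subject, predicate, object) tuples, skipping comments and blanks."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        match = TRIPLE_RE.match(line)
        if match:
            yield match.groups()


def literal_text(obj):
    """Strip the quotes and language tag from a literal such as "text"@en."""
    match = LITERAL_RE.match(obj)
    return match.group(1) if match else obj


def read_abstracts(fileobj):
    """Map each subject URI to its abstract text."""
    return {subj: literal_text(obj) for subj, _pred, obj in parse_triples(fileobj)}


# Hypothetical sample, compressed in memory to stand in for a .nt.bz2 dump.
SAMPLE = (
    '# sample dump\n'
    '<http://dbpedia.org/resource/Lake_Yosemite> '
    '<http://www.w3.org/2000/01/rdf-schema#comment> '
    '"Lake Yosemite is an artificial freshwater lake."@en .\n'
)
compressed = io.BytesIO(bz2.compress(SAMPLE.encode('utf-8')))
with bz2.open(compressed, 'rt', encoding='utf-8') as dump:
    abstracts = read_abstracts(dump)
```

Reading through `bz2.open` in text mode streams the file line by line, so the same function should also work on a full dump without loading it into memory first.
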
  13. Aggregated Data Format "http://dbpedia.org/resource/Lake_Yosemite": { "latlon": "37.376389,-120.428889", "redirects": [ "Lake_yosemite" ], "abstract": "Lake Yosemite is an artificial freshwater lake located approximately five miles (8 km) east of Merced, California in the rolling Sierra Foothills. UC Merced is situated approximately half a mile (0.8 km) south of Lake Yosemite. The university is bounded by the lake on one side and two canals (Fairfield Canal and Le Grand Canal) run through the campus. In 2007, a myth featured in the Mythbusters' James Bond Special 1 episode was filmed and tested at Lake Yosemite.", "relations": { "type": "http://dbpedia.org/resource/Reservoir", "location": "http://dbpedia.org/resource/California" } }
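
One way to produce records in this shape is to merge the per-field lookups by entity URI. The sketch below assumes each parsing script yields a plain dict keyed by URI; the `aggregate` function and its argument names are illustrative and not taken from the example repository.

```python
import json


def aggregate(id_list, abstracts, latlons, redirects, relations):
    """Merge per-field lookups into one record per entity URI."""
    entities = {}
    for uri in id_list:
        entity = {}
        if uri in latlons:
            entity['latlon'] = latlons[uri]
        if uri in redirects:
            entity['redirects'] = sorted(redirects[uri])
        if uri in abstracts:
            entity['abstract'] = abstracts[uri]
        if uri in relations:
            entity['relations'] = dict(relations[uri])
        entities[uri] = entity
    return entities


# Values taken from the Lake Yosemite record above (abstract shortened).
URI = 'http://dbpedia.org/resource/Lake_Yosemite'
entities = aggregate(
    id_list=[URI],
    abstracts={URI: 'Lake Yosemite is an artificial freshwater lake.'},
    latlons={URI: '37.376389,-120.428889'},
    redirects={URI: ['Lake_yosemite']},
    relations={URI: {'type': 'http://dbpedia.org/resource/Reservoir',
                     'location': 'http://dbpedia.org/resource/California'}},
)
print(json.dumps(entities, indent=2, sort_keys=True))
```

Fields missing from a source file are simply left out of the record, so entities without coordinates or redirects still get an entry.
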
  14. Search by Solr • Solr is a full-text, real-time search engine based on Apache Lucene. • It provides a REST-like API. • pysolr makes Solr easy to use from Python. • Download version 4.8.0 (the latest at the time of the talk) from http://www.apache.org/dyn/closer.cgi/lucene/solr/4.8.0 and extract it to the solr/solr-4.8.0 directory. • Start the Solr server with fab start_solr, then check the web UI at http://localhost:8983/solr/
  15. Search - String Match • To be able to search by entity name: python feed_data.py string_match • config: solr/conf/string_match/schema.xml <field name="name" type="string" indexed="true" stored="true" multiValued="false"/> <field name="abstract" type="string" indexed="false" stored="true" multiValued="false"/> • Feed the entities to Solr, each with name and abstract fields.
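
The feeding step can be sketched as a conversion from aggregated entities to Solr documents. This is not the code of feed_data.py. It assumes the entity name is derived from the last segment of the DBpedia URI, and it adds an `id` field as a unique key, which the schema excerpt above does not show. The pysolr call appears only in a comment so that the block runs without a Solr server.

```python
def entity_name(uri):
    """Derive a readable name from a DBpedia resource URI (assumed convention)."""
    return uri.rsplit('/', 1)[-1].replace('_', ' ')


def to_solr_docs(entities):
    """Build one document per entity with the name and abstract fields."""
    docs = []
    for uri, entity in sorted(entities.items()):
        docs.append({
            'id': uri,  # assumption: the core uses `id` as its unique key
            'name': entity_name(uri),
            'abstract': entity.get('abstract', ''),
        })
    return docs


sample_entities = {
    'http://dbpedia.org/resource/Lake_Yosemite': {
        'abstract': 'Lake Yosemite is an artificial freshwater lake.',
    },
}
docs = to_solr_docs(sample_entities)

# With a running server, the documents could then be sent with pysolr:
#   import pysolr
#   solr = pysolr.Solr('http://localhost:8983/solr/string_match')
#   solr.add(docs)
```
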
  16. Search - String Match • Search by entity name: http://localhost:8983/solr/string_match/select?q=name%3A%22San+Francisco%22&wt=json&indent=true
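
The query URL on this slide can be built with the standard library instead of being escaped by hand. The helper name `select_url` is hypothetical; the host, core name and parameters are the ones shown on the slide.

```python
from urllib.parse import urlencode

SOLR_BASE = 'http://localhost:8983/solr'


def select_url(core, field, value):
    """Build a Solr select URL that matches `value` as an exact phrase in `field`."""
    # Escape backslashes and double quotes so the value stays inside the phrase.
    phrase = value.replace('\\', '\\\\').replace('"', '\\"')
    params = [
        ('q', '{}:"{}"'.format(field, phrase)),
        ('wt', 'json'),
        ('indent', 'true'),
    ]
    return '{}/{}/select?{}'.format(SOLR_BASE, core, urlencode(params))


url = select_url('string_match', 'name', 'San Francisco')
print(url)
```

Because the `name` field has type `string`, Solr compares the whole value verbatim, so a query for "san francisco" in lower case would not match in this configuration.
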
  17. Search - Synonym • To be able to search by a synonym of the entity name: python feed_data.py synonym_string_match • config: solr/conf/synonym_string_match/schema.xml <field name="name" type="name_text" indexed="true" stored="true" multiValued="false"/> <fieldType name="name_text" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> … • Restart the Solr server to reload the synonym file.
  18. Synonym handling at index time
  19. Synonym handling at query time
  20. Search - Synonym • Search by synonym.
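
The slides do not show how synonyms.txt is populated. One plausible source is the `redirects` field of each aggregated entity, which is an assumption made here for illustration. The sketch emits explicit mappings in the Solr synonym file syntax (`alias1, alias2 => canonical name`) and skips names containing a comma or `=>` rather than trying to escape them.

```python
def label(resource):
    """Turn a DBpedia resource name or URI into a readable label."""
    return resource.rsplit('/', 1)[-1].replace('_', ' ')


def synonym_lines(entities):
    """Build one `aliases => name` line per entity that has redirects."""
    lines = []
    for uri, entity in sorted(entities.items()):
        name = label(uri)
        aliases = sorted({label(r) for r in entity.get('redirects', [])} - {name})
        # Skip anything that would clash with the synonym file syntax.
        terms = aliases + [name]
        if not aliases or any(',' in t or '=>' in t for t in terms):
            continue
        lines.append('{} => {}'.format(', '.join(aliases), name))
    return lines


# Hypothetical redirects, for illustration only.
sample = {
    'http://dbpedia.org/resource/San_Francisco': {'redirects': ['SF', 'San_Fran']},
    'http://dbpedia.org/resource/California': {},
}
lines = synonym_lines(sample)
print('\n'.join(lines))
```

Multi-word synonyms interact with the tokenizer, so the resulting file is worth checking in the analysis screen of the Solr web UI before relying on it.
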
  21. Search - Full Text Search • To be able to search across the full text of each entity: python feed_data.py full_text_search • config: solr/conf/full_text_search/schema.xml <copyField source="name" dest="text"/> <copyField source="abstract" dest="text"/> • Feed the entities to Solr. The name and abstract fields of each entity are copied to the text field, after which we can run a full text search without specifying a field.
  22. Search - Full Text Search
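
To make the effect of the two copyField rules concrete, here is a toy inverted index in pure Python. It is a conceptual illustration only and not how Solr or Lucene is implemented: the name and abstract are merged into one text, tokenized, and indexed together, so a query can hit either field without naming it.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r'\w+', text.lower())


def build_index(docs):
    """Map each token to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc in docs:
        # Mimics copying both `name` and `abstract` into a single `text` field.
        text = ' '.join([doc['name'], doc['abstract']])
        for token in tokenize(text):
            index[token].add(doc['id'])
    return index


def search(index, query):
    """Return the ids of documents containing every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = [{
    'id': 'Lake_Yosemite',
    'name': 'Lake Yosemite',
    'abstract': 'An artificial freshwater lake east of Merced, California.',
}]
index = build_index(docs)
```

A query such as `search(index, 'freshwater merced')` matches on the abstract, while `search(index, 'yosemite')` matches on the name, which is the behaviour the combined text field is meant to provide.
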
  23. Search - Geo Search • To be able to search by distance from a given location: python feed_data.py geo_search • config: solr/conf/geo_search/schema.xml <field name="location" type="location" indexed="true" stored="true" required="false" multiValued="false" /> • Feed the entities to Solr. Each entity contains a location field in a format like "51.670100,-3.230100".
  24. Search - Geo Search • Given a condition on distance.
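
A distance condition can be expressed with Solr's `geofilt` filter, which takes the spatial field (`sfield`), a centre point (`pt`) and a radius in kilometres (`d`). The sketch below builds those parameters and also includes a haversine function to show the distance that such a filter compares against. The helper names are hypothetical, and the haversine formula is an approximation that treats the Earth as a sphere.

```python
import math

EARTH_RADIUS_KM = 6371.0


def parse_latlon(value):
    """Parse a "lat,lon" string such as "51.670100,-3.230100"."""
    lat, lon = value.split(',')
    return float(lat), float(lon)


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = (math.radians(v) for v in a)
    lat2, lon2 = (math.radians(v) for v in b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def geofilt_params(point, distance_km, field='location'):
    """Query parameters that restrict results to a radius around `point`."""
    return {
        'fq': '{!geofilt}',
        'sfield': field,
        'pt': '{:.6f},{:.6f}'.format(*point),
        'd': str(distance_km),
    }


lake = parse_latlon('37.376389,-120.428889')  # Lake Yosemite, from the record above
params = geofilt_params(lake, 50)
```

These parameters would be passed alongside the main `q` parameter of a select request against the geo_search core.
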
  25. Search - Put All Together • Search strategy: 1. Input a query. 2. Search by synonym match. 3. Search by full text. If a location is given, filter the results by geo search. • Implement the search strategy as an API.
  26. Implement the search strategy in a Django view
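
The code of the Django view is not part of this transcript, so the following is only a framework-free sketch of the strategy from the previous slide. It makes one interpretation explicit: synonym matches are listed first, full text matches are appended without duplicates, and the geo filter is applied only when a location is supplied. The three backends are passed in as plain callables, which are hypothetical stand-ins for the Solr queries.

```python
def search_knowledge_base(query, synonym_search, full_text_search,
                          geo_filter=None, location=None):
    """Combine the search backends in the order given by the strategy."""
    results = []
    seen = set()
    # Steps 2 and 3: synonym matches first, then full text matches.
    for backend in (synonym_search, full_text_search):
        for doc_id in backend(query):
            if doc_id not in seen:
                seen.add(doc_id)
                results.append(doc_id)
    # Optional step: keep only results near the requested location.
    if location is not None and geo_filter is not None:
        results = [doc_id for doc_id in results if geo_filter(doc_id, location)]
    return results


# Fake backends standing in for the Solr cores, for illustration only.
def fake_synonym_search(query):
    return ['San_Francisco'] if query.lower() == 'sf' else []


def fake_full_text_search(query):
    return ['San_Francisco', 'SF_Giants', 'Lake_Yosemite']


def fake_geo_filter(doc_id, location):
    return doc_id != 'Lake_Yosemite'


hits = search_knowledge_base('SF', fake_synonym_search, fake_full_text_search)
nearby = search_knowledge_base('SF', fake_synonym_search, fake_full_text_search,
                               geo_filter=fake_geo_filter,
                               location='37.7749,-122.4194')
```

In a Django view, the query and location would typically be read from the request parameters and the result list returned as JSON.
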
  27. [image slide, no text]
  28. Review • A knowledge base with synonym, full-text and geo search APIs. • The knowledge entities are connected by relations.
  29. More Applications • Question answering system: 1. Query analysis: identify the intent (e.g. looking for a specific type of entity). 2. Search the knowledge base. 3. Return the knowledge entity.
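
The three question answering steps can be sketched over the aggregated entity format. Everything here is a simplified assumption: the keyword table is hypothetical, the intent detection is a plain word lookup, and the search step is a linear scan over the `relations` of each entity rather than a Solr query.

```python
# Hypothetical keyword-to-type table; real query analysis would be richer.
TYPE_KEYWORDS = {
    'reservoir': 'http://dbpedia.org/resource/Reservoir',
    'city': 'http://dbpedia.org/resource/City',
}


def detect_type(question):
    """Step 1: guess which entity type the question asks for."""
    for word in question.lower().replace('?', ' ').split():
        singular = word[:-1] if word.endswith('s') else word
        for candidate in (word, singular):
            if candidate in TYPE_KEYWORDS:
                return TYPE_KEYWORDS[candidate]
    return None


def answer(question, entities):
    """Steps 2 and 3: return the URIs of entities of the detected type."""
    wanted = detect_type(question)
    if wanted is None:
        return []
    return sorted(uri for uri, entity in entities.items()
                  if entity.get('relations', {}).get('type') == wanted)


knowledge = {
    'http://dbpedia.org/resource/Lake_Yosemite': {
        'relations': {'type': 'http://dbpedia.org/resource/Reservoir',
                      'location': 'http://dbpedia.org/resource/California'},
    },
    'http://dbpedia.org/resource/California': {},
}
found = answer('Which reservoirs are there?', knowledge)
```
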
  30. Modern search engines don't just provide web page URLs. They provide direct answers to users.
  31. More Data Sources and Knowledge Entities • Open Data • Open APIs
  32. My Life in Yahoo! • Build online services for billions of users. • Big data mining on cloud infrastructures. • Open and innovative working environment. • International teamwork and English communication. • Business trips to Silicon Valley. • Send me your resume if you need a referral: r97922028 [at] ntu.edu.tw
