Building a scalable distributed WWW search engine … NOT in Perl!
Presented by Alex Chudnovsky (http://www.majestic12.co.uk) at Birmingham Perl Mongers User Group (http://birmingham.pm.org)
V1.0 27/07/05
Merging: turning indexed barrels into single searchable index
Searching: locating documents for given keywords
Data collection (crawling)
Base: issues URLs to crawl and receives compressed pages.
Distributed crawlers: receive lists of URLs to crawl, crawl them and send back compressed data. In the future they will also do distributed indexing.
Note: this stage is optional if you already have data to index, e.g. a list of products with their descriptions.
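The base/crawler exchange described above could be sketched roughly as follows. This is only an illustration, not the project's actual protocol: the function names, the JSON bundle format and the use of zlib are assumptions, and `fetch` is a hypothetical callable standing in for a real HTTP client.

```python
import json
import zlib

def crawl_batch(urls, fetch):
    """Crawler side: fetch every URL in the list and return one compressed blob.

    `fetch` is a hypothetical callable (url -> page text) supplied by the caller.
    """
    pages = {url: fetch(url) for url in urls}
    # Assumption: pages are bundled as JSON and compressed with zlib.
    return zlib.compress(json.dumps(pages).encode("utf-8"))

def receive_batch(blob):
    """Base side: decompress a blob sent back by a crawler into {url: page text}."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Stand-in for real HTTP fetching, so the sketch runs without network access.
def fake_fetch(url):
    return "<html>content of %s</html>" % url

blob = crawl_batch(["http://example.com/a", "http://example.com/b"], fake_fetch)
pages = receive_batch(blob)
```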
Current Stats Source: http://www.majestic12.co.uk/projects/dsearch/stats.php as of 27/07/05
Indexing
Indexing is the process of turning words into numbers and creating an inverted index.

Data barrel:
  Doc #0: Birmingham Perl Mongers
  Doc #1: Birmingham City
  Doc #2: Perl City

Lexicon (maps words to their numeric WordIDs):
  Birmingham – 0
  Perl – 1
  Mongers – 2
  City – 3

Inverted index (each WordID has a list of DocIDs, ideally sorted):
  0 -> 0, 1
  1 -> 0, 2
  2 -> 0
  3 -> 1, 2

Note: if you use a database, it makes sense to have a clustered index on WordID.
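As a minimal sketch of this step, the code below builds a lexicon and an inverted index for the three-document barrel from the slide. The function name `build_index` and the whitespace tokenisation are assumptions made for illustration; a real indexer would also normalise case, strip punctuation and so on.

```python
from collections import defaultdict

def build_index(docs):
    """Return (lexicon, inverted_index) for a list of document strings."""
    lexicon = {}                  # word -> WordID
    inverted = defaultdict(list)  # WordID -> list of DocIDs
    for doc_id, text in enumerate(docs):
        for word in text.split():
            # Assign the next free WordID the first time a word is seen.
            word_id = lexicon.setdefault(word, len(lexicon))
            postings = inverted[word_id]
            # DocIDs arrive in increasing order, so appending keeps the list
            # sorted; the check avoids duplicates for repeated words.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return lexicon, dict(inverted)

barrel = ["Birmingham Perl Mongers", "Birmingham City", "Perl City"]
lexicon, inverted = build_index(barrel)
print(lexicon)   # {'Birmingham': 0, 'Perl': 1, 'Mongers': 2, 'City': 3}
print(inverted)  # {0: [0, 1], 1: [0, 2], 2: [0], 3: [1, 2]}
```

Keeping each DocID list sorted costs nothing here because documents are processed in DocID order, and it is what makes a cheap list intersection possible at search time.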
Merging
Individual indexed barrels -> single searchable index
Note: this stage is not necessary if just one barrel is used, as there will be no need to remap IDs from local to their global equivalents.
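The remapping from barrel-local IDs to global IDs might look something like the sketch below. It assumes each barrel carries its own lexicon, inverted index and document count, and that global DocIDs are assigned by offsetting local DocIDs by the number of documents in the preceding barrels; the slides do not say how the real merger assigns IDs, so treat `merge_barrels` and this scheme as hypothetical.

```python
def merge_barrels(barrels):
    """Merge barrels given as (lexicon, inverted, doc_count) tuples.

    All IDs inside a barrel are local to it; the result uses global IDs.
    """
    global_lexicon = {}  # word -> global WordID
    global_index = {}    # global WordID -> sorted list of global DocIDs
    doc_offset = 0       # assumption: global DocID = offset + local DocID
    for lexicon, inverted, doc_count in barrels:
        for word, local_word_id in lexicon.items():
            global_word_id = global_lexicon.setdefault(word, len(global_lexicon))
            postings = global_index.setdefault(global_word_id, [])
            # Offsets grow from barrel to barrel, so the merged list stays sorted.
            postings.extend(doc_offset + d for d in inverted.get(local_word_id, []))
        doc_offset += doc_count
    return global_lexicon, global_index

# Barrel A holds "Birmingham Perl Mongers"; barrel B holds "Birmingham City"
# and "Perl City". Each barrel numbers its words and documents from zero.
barrel_a = ({"Birmingham": 0, "Perl": 1, "Mongers": 2}, {0: [0], 1: [0], 2: [0]}, 1)
barrel_b = ({"Birmingham": 0, "City": 1, "Perl": 2}, {0: [0], 1: [0, 1], 2: [1]}, 2)

merged_lexicon, merged_index = merge_barrels([barrel_a, barrel_b])
print(merged_lexicon)  # {'Birmingham': 0, 'Perl': 1, 'Mongers': 2, 'City': 3}
print(merged_index)    # {0: [0, 1], 1: [0, 2], 2: [0], 3: [1, 2]}
```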
Searching
Searching is the process of finding documents that contain the words from a search query.

Documents:
  Doc #0: Birmingham Perl Mongers
  Doc #1: Birmingham City
  Doc #2: Perl City

Lexicon (maps words to their numeric WordIDs):
  Birmingham – 0
  Perl – 1
  Mongers – 2
  City – 3

Inverted index (lists DocIDs for each WordID):
  0 -> 0, 1
  1 -> 0, 2
  2 -> 0
  3 -> 1, 2

Note: if you use a database, it makes sense to cluster on WordID.

Search query: “Birmingham Perl”
WordIDs: 0, 1
Intersection of DocIDs present in both lists (implementation of Boolean AND logic):

  0 (Brum)   1 (Perl)   Result
  0          0          Matched!
  1          n/a        Not matched!
  n/a        2          Not matched!
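The Boolean AND step can be sketched as a two-pointer walk over sorted DocID lists. The lexicon and inverted index below are copied from the slide; the function names `intersect_sorted` and `search` are illustrative, and the real engine may intersect lists differently.

```python
def intersect_sorted(a, b):
    """Return the DocIDs present in both sorted lists."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])  # DocID is in both lists: matched
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1               # a[i] cannot appear later in b: not matched
        else:
            j += 1               # b[j] cannot appear later in a: not matched
    return result

def search(query, lexicon, inverted):
    """Return DocIDs of documents containing every word of the query."""
    word_ids = []
    for word in query.split():
        if word not in lexicon:
            return []            # AND logic: an unknown word matches nothing
        word_ids.append(lexicon[word])
    if not word_ids:
        return []
    result = list(inverted.get(word_ids[0], []))
    for word_id in word_ids[1:]:
        result = intersect_sorted(result, inverted.get(word_id, []))
    return result

lexicon = {"Birmingham": 0, "Perl": 1, "Mongers": 2, "City": 3}
inverted = {0: [0, 1], 1: [0, 2], 2: [0], 3: [1, 2]}
print(search("Birmingham Perl", lexicon, inverted))  # [0]
```

The walk touches each DocID at most once, which is why keeping the lists sorted at indexing time pays off.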