Being Google
The elements of full text search.


Usage Rights

© All Rights Reserved


    Being Google: Presentation Transcript

    • being google tom dyson
    • V.
    • metadata is easy
    • language is hard
    • Our Corpus: 1. The cow says moo. 2. The sheep says baa. 3. The dogs say woof. 4. The dog-cow says moof.
    • >>> doc1 = "The cow says moo."
      >>> doc2 = "The sheep says baa."
      >>> doc3 = "The dogs say woof."
      >>> doc4 = "The dog-cow says moof."
    • Brute force
      >>> docs = [doc1, doc2, doc3, doc4]
      >>> def searcher(term):
      ...     for doc in docs:
      ...         if term in doc:  # plain substring test on every document
      ...             print("found '%s' in '%s'" % (term, doc))
      ...
      >>> searcher('moo')
      found 'moo' in 'The cow says moo.'
      found 'moo' in 'The dog-cow says moof.'
    • my first inverted index
    • Tokenising #1
      >>> doc1.split()
      ['The', 'cow', 'says', 'moo.']
    • Tokenising #2
      >>> import re
      >>> word = re.compile(r'\W+')
      >>> word.split(doc1)
      ['The', 'cow', 'says', 'moo', '']
      >>> doc4 = "The dog-cow says moof"
      >>> word.split(doc4)
      ['The', 'dog', 'cow', 'says', 'moof']
    • Tokenising #3
      >>> word = re.compile(r'\s|[^a-z-]', re.I)
      >>> doc4 = "The dog-cow says moof."
      >>> word.split(doc4)
      ['The', 'dog-cow', 'says', 'moof', '']
    • Data structures
      >>> doc1 = {'name': 'doc 1', 'content': "The cow says moo."}
      >>> doc2 = {'name': 'doc 2', 'content': "The sheep says baa."}
      >>> doc3 = {'name': 'doc 3', 'content': "The dogs say woof."}
      >>> doc4 = {'name': 'doc 4', 'content': "The dog-cow says moof."}
      >>> docs = [doc1, doc2, doc3, doc4]
    • Postings
      >>> postings = {}
      >>> for doc in docs:
      ...     doc_name = doc['name']
      ...     for token in word.split(doc['content']):
      ...         if not token:
      ...             continue  # skip empty strings left by the split
      ...         token = token.lower()  # lowercase before testing membership
      ...         if token not in postings:
      ...             postings[token] = [doc_name]
      ...         else:
      ...             postings[token].append(doc_name)
    • Postings
      >>> postings
      {'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2', 'doc 4'], 'cow': ['doc 1'], 'moof': ['doc 4'], 'dog-cow': ['doc 4'], 'woof': ['doc 3'], 'say': ['doc 3'], 'moo': ['doc 1'], 'baa': ['doc 2'], 'the': ['doc 1', 'doc 2', 'doc 3', 'doc 4'], 'dogs': ['doc 3']}
    • O(log n)
      >>> def searcher(term):
      ...     if term in postings:
      ...         for match in postings[term]:
      ...             print("found '%s' in '%s'" % (term, match))
      ...
      >>> searcher('says')
      found 'says' in 'doc 1'
      found 'says' in 'doc 2'
      found 'says' in 'doc 4'
    • More postings
      'sheep': ['doc 2', [2]]
      'says': ['doc 1', [3], 'doc 2', [3], 'doc 4', [3]]
    • and more postings
      'sheep': ['doc 2', {'field': 'body'}, 2]
      'google': ['intro', {'field': 'title'}, 2]
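The positional postings on the two slides above could be built along these lines. This is a minimal sketch, not the presenter's code: the nested-dict layout, the `TOKEN` pattern, and the names `build_positional_index` and `phrase_search` are all assumptions made for illustration. Positions are 1-based word offsets, matching `'sheep': ['doc 2', [2]]`.

```python
import re
from collections import defaultdict

# The same four-document corpus used throughout the slides.
DOCS = {
    "doc 1": "The cow says moo.",
    "doc 2": "The sheep says baa.",
    "doc 3": "The dogs say woof.",
    "doc 4": "The dog-cow says moof.",
}

# Assumption: a token is a run of letters and hyphens.
TOKEN = re.compile(r"[a-z-]+", re.I)


def build_positional_index(docs):
    """Map each lowercased term to {doc name: [1-based word positions]}."""
    index = defaultdict(dict)
    for name, content in docs.items():
        for position, token in enumerate(TOKEN.findall(content), start=1):
            index[token.lower()].setdefault(name, []).append(position)
    return index


def phrase_search(index, first, second):
    """Return the docs in which `second` directly follows `first`."""
    hits = []
    for name, positions in index.get(first, {}).items():
        following = index.get(second, {}).get(name, [])
        if any(p + 1 in following for p in positions):
            hits.append(name)
    return sorted(hits)


index = build_positional_index(DOCS)
print(index["sheep"])
print(phrase_search(index, "says", "moo"))
```

Storing positions is what makes phrase queries possible: "says moo" matches doc 1 only, because that is the only document where `moo` sits one position after `says`.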
    • tokenising #3: Punctuation, Stemming, Stop words, Parts of Speech, Entity Extraction, Markup
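Three of the steps on the slide above (punctuation, stop words, stemming) can be sketched in a few lines. The stop-word list and the `naive_stem` rule are illustrative assumptions: the rule only strips a trailing "s", and a real system would more likely use an established stemmer such as Porter's.

```python
import re

# Assumption: a tiny, illustrative stop-word list.
STOP_WORDS = {"the", "a", "an", "and", "of"}

# Assumption: a token is a run of letters and hyphens, so punctuation is dropped.
TOKEN = re.compile(r"[a-z-]+", re.I)


def naive_stem(token):
    """Strip a single trailing 's' from longer words (hypothetical rule)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyse(text):
    """Lowercase, drop stop words, and stem the remaining tokens."""
    terms = []
    for token in TOKEN.findall(text):
        token = token.lower()
        if token in STOP_WORDS:
            continue
        terms.append(naive_stem(token))
    return terms


print(analyse("The dogs say woof."))  # ['dog', 'say', 'woof']
print(analyse("The cow says moo."))   # ['cow', 'say', 'moo']
```

With this analysis applied at both index and query time, `say` and `says` collapse to one term, so a search for either would reach all four documents in the corpus.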
    • Logistics: Storage (serialising, transporting, clustering), Updates, Warming up
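For the "serialising" item above, one simple option is to write the postings dictionary out as JSON and read it back. This is only a sketch under assumptions: the slides do not say which format was used, and the file name and the helper names `save_index` and `load_index` are made up for illustration.

```python
import json
import os
import tempfile

# A small slice of the postings built earlier in the slides.
postings = {"says": ["doc 1", "doc 2", "doc 4"], "sheep": ["doc 2"]}


def save_index(index, path):
    """Write the postings dictionary to disk as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(index, handle, sort_keys=True)


def load_index(path):
    """Read a postings dictionary back from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical location: a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "postings.json")
save_index(postings, path)
restored = load_index(path)
print(restored == postings)
```

JSON keeps the file readable and portable between machines; a large index would likely need a more compact format with on-disk lookup.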
    • ranking: Density (tf–idf), Position, Date, Relationships, Feedback, Editorial
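The "density" item above can be illustrated with a basic tf-idf score. This is a sketch of one common formulation, assumed here rather than taken from the slides: term frequency is the term's count divided by the document length, and inverse document frequency is log(N / df). The names `tokenise` and `rank` are hypothetical.

```python
import math
import re
from collections import Counter

# The same four-document corpus used throughout the slides.
DOCS = {
    "doc 1": "The cow says moo.",
    "doc 2": "The sheep says baa.",
    "doc 3": "The dogs say woof.",
    "doc 4": "The dog-cow says moof.",
}

# Assumption: a token is a run of letters and hyphens.
TOKEN = re.compile(r"[a-z-]+", re.I)


def tokenise(text):
    """Split text into lowercased tokens."""
    return [t.lower() for t in TOKEN.findall(text)]


def rank(query, docs):
    """Return (doc name, score) pairs, best first, using summed tf-idf."""
    term_counts = {name: Counter(tokenise(text)) for name, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for term in tokenise(query):
        # Document frequency: how many documents contain the term.
        df = sum(1 for counts in term_counts.values() if term in counts)
        if df == 0:
            continue
        idf = math.log(n_docs / df)
        for name, counts in term_counts.items():
            if term in counts:
                tf = counts[term] / sum(counts.values())
                scores[name] = scores.get(name, 0.0) + tf * idf
    # Sort by descending score, then by name so ties are deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


for name, score in rank("says moo", DOCS):
    print(name, round(score, 3))
```

For the query "says moo", doc 1 ranks first because `moo` is rare in the corpus and so carries a high idf, while `says` appears in three of four documents and contributes little. A term found in every document, such as `the`, gets an idf of zero and adds nothing to any score.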
    • interesting search: Lucene (Hadoop, Solr, Nutch), OpenFTS / MySQL, Sphinx, Hyper Estraier, Xapian, Other index types
    • being google tom dyson