Being Google

Our Corpus:
1. The cow says moo.
2. The sheep says baa.
3. The dogs say woof.
4. The dog-cow says moof.

>>> doc1 = "The cow says moo."
>>> doc2 = "The sheep says baa."
>>> doc3 = "The dogs say woof."
>>> doc4 = "The dog-cow says moof."

Brute force
>>> docs = [doc1, doc2, doc3, doc4]

>>> def searcher(term):
...     for doc in docs:
...         if doc.find(term) > -1:
...             print("found '%s' in '%s'" % (term, doc))
...
>>> searcher('moo')
found 'moo' in 'The cow says moo.'
found 'moo' in 'The dog-cow says moof.'

Tokenising #1
>>> doc1.split()
['The', 'cow', 'says', 'moo.']

Tokenising #2
>>> import re
>>> word = re.compile(r'\W+')
>>> word.split(doc1)
['The', 'cow', 'says', 'moo', '']

>>> doc4 = "The dog-cow says moof"
>>> word.split(doc4)
['The', 'dog', 'cow', 'says', 'moof']

Tokenising #3

>>> word = re.compile(r'\s|[^a-z-]', re.I)
>>> word.split("The dog-cow says moof.")
['The', 'dog-cow', 'says', 'moof', '']
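
The three tokenising attempts above can be wrapped into one helper. This is a minimal sketch, assuming the whitespace-or-non-letter pattern from Tokenising #3; the name `tokenise` and the decision to lowercase and drop empty strings are illustrative choices, not something the slides prescribe.

```python
import re

# Assumed pattern: split on whitespace, or on anything that is
# not a letter or a hyphen (case-insensitive).
WORD_BOUNDARY = re.compile(r"\s|[^a-z-]", re.I)

def tokenise(text):
    """Return the lowercased, non-empty tokens found in text."""
    return [t.lower() for t in WORD_BOUNDARY.split(text) if t]

print(tokenise("The dog-cow says moof."))
# ['the', 'dog-cow', 'says', 'moof']
```

Filtering out empty strings here removes the trailing '' that a final full stop otherwise leaves behind.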

Data structures
>>> doc1 = {'name': 'doc 1', 'content': "The cow says moo."}
>>> doc2 = {'name': 'doc 2', 'content': "The sheep says baa."}
>>> doc3 = {'name': 'doc 3', 'content': "The dogs say woof."}
>>> doc4 = {'name': 'doc 4', 'content': "The dog-cow says moof."}
>>> docs = [doc1, doc2, doc3, doc4]

Postings
>>> postings = {}

>>> for doc in docs:
...     for token in word.split(doc['content']):
...         if len(token) == 0: continue
...         token = token.lower()
...         doc_name = doc['name']
...         if token not in postings:
...             postings[token] = [doc_name]
...         else:
...             postings[token].append(doc_name)

Postings
>>> postings
{'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2',
'doc 4'], 'cow': ['doc 1'], 'moof': ['doc 4'],
'dog-cow': ['doc 4'], 'woof': ['doc 3'], 'say':
['doc 3'], 'moo': ['doc 1'], 'baa': ['doc 2'],
'the': ['doc 1', 'doc 2', 'doc 3', 'doc 4'],
'dogs': ['doc 3']}

O(1) on average (dict lookup)
>>> def searcher(term):
...     if term in postings:
...         for match in postings[term]:
...             print("found '%s' in '%s'" % (term, match))
...
>>> searcher('says')
found 'says' in 'doc 1'
found 'says' in 'doc 2'
found 'says' in 'doc 4'

More postings
'sheep': ['doc 2', [2]]
'says': ['doc 1', [3], 'doc 2', [3], 'doc 4', [3]]
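
Postings that carry word positions could be built roughly as follows. This is a sketch only: it stores a dict of document name to position list rather than the flat list shown on the slide, counts positions from 1, and the function name `build_positional_index` is hypothetical.

```python
import re
from collections import defaultdict

word = re.compile(r"\s|[^a-z-]", re.I)

docs = {
    "doc 1": "The cow says moo.",
    "doc 2": "The sheep says baa.",
    "doc 3": "The dogs say woof.",
    "doc 4": "The dog-cow says moof.",
}

def build_positional_index(docs):
    """Map each token to {document name: [1-based word positions]}."""
    index = defaultdict(dict)
    for name, content in docs.items():
        tokens = [t.lower() for t in word.split(content) if t]
        for position, token in enumerate(tokens, start=1):
            index[token].setdefault(name, []).append(position)
    return index

index = build_positional_index(docs)
print(index["sheep"])  # {'doc 2': [2]}
print(index["says"])   # {'doc 1': [3], 'doc 2': [3], 'doc 4': [3]}
```

Keeping positions makes phrase and proximity matching possible later, at the cost of a larger index.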

and more postings
'sheep': ['doc 2', {'field': 'body'}, [2]]
'google': ['intro', {'field': 'title'}, [2]]
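
Field-aware postings might be produced like this. The documents, their `title` and `body` fields, and the tuple layout `(name, field, position)` are all assumptions made for illustration; the slides only hint at the shape.

```python
import re
from collections import defaultdict

word = re.compile(r"\s|[^a-z-]", re.I)

# Hypothetical documents: names, fields and text are illustrative only.
docs = [
    {"name": "intro", "title": "Being Google", "body": "How search engines work."},
    {"name": "doc 2", "title": "Sheep", "body": "The sheep says baa."},
]

def build_field_index(docs, fields=("title", "body")):
    """Map each token to a list of (document name, field, 1-based position)."""
    index = defaultdict(list)
    for doc in docs:
        for field in fields:
            tokens = [t.lower() for t in word.split(doc[field]) if t]
            for position, token in enumerate(tokens, start=1):
                index[token].append((doc["name"], field, position))
    return index

index = build_field_index(docs)
print(index["google"])  # [('intro', 'title', 2)]
print(index["sheep"])   # [('doc 2', 'title', 1), ('doc 2', 'body', 2)]
```

Recording the field lets a ranker treat a match in a title differently from a match in the body.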

tokenising #3
Punctuation
Stemming
Stop words
Parts of Speech
Entity Extraction
Markup
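
Two of the concerns listed above, stop words and stemming, can be sketched in a few lines. The stop-word list is a tiny assumed sample and `naive_stem` is a toy rule that only strips a trailing "s"; a real system would use a proper stemmer.

```python
import re

word = re.compile(r"\s|[^a-z-]", re.I)

# Assumed, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "of"}

def naive_stem(token):
    """Strip one trailing 's' from tokens longer than three letters (toy rule)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def normalise(text):
    """Tokenise, lowercase, drop stop words, then stem what is left."""
    tokens = [t.lower() for t in word.split(text) if t]
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(normalise("The dogs say woof."))  # ['dog', 'say', 'woof']
print(normalise("The cow says moo."))   # ['cow', 'say', 'moo']
```

With this normalisation "say" and "says" share one postings entry, so a search for either finds documents 1 to 4.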

Logistics
Storage (serialising, transporting, clustering)
Updates
Warming up
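
As one example of the storage point, a postings dict can be serialised to disk and read back. This sketch assumes JSON is an acceptable format and uses a temporary directory; the helper names `save_index` and `load_index` are hypothetical.

```python
import json
import os
import tempfile

postings = {
    "sheep": ["doc 2"],
    "says": ["doc 1", "doc 2", "doc 4"],
}

def save_index(postings, path):
    """Write the postings dict to path as JSON."""
    with open(path, "w") as fh:
        json.dump(postings, fh, sort_keys=True)

def load_index(path):
    """Read a postings dict back from a JSON file."""
    with open(path) as fh:
        return json.load(fh)

path = os.path.join(tempfile.mkdtemp(), "postings.json")
save_index(postings, path)
restored = load_index(path)
print(restored == postings)  # True
```

Loading the file at start-up is also the simplest form of warming up: the index is in memory before the first query arrives.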

ranking
Density (tf–idf)
Position
Date
Relationships
Feedback
Editorial
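
The density signal can be illustrated with one common tf-idf variant: term frequency as a share of the document's tokens, multiplied by the natural log of (number of documents / documents containing the term). Other weightings exist; this sketch and the name `tf_idf` are assumptions, not the slides' own formula.

```python
import math
import re

word = re.compile(r"\s|[^a-z-]", re.I)

docs = {
    "doc 1": "The cow says moo.",
    "doc 2": "The sheep says baa.",
    "doc 3": "The dogs say woof.",
    "doc 4": "The dog-cow says moof.",
}

def tokenise(text):
    return [t.lower() for t in word.split(text) if t]

def tf_idf(term, doc_name, docs):
    """Score term for one document: (count / length) * log(N / doc frequency)."""
    tokens = tokenise(docs[doc_name])
    tf = tokens.count(term) / float(len(tokens))
    containing = sum(1 for text in docs.values() if term in tokenise(text))
    if containing == 0:
        return 0.0
    return tf * math.log(len(docs) / float(containing))

print(tf_idf("the", "doc 1", docs))            # 0.0, "the" is in every document
print(round(tf_idf("moo", "doc 1", docs), 3))  # 0.347
```

A term that appears everywhere scores zero, while a term unique to one document scores highest, which is the behaviour a ranker wants from a density signal.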

interesting search
Lucene (Hadoop, Solr, Nutch)
OpenFTS / MySQL
Sphinx
Hyper Estraier
Xapian
Other index types
