0
being google
 tom dyson
V.
metadata is easy
language is hard
Our Corpus:
 1. The cow says moo.
2. The sheep says baa.
3. The dogs say woof.
 4. The dog-cow says
         moof.
>>>   doc1   =   quot;The   cow says moo.quot;
>>>   doc2   =   quot;The   sheep says baa.quot;
>>>   doc3   =   quot;The ...
Brute force
>>> docs = [doc1, doc2, doc3, doc4]

>>> def searcher(term):
...   for doc in docs:
...     if doc.find(term) ...
my first inverted index
Tokenising #1
>>> doc1.split()
['The', 'cow', 'says', 'moo.']
Tokenising #2
>>> import re
>>> word = re.compile('W+')
>>> word.split(doc1)
['The', 'cow', 'says', 'moo', '']

>>> doc4 =...
Tokenising #3

>>> word = re.compile('s|[^a-z-]', re.I)
>>> word.split(doc4)
['The', 'dog-cow', 'says', 'moof', '']
Data structures
>>>   doc1   =   {'name':'doc   1',   'content':quot;The   cow says moo.quot;}
>>>   doc2   =   {'name':'d...
Postings
>>> postings = {}

>>> for doc in docs:
...   for token in word.split(doc['content']):
...     if len(token) == 0...
Postings
>>> postings
{'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2',
'doc 4'], 'cow': ['doc 1'], 'moof': ['doc 4'],
'dog-...
O(log n)
>>> def searcher(term):
...   if term in postings:
...     for match in postings[term]:
...       print quot;foun...
More postings
‘sheep’: [‘doc 2’, [2]]
‘says’: [‘doc 1’, [3], ‘doc 2’, [3], ‘doc 4’, [3]]
and more postings
‘sheep’: [‘doc 2’, [‘field’: ‘body’], 2]]
‘google’: [‘intro’, [‘field’: ‘title’], 2]]
tokenising #3
  Punctuation
   Stemming
   Stop words
 Parts of Speech
Entity Extraction
     Markup
Logistics
          Storage
(serialising, transporting,
        clustering)
          Updates
       Warming up
ranking
   Density
    (tf–idf)
   Position
      Date
Relationships
  Feedback
   Editorial
interesting search
      Lucene
(Hadoop, Solr, Nutch)
  OpenFTS / MySQL
       Sphinx
   Hyper Estraier
      Xapian
  Oth...
being google
 tom dyson
Upcoming SlideShare
Loading in...5
×

Being Google

3,365

Published on

The elements of full text search.

Published in: Technology, Education
0 Comments
11 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
3,365
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
203
Comments
0
Likes
11
Embeds 0
No embeds

No notes for slide

Transcript of "Being Google"

  1. 1. being google tom dyson
  2. 2. V.
  3. 3. metadata is easy
  4. 4. language is hard
  5. 5. Our Corpus: 1. The cow says moo. 2. The sheep says baa. 3. The dogs say woof. 4. The dog-cow says moof.
  6. 6. >>> doc1 = quot;The cow says moo.quot; >>> doc2 = quot;The sheep says baa.quot; >>> doc3 = quot;The dogs say woof.quot; >>> doc4 = quot;The dog-cow says moof.quot;
  7. 7. Brute force >>> docs = [doc1, doc2, doc3, doc4] >>> def searcher(term): ... for doc in docs: ... if doc.find(term) > -1: ... print quot;found '%s' in '%s'quot; % (term, doc) ... >>> searcher('moo') found 'moo' in 'The cow says moo.'
  8. 8. my first inverted index
  9. 9. Tokenising #1 >>> doc1.split() ['The', 'cow', 'says', 'moo.']
  10. 10. Tokenising #2 >>> import re >>> word = re.compile('W+') >>> word.split(doc1) ['The', 'cow', 'says', 'moo', ''] >>> doc4 = quot;The dog-cow says moofquot; >>> word.split(doc4) ['The', 'dog', 'cow', 'says', 'moof']
  11. 11. Tokenising #3 >>> word = re.compile('s|[^a-z-]', re.I) >>> word.split(doc4) ['The', 'dog-cow', 'says', 'moof', '']
  12. 12. Data structures >>> doc1 = {'name':'doc 1', 'content':quot;The cow says moo.quot;} >>> doc2 = {'name':'doc 2', 'content':quot;The sheep says baa.quot;} >>> doc3 = {'name':'doc 3', 'content':quot;The dogs say woof.quot;} >>> doc4 = {'name':'doc 4', 'content':quot;The dog-cow says moof.quot;}
  13. 13. Postings >>> postings = {} >>> for doc in docs: ... for token in word.split(doc['content']): ... if len(token) == 0: break ... doc_name = doc['name'] ... if token not in postings: ... postings[token.lower()] = [doc_name] ... else: ... postings[token.lower()].append(doc_name)
  14. 14. Postings >>> postings {'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2', 'doc 4'], 'cow': ['doc 1'], 'moof': ['doc 4'], 'dog-cow': ['doc 4'], 'woof': ['doc 3'], 'say': ['doc 3'], 'moo': ['doc 1'], 'baa': ['doc 2'], 'The': ['doc 1', 'doc 2', 'doc 3', 'doc 4'], 'dogs': ['doc 3']}
  15. 15. O(log n) >>> def searcher(term): ... if term in postings: ... for match in postings[term]: ... print quot;found '%s' in '%s'quot; % (term, match) ... >>> searcher('says') found 'says' in 'doc 1' found 'says' in 'doc 2' found 'says' in 'doc 4'
  16. 16. More postings ‘sheep’: [‘doc 2’, [2]] ‘says’: [‘doc 1’, [3], ‘doc 2’, [3], ‘doc 4’, [3]]
  17. 17. and more postings ‘sheep’: [‘doc 2’, [‘field’: ‘body’], 2]] ‘google’: [‘intro’, [‘field’: ‘title’], 2]]
  18. 18. tokenising #3 Punctuation Stemming Stop words Parts of Speech Entity Extraction Markup
  19. 19. Logistics Storage (serialising, transporting, clustering) Updates Warming up
  20. 20. ranking Density (tf–idf) Position Date Relationships Feedback Editorial
  21. 21. interesting search Lucene (Hadoop, Solr, Nutch) OpenFTS / MySQL Sphinx Hyper Estraier Xapian Other index types
  22. 22. being google tom dyson
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×