Slideshare.net (beta)

 

Being Google

From tomdyson, 1 year ago

The elements of full text search.

2028 views  |  0 comments  |  9 favorites  |  156 downloads
 

Slideshow transcript

Slide 1: being google tom dyson

Slide 2: V.

Slide 3: metadata is easy

Slide 4: language is hard

Slide 5: Our Corpus:
  1. The cow says moo.
  2. The sheep says baa.
  3. The dogs say woof.
  4. The dog-cow says moof.

Slide 6:
    >>> doc1 = "The cow says moo."
    >>> doc2 = "The sheep says baa."
    >>> doc3 = "The dogs say woof."
    >>> doc4 = "The dog-cow says moof."

Slide 7: Brute force
    >>> docs = [doc1, doc2, doc3, doc4]
    >>> def searcher(term):
    ...     for doc in docs:
    ...         if doc.find(term) > -1:
    ...             print("found '%s' in '%s'" % (term, doc))
    ...
    >>> searcher('moo')
    found 'moo' in 'The cow says moo.'
    found 'moo' in 'The dog-cow says moof.'

Slide 8: my first inverted index

Slide 9: Tokenising #1
    >>> doc1.split()
    ['The', 'cow', 'says', 'moo.']

Slide 10: Tokenising #2
    >>> import re
    >>> word = re.compile(r'\W+')
    >>> word.split(doc1)
    ['The', 'cow', 'says', 'moo', '']
    >>> doc4 = "The dog-cow says moof"
    >>> word.split(doc4)
    ['The', 'dog', 'cow', 'says', 'moof']

Slide 11: Tokenising #3
    >>> word = re.compile(r'\s|[^a-z-]', re.I)
    >>> doc4 = "The dog-cow says moof."
    >>> word.split(doc4)
    ['The', 'dog-cow', 'says', 'moof', '']
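
The three tokenising attempts above can be folded into one helper. This is a minimal sketch, assuming Python 3; `tokenise` and `TOKEN` are hypothetical names that do not appear in the slides. It matches tokens with `re.findall` instead of splitting on separators, so no empty strings are produced, and it lower-cases every token.

```python
import re

# Runs of letters, optionally joined by internal hyphens (e.g. "dog-cow").
TOKEN = re.compile(r"[a-z]+(?:-[a-z]+)*", re.I)


def tokenise(text):
    """Return the lower-cased word tokens found in text."""
    return [token.lower() for token in TOKEN.findall(text)]


print(tokenise("The dog-cow says moof."))
# ['the', 'dog-cow', 'says', 'moof']
```

Matching rather than splitting is a design choice: the trailing full stop simply never matches, so there is nothing to filter out afterwards.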

Slide 12: Data structures
    >>> doc1 = {'name': 'doc 1', 'content': "The cow says moo."}
    >>> doc2 = {'name': 'doc 2', 'content': "The sheep says baa."}
    >>> doc3 = {'name': 'doc 3', 'content': "The dogs say woof."}
    >>> doc4 = {'name': 'doc 4', 'content': "The dog-cow says moof."}

Slide 13: Postings
    >>> docs = [doc1, doc2, doc3, doc4]
    >>> postings = {}
    >>> for doc in docs:
    ...     doc_name = doc['name']
    ...     for token in word.split(doc['content']):
    ...         if len(token) == 0: continue
    ...         token = token.lower()  # normalise before the lookup
    ...         if token not in postings:
    ...             postings[token] = [doc_name]
    ...         else:
    ...             postings[token].append(doc_name)
    ...

Slide 14: Postings
    >>> postings
    {'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2', 'doc 4'], 'cow': ['doc 1'],
     'moof': ['doc 4'], 'dog-cow': ['doc 4'], 'woof': ['doc 3'], 'say': ['doc 3'],
     'moo': ['doc 1'], 'baa': ['doc 2'], 'the': ['doc 1', 'doc 2', 'doc 3', 'doc 4'],
     'dogs': ['doc 3']}
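
The postings dictionary above can also be built with `collections.defaultdict`, which removes the "is the key already there?" branch. This is a sketch under assumptions, not the slides' code: it targets Python 3, `build_postings` and `TOKEN` are hypothetical names, and unlike the slides it lists a document only once per term even if the term repeats.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z]+(?:-[a-z]+)*", re.I)

docs = [
    {"name": "doc 1", "content": "The cow says moo."},
    {"name": "doc 2", "content": "The sheep says baa."},
    {"name": "doc 3", "content": "The dogs say woof."},
    {"name": "doc 4", "content": "The dog-cow says moof."},
]


def build_postings(docs):
    """Map each lower-cased term to the names of the documents containing it."""
    postings = defaultdict(list)
    for doc in docs:
        # A set, so each document appears at most once per term.
        terms = {token.lower() for token in TOKEN.findall(doc["content"])}
        for term in terms:
            postings[term].append(doc["name"])
    return dict(postings)


postings = build_postings(docs)
print(postings["says"])
# ['doc 1', 'doc 2', 'doc 4']
```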

Slide 15: O(log n)
    >>> def searcher(term):
    ...     if term in postings:
    ...         for match in postings[term]:
    ...             print("found '%s' in '%s'" % (term, match))
    ...
    >>> searcher('says')
    found 'says' in 'doc 1'
    found 'says' in 'doc 2'
    found 'says' in 'doc 4'

Slide 16: More postings
    'sheep': ['doc 2', [2]]
    'says': ['doc 1', [3], 'doc 2', [3], 'doc 4', [3]]
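
The bracketed numbers above read as 1-based word positions, which is what makes phrase queries possible. The sketch below is one way to build and use such a positional index in Python 3. The nested-dictionary layout and the names `build_positional` and `phrase_search` are assumptions for illustration, not the structure shown on the slide.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z]+(?:-[a-z]+)*", re.I)

docs = {
    "doc 1": "The cow says moo.",
    "doc 2": "The sheep says baa.",
    "doc 3": "The dogs say woof.",
    "doc 4": "The dog-cow says moof.",
}


def build_positional(docs):
    """Map term -> {document name: [1-based positions of the term]}."""
    index = defaultdict(dict)
    for name, content in docs.items():
        tokens = TOKEN.findall(content.lower())
        for position, token in enumerate(tokens, start=1):
            index[token].setdefault(name, []).append(position)
    return index


def phrase_search(index, first, second):
    """Return names of documents where `second` directly follows `first`."""
    hits = []
    for name, positions in index.get(first, {}).items():
        following = index.get(second, {}).get(name, [])
        if any(position + 1 in following for position in positions):
            hits.append(name)
    return hits


index = build_positional(docs)
print(index["sheep"])                          # {'doc 2': [2]}
print(phrase_search(index, "sheep", "says"))   # ['doc 2']
```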

Slide 17: and more postings
    'sheep': ['doc 2', {'field': 'body'}, [2]]
    'google': ['intro', {'field': 'title'}, [2]]
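
Recording the field alongside each posting lets a search be restricted to, say, titles only. The following is a rough sketch in Python 3. The two sample documents, the tuple layout `(name, metadata, position)` and the function names are all assumptions made for the example, chosen so that the entries resemble the ones on the slide.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z]+(?:-[a-z]+)*", re.I)

# Hypothetical two-field documents; not taken from the slides.
docs = [
    {"name": "intro", "title": "Being Google",
     "body": "The elements of full text search."},
    {"name": "doc 2", "title": "Farm sounds",
     "body": "The sheep says baa."},
]


def build_field_postings(docs, fields=("title", "body")):
    """Map term -> list of (document name, {'field': ...}, 1-based position)."""
    postings = defaultdict(list)
    for doc in docs:
        for field in fields:
            tokens = TOKEN.findall(doc[field].lower())
            for position, token in enumerate(tokens, start=1):
                postings[token].append((doc["name"], {"field": field}, position))
    return postings


def search(postings, term, field=None):
    """Return matching document names, optionally limited to one field."""
    return [name for name, meta, _ in postings.get(term, [])
            if field is None or meta["field"] == field]


postings = build_field_postings(docs)
print(postings["google"])   # [('intro', {'field': 'title'}, 2)]
print(search(postings, "sheep", field="title"))   # []
```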

Slide 18: tokenising #3
  Punctuation
  Stemming
  Stop words
  Parts of Speech
  Entity Extraction
  Markup
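
Three of the concerns listed above (punctuation, stop words and stemming) can be illustrated with a small normalisation step. This is a toy sketch in Python 3: the stop-word list is an assumed, deliberately tiny one, and `naive_stem` is a hypothetical one-rule stemmer that only strips a trailing "s". It is not the Porter algorithm or any library's stemmer. Parts of speech, entity extraction and markup handling are not covered.

```python
import re

TOKEN = re.compile(r"[a-z]+(?:-[a-z]+)*", re.I)

# Assumed stop-word list, kept tiny for the example.
STOP_WORDS = {"a", "an", "and", "of", "the"}


def naive_stem(token):
    """Toy stemmer: strip a single trailing 's' from longer words."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalise(text):
    """Tokenise, lower-case, drop stop words, then stem what is left."""
    tokens = (token.lower() for token in TOKEN.findall(text))
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


print(normalise("The dogs say woof."))   # ['dog', 'say', 'woof']
print(normalise("The cow says moo."))    # ['cow', 'say', 'moo']
```

With this normalisation "says" and "say" collapse to one term, so a query for either form would reach documents 1 to 4 of the corpus.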

Slide 19: Logistics
  Storage (serialising, transporting, clustering)
  Updates
  Warming up
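
As one illustration of the serialising point above, a small postings dictionary can be written to disk and read back with the standard `json` module. This is only a sketch, assuming Python 3 and an index that fits in memory; `save_index`, `load_index` and the file name are hypothetical, and it says nothing about transport, clustering or incremental updates.

```python
import json
import os
import tempfile

postings = {
    "says": ["doc 1", "doc 2", "doc 4"],
    "sheep": ["doc 2"],
}


def save_index(postings, path):
    """Write the postings dictionary to path as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(postings, handle, sort_keys=True)


def load_index(path):
    """Read a postings dictionary previously written by save_index."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Round trip through a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "postings.json")
save_index(postings, path)
restored = load_index(path)
print(restored == postings)   # True
```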

Slide 20: ranking
  Density (tf–idf)
  Position
  Date
  Relationships
  Feedback
  Editorial
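
The density signal above (tf–idf) can be sketched for the four-document corpus. The version below is one common formulation, assumed here rather than taken from the slides: term frequency is the term's share of the document's tokens, and inverse document frequency is the natural log of (number of documents / number of documents containing the term). The function names are hypothetical, and the code targets Python 3.

```python
import math
import re

TOKEN = re.compile(r"[a-z]+(?:-[a-z]+)*", re.I)

docs = {
    "doc 1": "The cow says moo.",
    "doc 2": "The sheep says baa.",
    "doc 3": "The dogs say woof.",
    "doc 4": "The dog-cow says moof.",
}


def tokenise(text):
    return [token.lower() for token in TOKEN.findall(text)]


def tf_idf(term, docs):
    """Return {document name: tf-idf score} for documents containing term."""
    tokenised = {name: tokenise(text) for name, text in docs.items()}
    containing = [name for name, tokens in tokenised.items() if term in tokens]
    if not containing:
        return {}
    idf = math.log(len(docs) / len(containing))
    return {
        name: tokenised[name].count(term) / len(tokenised[name]) * idf
        for name in containing
    }


def rank(term, docs):
    """Document names ordered from highest to lowest score."""
    scores = tf_idf(term, docs)
    return sorted(scores, key=scores.get, reverse=True)


print(rank("moo", docs))      # ['doc 1']
print(tf_idf("the", docs))    # every score is 0.0: "the" is in all documents
```

A term that appears everywhere, such as "the", scores zero, while a rare term such as "moo" scores higher than the more widespread "says".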

Slide 21: interesting search
  Lucene (Hadoop, Solr, Nutch)
  OpenFTS / MySQL
  Sphinx
  Hyper Estraier
  Xapian
  Other index types

Slide 22: being google tom dyson