Being Google


    Being Google - Presentation Transcript

    1. being google tom dyson
    2. V.
    3. metadata is easy
    4. language is hard
    5. Our Corpus:
        1. The cow says moo.
        2. The sheep says baa.
        3. The dogs say woof.
        4. The dog-cow says moof.
    6. >>> doc1 = "The cow says moo."
       >>> doc2 = "The sheep says baa."
       >>> doc3 = "The dogs say woof."
       >>> doc4 = "The dog-cow says moof."
    7. Brute force
       >>> docs = [doc1, doc2, doc3, doc4]
       >>> def searcher(term):
       ...     for doc in docs:
       ...         # substring match, so 'moo' also matches inside 'moof'
       ...         if doc.find(term) > -1:
       ...             print("found '%s' in '%s'" % (term, doc))
       ...
       >>> searcher('moo')
       found 'moo' in 'The cow says moo.'
       found 'moo' in 'The dog-cow says moof.'
    8. my first inverted index
    9. Tokenising #1
       >>> doc1.split()
       ['The', 'cow', 'says', 'moo.']
    10. Tokenising #2
        >>> import re
        >>> word = re.compile(r'\W+')
        >>> word.split(doc1)
        ['The', 'cow', 'says', 'moo', '']
        >>> doc4 = "The dog-cow says moof"
        >>> word.split(doc4)
        ['The', 'dog', 'cow', 'says', 'moof']
    11. Tokenising #3
        >>> word = re.compile(r'\s|[^a-z-]', re.I)
        >>> doc4 = "The dog-cow says moof."
        >>> word.split(doc4)
        ['The', 'dog-cow', 'says', 'moof', '']
    12. Data structures
        >>> doc1 = {'name':'doc 1', 'content':"The cow says moo."}
        >>> doc2 = {'name':'doc 2', 'content':"The sheep says baa."}
        >>> doc3 = {'name':'doc 3', 'content':"The dogs say woof."}
        >>> doc4 = {'name':'doc 4', 'content':"The dog-cow says moof."}
        >>> docs = [doc1, doc2, doc3, doc4]
    13. Postings
        >>> postings = {}
        >>> for doc in docs:
        ...     for token in word.split(doc['content']):
        ...         if len(token) == 0: continue  # skip empty strings left by split
        ...         token = token.lower()         # normalise before the lookup
        ...         doc_name = doc['name']
        ...         if token not in postings:
        ...             postings[token] = [doc_name]
        ...         else:
        ...             postings[token].append(doc_name)
    14. Postings
        >>> postings
        {'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2', 'doc 4'], 'cow': ['doc 1'],
         'moof': ['doc 4'], 'dog-cow': ['doc 4'], 'woof': ['doc 3'], 'say': ['doc 3'],
         'moo': ['doc 1'], 'baa': ['doc 2'], 'the': ['doc 1', 'doc 2', 'doc 3', 'doc 4'],
         'dogs': ['doc 3']}
    15. O(log n) or better (a Python dict lookup is O(1) on average)
        >>> def searcher(term):
        ...     if term in postings:
        ...         for match in postings[term]:
        ...             print("found '%s' in '%s'" % (term, match))
        ...
        >>> searcher('says')
        found 'says' in 'doc 1'
        found 'says' in 'doc 2'
        found 'says' in 'doc 4'
    16. More postings
        'sheep': ['doc 2', [2]]
        'says': ['doc 1', [3], 'doc 2', [3], 'doc 4', [3]]
    17. and more postings
        'sheep': ['doc 2', {'field': 'body'}, [2]]
        'google': ['intro', {'field': 'title'}, [2]]
    18. tokenising #3: punctuation; stemming; stop words; parts of speech; entity extraction; markup
    19. Logistics: storage (serialising, transporting, clustering); updates; warming up
    20. ranking: density (tf–idf); position; date; relationships; feedback; editorial
    21. interesting search: Lucene (Hadoop, Solr, Nutch); OpenFTS / MySQL; Sphinx; Hyper Estraier; Xapian; other index types
    22. being google tom dyson
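
The postings walkthrough in slides 12 to 15 can be gathered into one short Python 3 sketch. This is not the author's code: the function names `tokenise`, `build_postings` and `search` are hypothetical, and the regular expression is a slight variant of the "Tokenising #3" pattern that also collapses runs of separators.

```python
import re
from collections import defaultdict

# Variant of the "Tokenising #3" rule: split on anything that is not a
# letter or a hyphen. The trailing + collapses runs of separators.
WORD = re.compile(r"[^a-z-]+", re.I)

DOCS = [
    {"name": "doc 1", "content": "The cow says moo."},
    {"name": "doc 2", "content": "The sheep says baa."},
    {"name": "doc 3", "content": "The dogs say woof."},
    {"name": "doc 4", "content": "The dog-cow says moof."},
]


def tokenise(text):
    """Return lowercased tokens, dropping the empty strings split() leaves."""
    return [token.lower() for token in WORD.split(text) if token]


def build_postings(docs):
    """Map each term to the list of document names that contain it."""
    postings = defaultdict(list)
    for doc in docs:
        for token in tokenise(doc["content"]):
            # Record each document once per term, even if the term repeats.
            if doc["name"] not in postings[token]:
                postings[token].append(doc["name"])
    return dict(postings)


def search(postings, term):
    """Look the term up directly instead of scanning every document."""
    return postings.get(term.lower(), [])


postings = build_postings(DOCS)
print(search(postings, "says"))
print(search(postings, "moo"))
```

Unlike the brute-force substring search on slide 7, a lookup for `moo` here matches only whole tokens, so the document containing `moof` is not returned.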
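
Slides 16 and 17 attach a position to each posting. One possible way to build such positional postings, and to use them for a two-word phrase query, is sketched below in Python 3. The nested-dict layout and the names `build_positional_postings` and `phrase_search` are assumptions made for illustration, not the structure shown on the slides.

```python
import re

# Split on anything that is not a letter or a hyphen (assumed tokenising rule).
WORD = re.compile(r"[^a-z-]+", re.I)


def build_positional_postings(docs):
    """Map term -> {document name -> [1-based token positions]}."""
    postings = {}
    for name, content in docs.items():
        tokens = [t.lower() for t in WORD.split(content) if t]
        for position, token in enumerate(tokens, start=1):
            postings.setdefault(token, {}).setdefault(name, []).append(position)
    return postings


def phrase_search(postings, first, second):
    """Return documents where `second` appears immediately after `first`."""
    hits = []
    for name, positions in postings.get(first, {}).items():
        following = postings.get(second, {}).get(name, [])
        if any(p + 1 in following for p in positions):
            hits.append(name)
    return hits


docs = {
    "doc 1": "The cow says moo.",
    "doc 2": "The sheep says baa.",
    "doc 3": "The dogs say woof.",
    "doc 4": "The dog-cow says moof.",
}

postings = build_positional_postings(docs)
print(postings["sheep"])                       # {'doc 2': [2]}
print(phrase_search(postings, "says", "moo"))  # ['doc 1']
```

Storing positions costs more space than plain document lists, but it lets the index answer phrase and proximity queries without rereading the documents.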
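
Slide 20 lists density (tf–idf) as a ranking signal. The sketch below shows one common formulation in Python 3, assuming tf is the term's share of a document's tokens and idf is the natural log of (number of documents / number of documents containing the term). The slides do not say which variant the author had in mind, and `tf_idf_scores` is a hypothetical name.

```python
import math
import re
from collections import Counter

# Split on anything that is not a letter or a hyphen (assumed tokenising rule).
WORD = re.compile(r"[^a-z-]+", re.I)


def tokenise(text):
    """Return lowercased tokens, dropping empty strings."""
    return [t.lower() for t in WORD.split(text) if t]


def tf_idf_scores(docs, term):
    """Return (document name, score) pairs for `term`, best match first."""
    term = term.lower()
    tokenised = {name: tokenise(content) for name, content in docs.items()}
    # Document frequency: how many documents contain the term at all.
    df = sum(1 for tokens in tokenised.values() if term in tokens)
    if df == 0:
        return []
    idf = math.log(len(docs) / df)
    scores = []
    for name, tokens in tokenised.items():
        tf = Counter(tokens)[term] / len(tokens)
        if tf > 0:
            scores.append((name, tf * idf))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


docs = {
    "doc 1": "The cow says moo.",
    "doc 2": "The sheep says baa.",
    "doc 3": "The dogs say woof.",
    "doc 4": "The dog-cow says moof.",
}

print(tf_idf_scores(docs, "cow"))
print(tf_idf_scores(docs, "the"))
```

With this corpus, `the` occurs in every document, so its idf is log(1) = 0 and every score for it is zero. That is the effect tf–idf is meant to have on very common words.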

    Tom Dyson, 3 years ago


    The elements of full text search.


    © All Rights Reserved
