being google
 tom dyson
V.
metadata is easy
language is hard
Our Corpus:
1. The cow says moo.
2. The sheep says baa.
3. The dogs say woof.
4. The dog-cow says moof.
>>> doc1 = "The cow says moo."
>>> doc2 = "The sheep says baa."
>>> doc3 = "The dogs say woof."
>>> doc4 = "The dog-cow says moof."
Brute force
>>> docs = [doc1, doc2, doc3, doc4]

>>> def searcher(term):
...   for doc in docs:
...     if doc.find(term) > -1:
...       print("found '%s' in '%s'" % (term, doc))
...
>>> searcher('moo')
found 'moo' in 'The cow says moo.'
my first inverted index
Tokenising #1
>>> doc1.split()
['The', 'cow', 'says', 'moo.']
Tokenising #2
>>> import re
>>> word = re.compile(r'\W+')
>>> word.split(doc1)
['The', 'cow', 'says', 'moo', '']

>>> doc4 = "The dog-cow says moof"
>>> word.split(doc4)
['The', 'dog', 'cow', 'says', 'moof']
Tokenising #3

>>> word = re.compile(r'\s|[^a-z-]', re.I)
>>> word.split("The dog-cow says moof.")
['The', 'dog-cow', 'says', 'moof', '']
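A minimal sketch of the third tokeniser wrapped as a reusable function, assuming we also want lowercased tokens and no empty strings. The names `tokenise` and `WORD_BOUNDARY` are illustrative and not from the slides.

```python
import re

# Same pattern as Tokenising #3: split on whitespace or on any
# character that is neither a letter nor a hyphen.
WORD_BOUNDARY = re.compile(r"\s|[^a-z-]", re.I)

def tokenise(text):
    """Return lowercased tokens, dropping the empty strings re.split leaves behind."""
    return [token.lower() for token in WORD_BOUNDARY.split(text) if token]

print(tokenise("The dog-cow says moof."))  # ['the', 'dog-cow', 'says', 'moof']
```

Filtering empty strings here means the indexing loop does not have to special-case them.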
Data structures
>>> doc1 = {'name': 'doc 1', 'content': "The cow says moo."}
>>> doc2 = {'name': 'doc 2', 'content': "The sheep says baa."}
>>> doc3 = {'name': 'doc 3', 'content': "The dogs say woof."}
>>> doc4 = {'name': 'doc 4', 'content': "The dog-cow says moof."}
>>> docs = [doc1, doc2, doc3, doc4]
Postings
>>> postings = {}

>>> for doc in docs:
...   doc_name = doc['name']
...   for token in word.split(doc['content']):
...     if len(token) == 0: continue   # skip empty strings left by the split
...     token = token.lower()          # normalise before the membership test
...     if token not in postings:
...       postings[token] = [doc_name]
...     else:
...       postings[token].append(doc_name)
Postings
>>> postings
{'sheep': ['doc 2'], 'says': ['doc 1', 'doc 2',
'doc 4'], 'cow': ['doc 1'], 'moof': ['doc 4'],
'dog-cow': ['doc 4'], 'woof': ['doc 3'], 'say':
['doc 3'], 'moo': ['doc 1'], 'baa': ['doc 2'],
'the': ['doc 1', 'doc 2', 'doc 3', 'doc 4'],
'dogs': ['doc 3']}
O(log n)
>>> def searcher(term):
...   if term in postings:
...     for match in postings[term]:
...       print("found '%s' in '%s'" % (term, match))
...
>>> searcher('says')
found 'says' in 'doc 1'
found 'says' in 'doc 2'
found 'says' in 'doc 4'
More postings
'sheep': ['doc 2', [2]]
'says': ['doc 1', [3], 'doc 2', [3], 'doc 4', [3]]
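One way to build postings that carry positions, as a rough sketch. It assumes 1-based token positions (to match the `[2]` and `[3]` above) and stores them in a nested dict rather than the flat list shown on the slide. `build_positional_index` is a hypothetical name.

```python
import re
from collections import defaultdict

WORD_BOUNDARY = re.compile(r"\s|[^a-z-]", re.I)

docs = [
    {"name": "doc 1", "content": "The cow says moo."},
    {"name": "doc 2", "content": "The sheep says baa."},
    {"name": "doc 3", "content": "The dogs say woof."},
    {"name": "doc 4", "content": "The dog-cow says moof."},
]

def build_positional_index(docs):
    """Map each term to {doc name: [1-based token positions]}."""
    index = defaultdict(dict)
    for doc in docs:
        tokens = [t.lower() for t in WORD_BOUNDARY.split(doc["content"]) if t]
        for position, token in enumerate(tokens, start=1):
            index[token].setdefault(doc["name"], []).append(position)
    return index

positional = build_positional_index(docs)
print(positional["sheep"])  # {'doc 2': [2]}
print(positional["says"])   # {'doc 1': [3], 'doc 2': [3], 'doc 4': [3]}
```

Keeping positions is what later makes phrase and proximity matching possible.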
and more postings
'sheep': ['doc 2', {'field': 'body'}, [2]]
'google': ['intro', {'field': 'title'}, [2]]
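A sketch of field-aware postings, under the assumption that each document is a dict with `title` and `body` fields. The two documents below are made up for illustration, and `build_field_index` and `search` are hypothetical names.

```python
import re
from collections import defaultdict

WORD_BOUNDARY = re.compile(r"\s|[^a-z-]", re.I)

# Hypothetical two-field documents; the contents are invented for this example.
docs = [
    {"name": "intro", "title": "Being Google", "body": "Search is hard."},
    {"name": "doc 2", "title": "Farm sounds", "body": "The sheep says baa."},
]

def build_field_index(docs, fields=("title", "body")):
    """Map each term to a list of (doc name, field, 1-based position) tuples."""
    index = defaultdict(list)
    for doc in docs:
        for field in fields:
            tokens = [t.lower() for t in WORD_BOUNDARY.split(doc[field]) if t]
            for position, token in enumerate(tokens, start=1):
                index[token].append((doc["name"], field, position))
    return index

def search(index, term, field=None):
    """Return postings for a term, optionally restricted to one field."""
    return [p for p in index.get(term.lower(), []) if field is None or p[1] == field]

index = build_field_index(docs)
print(search(index, "google", field="title"))  # [('intro', 'title', 2)]
print(search(index, "sheep"))                  # [('doc 2', 'body', 2)]
```

Recording the field lets a ranker weight a title hit above a body hit.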
tokenising #3
Punctuation
Stemming
Stop words
Parts of Speech
Entity Extraction
Markup
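To make two of these concrete, here is a rough sketch of stop-word removal and stemming. The stop-word list is a tiny illustrative one, and `naive_stem` is a deliberately crude suffix stripper, not a real stemming algorithm such as Porter's.

```python
import re

WORD_BOUNDARY = re.compile(r"\s|[^a-z-]", re.I)
STOP_WORDS = {"the", "a", "an", "and", "of"}  # tiny illustrative list

def naive_stem(token):
    """Strip a trailing 's' from longer words. Crude on purpose; not a real stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def analyse(text):
    """Tokenise, drop stop words, then stem what is left."""
    tokens = [t.lower() for t in WORD_BOUNDARY.split(text) if t]
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(analyse("The dogs say woof."))      # ['dog', 'say', 'woof']
print(analyse("The dog-cow says moof."))  # ['dog-cow', 'say', 'moof']
```

With this in place a search for "say" would match documents 1 to 4, since "says" and "say" collapse to the same term.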
Logistics
Storage (serialising, transporting, clustering)
Updates
Warming up
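For the serialising part, a minimal sketch that writes the postings to disk as JSON and reads them back. JSON is an assumption made for readability; the slides do not name a storage format, and `save_index` and `load_index` are hypothetical helpers.

```python
import json
import os
import tempfile

postings = {"says": ["doc 1", "doc 2", "doc 4"], "moo": ["doc 1"]}

def save_index(index, path):
    """Write the index to disk as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(index, handle, sort_keys=True)

def load_index(path):
    """Read an index previously written by save_index."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

path = os.path.join(tempfile.mkdtemp(), "postings.json")
save_index(postings, path)
restored = load_index(path)
print(restored == postings)  # True
```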
ranking
Density (tf–idf)
Position
Date
Relationships
Feedback
Editorial
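A sketch of the density signal using one common tf–idf variant: term frequency as count divided by document length, and idf as the natural log of (number of documents / documents containing the term). Other weightings exist; this choice and the function names are assumptions, not taken from the slides.

```python
import math
import re
from collections import Counter

WORD_BOUNDARY = re.compile(r"\s|[^a-z-]", re.I)

docs = {
    "doc 1": "The cow says moo.",
    "doc 2": "The sheep says baa.",
    "doc 3": "The dogs say woof.",
    "doc 4": "The dog-cow says moof.",
}

def tokenise(text):
    return [t.lower() for t in WORD_BOUNDARY.split(text) if t]

def tf_idf(term, doc_name, docs):
    """Score one term against one document: (count / length) * ln(N / df)."""
    tokens = {name: tokenise(text) for name, text in docs.items()}
    tf = Counter(tokens[doc_name])[term] / len(tokens[doc_name])
    df = sum(1 for doc_tokens in tokens.values() if term in doc_tokens)
    if df == 0:
        return 0.0  # term appears nowhere in the corpus
    return tf * math.log(len(docs) / df)

print(tf_idf("the", "doc 1", docs))  # 0.0, because "the" is in every document
print(tf_idf("moo", "doc 1", docs) > tf_idf("says", "doc 1", docs))  # True
```

Rare terms such as "moo" score higher than common ones such as "says", and a term that appears everywhere contributes nothing.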
interesting search
Lucene (Hadoop, Solr, Nutch)
OpenFTS / MySQL
Sphinx
Hyper Estraier
Xapian
Other index types
