Searching: Needles and Haystacks

Searching for stuff

Why it's important

How it's done

Technical difficulties

Social difficulties
Search and Hugh
Involved in this since about 1982, mainly EEC
projects especially Celex
I currently work for Vienna U on a multimedia
database for stored manuscripts etc.
Computing industry since about 1974
Hugh.barnard@gmail.com
As a Human Activity

Looking for keys

Remembering names and birthdays

Looking up in a book

And [the subject of this] making tools for the
intertubes

Getting a clue, from the above...
How it's Done: 1

Health warning: this explanation is simplified!

Let's take Google

How does it find one zillion documents/images
with 'lolcat' in them, within a few seconds?
How it's Done:2

It did it already

A key concept: Indexing
Another key concept: Inverted index [see
wikipedia]:
https://en.wikipedia.org/wiki/Inverted_index
- lolcat in document x at position y
- highlighted cat in document x at position y [?]
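
A minimal sketch of the "token in document x at position y" idea, assuming plain whitespace tokenisation and two made-up documents named "x" and "z" (neither comes from a real engine):

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each token to {doc_id: [positions]} (word offsets, starting at 0)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(position)
    return index


# Hypothetical documents, for illustration only.
docs = {
    "x": "my lolcat is a happy lolcat",
    "z": "no cats here",
}

index = build_positional_index(docs)
print(index["lolcat"])  # {'x': [1, 5]}
```

The work is done once, at indexing time; a query is then only a dictionary lookup, which is why "it did it already".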
Why do this at all?

Since Google, Bing, Yahoo already did it...

Lots of interesting technical pieces

Self education

Fun and profit, do it 'better' [?]

Internal search engines, intranet search
engines

Domain specific engines

School or research projects
Parts of the Search Engine
- Spider or Harvester [?]
- Parser/Indexer
- Index Storage
- Retrieval [the bit of Google that we see!]
I'm going to go through these in order...
Spider or Harvester

Go and get a load of stuff from the web
Think of it as a programmatic super-surfer, use
curl: http://curl.haxx.se/ to try, for example

Actually there are tons of ready-made ones:
http://search.cpan.org/~johnd/WWW-Crawler-Lite-0.005/lib/WWW/Crawler/Lite.pm

Be polite, user-agent name, robots.txt,
throttling etc. [?]

Can you think of some of the problems?
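
A sketch of the "be polite" bullet, using the standard library's urllib.robotparser. The user-agent name, the robots.txt content and the example.org URLs are all invented here; a real spider would download robots.txt from the site rather than embed it:

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleSpider/0.1"  # hypothetical user-agent name

# Hypothetical robots.txt, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

policy = RobotFileParser()
policy.parse(ROBOTS_TXT.splitlines())

# Fall back to one second between requests if the site states no delay.
delay = policy.crawl_delay(USER_AGENT) or 1


def polite_urls(urls, sleep=time.sleep):
    """Yield only the URLs robots.txt permits, pausing after each (throttling)."""
    for url in urls:
        if policy.can_fetch(USER_AGENT, url):
            yield url
            sleep(delay)
```

For example, list(polite_urls(["http://example.org/index.html", "http://example.org/private/x"])) keeps only the first URL and waits two seconds after it.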
Spidering/Harvesting/Darkweb
Spidering: start at top, follow links
Directory based: index a load of things in a given
directory
Harvesting: academic harvester interfaces,
domain specific, I do this at present
Robots.txt: courtesy, the interactive web and the
darkweb
Problems
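
"Start at top, follow links" can be sketched as a breadth-first walk. The four pages below are an invented in-memory site standing in for real HTTP fetches; the `seen` set deals with one of the problems, pages that link back to each other:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical site: URL -> HTML. A real spider would fetch over HTTP instead.
SITE = {
    "http://example.org/": '<a href="/cats.html">cats</a> <a href="dogs.html">dogs</a>',
    "http://example.org/cats.html": '<a href="/">home</a> <a href="/lolcat.html">lolcat</a>',
    "http://example.org/dogs.html": "no links here",
    "http://example.org/lolcat.html": '<a href="/cats.html">back</a>',
}


def crawl(start, fetch=SITE.get):
    """Visit pages breadth-first from `start`; return URLs in visiting order."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page missing or unreachable
            continue
        order.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


print(crawl("http://example.org/"))
```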
Parser/Indexer

Now the fun begins!

Breaking down the stuff you get into tokens

<b>lolcat</b> is indexed as 'lolcat', for
example

Some tagging is preserved as meta-information, <h1>Cat</h1> for example

Parsing document types: text, HTML, PDF etc.

Swish-e: http://www.swish-e.org/ is a small
scale harvester/parser, for example

Some problems/opportunities
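
A rough sketch of the tokenising step for HTML, using the standard library's html.parser. Which tags count as meta-information is an assumption made here (KEPT_TAGS); everything else, <b> included, is thrown away and only the word is kept:

```python
import re
from html.parser import HTMLParser

KEPT_TAGS = {"h1", "title"}  # assumption: tags worth keeping as meta-information


class TokenizingParser(HTMLParser):
    """Collect (token, kept_tag_or_None) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating sloppy nesting.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        context = next((t for t in reversed(self.stack) if t in KEPT_TAGS), None)
        for word in re.findall(r"\w+", data.lower()):
            self.tokens.append((word, context))


def tokenize(html):
    parser = TokenizingParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


print(tokenize("<h1>Cat</h1> my <b>lolcat</b>"))
# [('cat', 'h1'), ('my', None), ('lolcat', None)]
```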
Index

What does it look like?
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
Given these texts, we have the following inverted file index (the integers in the braces are the
indexes (or keys) of the texts T[0], T[1] etc.):
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Problems: stop words !!!
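
The index above can be rebuilt in a few lines. This is only a sketch; the stop-word list is an assumption chosen to show the effect, not a standard list:

```python
T = ["it is what it is", "what is it", "it is a banana"]


def build_index(texts, stop_words=frozenset()):
    """Map each word to the set of text indexes it occurs in."""
    index = {}
    for doc_id, text in enumerate(texts):
        for word in text.split():
            if word not in stop_words:
                index.setdefault(word, set()).add(doc_id)
    return index


index = build_index(T)
for word in sorted(index):
    print(word, sorted(index[word]))

# Assumed stop words: very common words that match nearly every text.
STOP_WORDS = frozenset({"a", "is", "it"})
filtered = build_index(T, STOP_WORDS)
print(sorted(filtered))  # ['banana', 'what']
```

With stop words dropped the index shrinks, but a search for the phrase "what is it" can no longer be answered from it: that is the problem.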
Storage

This used to be easy, now lots of options

Sparse data: some terms ('lolcats') have
millions of entries, others like 'pyx' [look it up]
won't have many

It's a 'lot' of data: the name Google came about by
misspelling 'googol' (10 to the power 100)

Relational databases are fairly unsuitable

NoSQL and ready-mades:
http://solr-vs-elasticsearch.com/ for example
Retrieval

Here you get results of all this work

Simple, one field, one word

Booleans and implied booleans [lots of words
ANDed together]

Relevant results: this is the main thing, and it
links back to the storage and parsing

Some problems, multilingual, non-latin,
synonyms
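
A sketch of the implied boolean: every query word must match, so the posting lists are intersected. Ranking by how often the words occur is a crude stand-in for relevance, assumed here only to have something to sort by; real engines use far more than this. The three texts are the ones from the Index slide:

```python
from collections import Counter

DOCS = {
    0: "it is what it is",
    1: "what is it",
    2: "it is a banana",
}


def build(docs):
    """Map each word to {doc_id: number of occurrences}."""
    index = {}
    for doc_id, text in docs.items():
        for word, count in Counter(text.split()).items():
            index.setdefault(word, {})[doc_id] = count
    return index


INDEX = build(DOCS)


def search(query):
    """Return ids of documents containing ALL query words, best match first."""
    words = query.lower().split()
    if not words:
        return []
    postings = [INDEX.get(word, {}) for word in words]
    matches = set(postings[0])
    for posting in postings[1:]:
        matches &= set(posting)  # implied AND

    def score(doc_id):
        return sum(posting[doc_id] for posting in postings)

    return sorted(matches, key=lambda doc_id: (-score(doc_id), doc_id))


print(search("what is"))    # [0, 1]
print(search("banana it"))  # [2]
print(search("lolcat it"))  # []
```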
Technical Difficulties

Finding all the documents

Documents that change, appear or disappear

Making the index and looking at it [it's big!]

Non-latin/accented scripts for latin speakers:
appauvrissement

Can you think of others?
Technical Difficulties:2

Looking for André

Looking for 中国 [what's that incidentally?]

Looking for cat [furry] and cat [computer
command]

Speed of index refresh [days]

Storage and computation

Semantic search
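
One common approach to the André problem, sketched with the standard library's unicodedata: fold accents and case before indexing and before querying, so both sides agree. The function name `fold` is made up for this sketch:

```python
import unicodedata


def fold(text):
    """Strip combining accents and case so 'André' and 'andre' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(fold("André"))  # andre
print(fold("中国"))    # 中国
```

Folding does nothing for 中国: there are no accents to strip and no spaces to split on, so whitespace tokenising treats the whole string as one token. Non-Latin scripts need their own tokenisers.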
Social Difficulties

Right to be forgotten

Security services and data mining

Privacy and doxing, see visual tagging too

Linking the unlinked

Automatic visual tagging [facebook]

Automatic geolocation [most smartphones]

Any more?
Conclusions

It's a central human activity

It's a vital activity for the web

Very simple central idea, but lots of evolution
possible

There's a societal debate to go with the
technical evolution
Thanks!
Thanks for listening and questions!

Inside Searching
