Inside Searching

Searching: Needles and Haystacks

Searching for stuff

Why it's important

How it's done

Technical difficulties

Social difficulties

Search and Hugh
Involved in this since about 1982, mainly EEC
projects especially Celex
I currently work for Vienna U on a multimedia
database for stored manuscripts etc.
Computing industry since about 1974
Hugh.barnard@gmail.com

As a Human Activity

Looking for keys

Remembering names and birthdays

Looking up in a book

And [the subject of this] making tools for the
intertubes

Getting a clue, from the above...

How it's Done: 1

Health warning: this explanation is simplified!

Let's take Google

How does it find one zillion documents/images
with 'lolcat' in them, within a few seconds?

How it's Done: 2

It did it already

A key concept: Indexing
Another key concept: Inverted index [see Wikipedia]:
https://en.wikipedia.org/wiki/Inverted_index
- lolcat in document x at position y
- highlighted cat in document x at position y [?]
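
A minimal sketch of the "word in document x at position y" idea, assuming plain whitespace-separated text. The document ids, the sample sentences and the function name build_positional_index are made up for illustration; a real engine does far more.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {document id: [positions of the word in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


# Hypothetical documents, keyed by a made-up document id.
docs = {
    "x": "my lolcat is a happy lolcat",
    "z": "no cats here",
}

index = build_positional_index(docs)
print(dict(index["lolcat"]))  # {'x': [1, 5]}
```

The work is done once, at indexing time. Answering "where is lolcat?" is then a dictionary lookup, not a scan of every document.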

Why do this at all?

Since Google, Bing, Yahoo already did it...

Lots of interesting technical pieces

Self education

Fun and profit, do it 'better' [?]

Internal search engines, intranet search
engines

Domain specific engines

School or research projects

Parts of the Search Engine
- Spider or Harvester [?]
- Parser/Indexer
- Index Storage
- Retrieval [the bit of Google that we see!]
I'm going to go through these in order...

Spider or Harvester

Go and get a load of stuff from the web
Think of it as a programmatic super-surfer, use
curl: http://curl.haxx.se/ to try, for example

Actually there are tons of ready-made ones:
http://search.cpan.org/~johnd/WWW-Crawler-Lite-0.005/lib/WWW/Crawler/Lite.pm

Be polite, user-agent name, robots.txt,
throttling etc. [?]
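
A rough sketch of the politeness checks, using Python's standard urllib.robotparser. The user-agent name, the robots.txt content and the URLs are invented, and the rules are parsed from a string so that nothing is fetched from the network. A real spider would download /robots.txt from each site and actually sleep between requests.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleSpider/0.1"  # hypothetical spider name

# Hypothetical robots.txt, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_rules(robots_text):
    """Parse robots.txt text into a rule object."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


def polite_delay(rules, default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = rules.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


rules = make_rules(ROBOTS_TXT)
for url in ("http://example.org/index.html", "http://example.org/private/a.html"):
    if rules.can_fetch(USER_AGENT, url):
        print("would fetch", url, "then wait", polite_delay(rules), "seconds")
    else:
        print("skipping", url)
```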

Can you think of some of the problems?

Spidering/Harvesting/Darkweb
Spidering: start at top, follow links
Directory based: index a load of things in a given
directory
Harvesting: academic harvester interfaces,
domain specific, I do this at present
Robots.txt: courtesy, the interactive web and the
darkweb
Problems
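
"Start at top, follow links" can be sketched as a breadth-first walk. This is an illustration under assumptions, not a production spider: FAKE_WEB is a made-up in-memory stand-in for real pages, and crawl and LinkExtractor are hypothetical names. Swapping FAKE_WEB.get for a real fetch function is where the politeness rules and most of the problems come in.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical pages standing in for the real web.
FAKE_WEB = {
    "http://example.org/": '<a href="/cats.html">cats</a> <a href="/dogs.html">dogs</a>',
    "http://example.org/cats.html": '<a href="/">home</a> <a href="/lolcat.html">lolcat</a>',
    "http://example.org/dogs.html": '<a href="/">home</a>',
    "http://example.org/lolcat.html": "no links here",
}


def crawl(start_url, fetch, max_pages=100):
    """Visit pages breadth-first from start_url; fetch(url) returns HTML or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page missing or unreachable
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:  # never queue a page twice
                seen.add(absolute)
                queue.append(absolute)
    return visited


visited = crawl("http://example.org/", FAKE_WEB.get)
print(visited)
```

The seen set is what stops the spider going round in circles when pages link back to each other.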

Parser/Indexer

Now the fun begins!

Breaking down the stuff you get into tokens

<b>lolcat</b> is indexed as 'lolcat', for
example

Some tagging is preserved as meta-information,
<h1>Cat</h1> for example
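
A small sketch of this tokenising step for HTML, using the standard html.parser module. The class name TokenisingParser, the choice of which tags to keep as meta-information (title, h1, h2) and the "body" label are assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class TokenisingParser(HTMLParser):
    """Turn HTML into (word, context) pairs, keeping only some tags as context."""

    KEPT_TAGS = {"title", "h1", "h2"}  # assumed choice of 'interesting' tags

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recent matching tag and anything left open inside it.
            last = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[last:]

    def handle_data(self, data):
        # Context is the innermost kept tag, or 'body' for everything else.
        context = next(
            (t for t in reversed(self.open_tags) if t in self.KEPT_TAGS), "body"
        )
        for word in re.findall(r"\w+", data.lower()):
            self.tokens.append((word, context))


parser = TokenisingParser()
parser.feed("<h1>Cat</h1><p>A <b>lolcat</b> is a cat.</p>")
print(parser.tokens)
```

The bold tag disappears and 'lolcat' is indexed as an ordinary body word, while 'cat' from the heading carries an h1 label that retrieval could later use to rank it higher.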

Parsing document types: text, HTML, PDF etc.

Swish-e: http://www.swish-e.org/ is a small
scale harvester/parser, for example

Some problems/opportunities

Index

What does it look like?
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
we have the following inverted file index (where the integers in the set notation brackets refer to the
indexes (or keys) of the text symbols, T[0], T[1] etc.):
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Problems: stop words !!!
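
The index above can be rebuilt in a few lines. This is a sketch, assuming whitespace tokenising; the function name build_index and the particular stop-word list are invented for the example.

```python
STOP_WORDS = frozenset({"a", "is", "it"})  # illustrative list, not a standard one


def build_index(texts, stop_words=frozenset()):
    """Map each word to the set of text numbers that contain it."""
    index = {}
    for doc_id, text in enumerate(texts):
        for word in text.split():
            if word in stop_words:
                continue  # stop words are never indexed
            index.setdefault(word, set()).add(doc_id)
    return index


T = ["it is what it is", "what is it", "it is a banana"]

full = build_index(T)
trimmed = build_index(T, STOP_WORDS)

print(sorted(full.items()))
print(sorted(trimmed.items()))  # [('banana', {2}), ('what', {0, 1})]
```

Dropping stop words shrinks the index a great deal, but the phrase "it is what it is" can no longer be told apart from "what is it". That is the problem.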

Storage

This used to be easy, now lots of options

Sparse data: some terms, such as 'lolcats', have
millions of entries; 'pyx' [look it up] won't have
many

It's a 'lot' of data: the name Google came about by
misspelling 'googol' (10^100)

Relational databases are fairly unsuitable

NoSQL and ready-mades:
http://solr-vs-elasticsearch.com/ for example

Retrieval

Here you get results of all this work

Simple, one field, one word

Booleans and implied booleans [lots of words
ANDed together]
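
An implied boolean query, where every word must match, can be sketched as a set intersection over the inverted index. INDEX is copied from the "it is a banana" example earlier in the deck; the function name search_all is made up.

```python
# The inverted index from the earlier example: word -> set of text numbers.
INDEX = {
    "a": {2},
    "banana": {2},
    "is": {0, 1, 2},
    "it": {0, 1, 2},
    "what": {0, 1},
}


def search_all(index, query):
    """Return the texts containing every word of the query (implied AND)."""
    words = query.lower().split()
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    postings.sort(key=len)  # start from the rarest word to keep sets small
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


print(search_all(INDEX, "what is it"))  # {0, 1}
print(search_all(INDEX, "it banana"))   # {2}
print(search_all(INDEX, "lolcat is"))   # set()
```

This only says which texts match, not which match best. Ordering them by relevance is the hard part.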

Relevant results, this is the main thing and
links back to the storage and parsing

Some problems: multilingual, non-Latin,
synonyms

Technical Difficulties

Finding all the documents

Documents that change, appear or disappear

Making the index and looking at it [it's big!]

Non-latin/accented scripts for latin speakers:
appauvrissement

Can you think of others?

Technical Difficulties: 2

Looking for André
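
One common way to let 'Andre' find 'André' is to fold accents away before indexing and before querying. A sketch using the standard unicodedata module follows; the function name fold is hypothetical, and whether folding is the right behaviour depends on the language and the users.

```python
import unicodedata


def fold(text):
    """Lower-case the text and strip combining accents, for matching only."""
    decomposed = unicodedata.normalize("NFKD", text)  # 'é' becomes 'e' + accent
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.casefold()


print(fold("André"))        # andre
print(fold("Andre\u0301"))  # andre (the same name typed as 'e' + combining accent)
print(fold("中国"))          # 中国 (unchanged: nothing to fold)
```

The same function has to be applied to the index and to the query. Folding only one side makes things worse, not better.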

Looking for 中国 [what's that incidentally?]

Looking for cat [furry] and cat [computer
command]

Speed of index refresh [days]

Storage and computation

Semantic search

Social Difficulties

Right to be forgotten

Security services and data mining

Privacy and doxing, see visual tagging too

Linking the unlinked

Automatic visual tagging [Facebook]

Automatic geolocation [most smartphones]

Any more?

Conclusions

It's a central human activity

It's a vital activity for the web

Very simple central idea, but lots of evolution
possible

There's a societal debate to go with the
technical evolution

Thanks!
Thanks for listening and questions!
