Inside Searching
1. Searching: Needles and Haystacks
Searching for stuff
Why it's important
How it's done
Technical difficulties
Social difficulties
2. Search and Hugh
Involved in this since about 1982, mainly on EEC
projects, especially Celex
I currently work for Vienna U on a multimedia
database for stored manuscripts etc.
Computing industry since about 1974
Hugh.barnard@gmail.com
3. As a Human Activity
Looking for keys
Remembering names and birthdays
Looking up in a book
And [the subject of this] making tools for the
intertubes
Getting a clue, from the above...
4. How it's Done: 1
Health warning: this explanation is simplified!
Let's take Google
How does it find one zillion documents/images
with 'lolcat' in them, within a few seconds?
5. How it's Done: 2
It did it already
A key concept: Indexing
Another key concept: Inverted index [see
Wikipedia]:
https://en.wikipedia.org/wiki/Inverted_index
- lolcat in document x at position y
- highlighted cat in document x at position y [?]
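A minimal sketch of the "word in document x at position y" idea. The function name, the document ids and the sample texts are made up for illustration, and positions here count words rather than characters:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to a list of (document id, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

# Hypothetical two-document collection.
docs = {"x": "my lolcat photo", "z": "another lolcat"}
index = build_positional_index(docs)

# Every place 'lolcat' occurs, worked out once, ahead of any query.
print(index["lolcat"])
```

The work is done at indexing time, so answering a query is just a lookup.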
6. Why do this at all?
Since Google, Bing, Yahoo already did it...
Lots of interesting technical pieces
Self education
Fun and profit, do it 'better' [?]
Internal search engines, intranet search
engines
Domain specific engines
School or research projects
7. Parts of the Search Engine
- Spider or Harvester [?]
- Parser/Indexer
- Index Storage
- Retrieval [the bit of Google that we see!]
I'm going to go through these in order...
8. Spider or Harvester
Go and get a load of stuff from the web
Think of it as a programmatic super-surfer, use
curl: http://curl.haxx.se/ to try, for example
Actually there are tons of ready-made ones:
http://search.cpan.org/~johnd/WWW-Crawler-Lite-0.005/lib/WWW/Crawler/Lite.pm
Be polite, user-agent name, robots.txt,
throttling etc. [?]
Can you think of some of the problems?
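One way to sketch the "be polite" rules, using Python's standard urllib.robotparser. The user-agent name, the robots.txt content and the URLs are assumptions for illustration; a real spider would download robots.txt from the site and sleep for the returned delay between requests:

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleSpider"  # hypothetical name; a real spider should identify itself

# Assumed robots.txt content, inlined so the sketch runs without a network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

policy = RobotFileParser()
policy.parse(ROBOTS_TXT.splitlines())

def may_fetch(url):
    """True if the site's robots.txt allows our user agent to fetch this URL."""
    return policy.can_fetch(USER_AGENT, url)

def polite_delay(default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = policy.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default

print(may_fetch("http://example.org/index.html"))
print(may_fetch("http://example.org/private/diary.html"))
print(polite_delay())
```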
9. Spidering/Harvesting/Darkweb
Spidering: start at top, follow links
Directory based: index a load of things in a given
directory
Harvesting: academic harvester interfaces,
domain specific, I do this at present
Robots.txt: courtesy, the interactive web and the
darkweb
Problems
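"Start at top, follow links" can be sketched as a breadth-first walk. To keep this runnable offline, the "web" is a made-up dictionary of pages and fetch is any function that returns HTML or None; all names and URLs are hypothetical:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Visit pages breadth-first from start_url; return URLs in visit order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page has disappeared
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:       # avoid going round in circles
                seen.add(absolute)
                queue.append(absolute)
    return visited

# Hypothetical three-page site with a loop and a dead link.
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="/missing.html">gone</a>',
    "http://example.org/b.html": "no links here",
}
pages = crawl("http://example.org/", FAKE_WEB.get)
print(pages)
```

The seen set and the None check cover two of the problems: pages that link back to each other, and pages that have vanished.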
10. Parser/Indexer
Now the fun begins!
Breaking down the stuff you get into tokens
<b>lolcat</b> is indexed as 'lolcat', for
example
Some tagging is preserved as meta-information,
<h1>Cat</h1> for example
Parsing document types text, html, pdf etc
Swish-e: http://www.swish-e.org/ is a small
scale harvester/parser, for example
Some problems/opportunities
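A rough sketch of the token step for HTML, using Python's standard html.parser. It drops the tags, lowercases the words, and keeps one piece of meta-information: whether the word sat inside an <h1>. The class and function names are invented for this example, and real parsers handle far more cases:

```python
import re
from html.parser import HTMLParser

class Tokenizer(HTMLParser):
    """Turn HTML into (word, in_heading) tokens."""
    def __init__(self):
        super().__init__()
        self.heading_depth = 0
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.heading_depth += 1

    def handle_endtag(self, tag):
        if tag == "h1" and self.heading_depth > 0:
            self.heading_depth -= 1

    def handle_data(self, data):
        in_heading = self.heading_depth > 0
        for word in re.findall(r"\w+", data.lower()):
            self.tokens.append((word, in_heading))

def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens

tokens = tokenize("<h1>Cat</h1> <p>My <b>lolcat</b> photo</p>")
print(tokens)
```

An indexer could then give heading words more weight than body words.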
11. Index
What does it look like?
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
Given these texts, we have the following inverted file index (the integers in braces are the
indexes (or keys) of the texts T[0], T[1] etc.):
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Problems: stop words !!!
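The index above can be rebuilt in a few lines. This is a minimal sketch, and the stop-word list is an assumption chosen just to show the effect:

```python
def build_inverted_index(texts, stop_words=frozenset()):
    """Map each word to the set of text numbers it appears in."""
    index = {}
    for doc_id, text in enumerate(texts):
        for word in text.lower().split():
            if word not in stop_words:
                index.setdefault(word, set()).add(doc_id)
    return index

T = ["it is what it is", "what is it", "it is a banana"]

index = build_inverted_index(T)
print(index)

# Same texts with an assumed stop-word list: the index shrinks a lot,
# but nobody can search for "it" or "is" any more.
small_index = build_inverted_index(T, stop_words={"a", "is", "it"})
print(small_index)
```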
12. Storage
This used to be easy, now lots of options
Sparse data: some terms ('lolcats') have
millions of entries, 'pyx' [look it up] won't have
many
It's a 'lot' of data: the name Google came from
misspelling 'googol'
Relational databases are fairly unsuitable
NoSQL and ready-mades:
http://solr-vs-elasticsearch.com/ for example
13. Retrieval
Here you get results of all this work
Simple: one field, one word
Booleans and implied booleans [lots of words
ANDed together]
Relevant results, this is the main thing and
links back to the storage and parsing
Some problems: multilingual, non-Latin,
synonyms
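The implied boolean (all words ANDed together) comes down to intersecting sets from the inverted index. A small sketch, with a hand-written index and an invented function name:

```python
def search_and(index, query):
    """Return the ids of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)

# Small hand-written inverted index: word -> set of document ids.
index = {
    "a": {2},
    "banana": {2},
    "is": {0, 1, 2},
    "it": {0, 1, 2},
    "what": {0, 1},
}

print(search_and(index, "what is it"))
print(search_and(index, "banana"))
print(search_and(index, "banana what"))
```

This only says which documents match. Putting them in a relevant order is the hard part and is not shown here.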
14. Technical Difficulties
Finding all the documents
Documents that change, appear or disappear
Making the index and looking at it [it's big!]
Non-Latin/accented scripts for Latin-script users:
appauvrissement [impoverishment]
Can you think of others?
15. Technical Difficulties: 2
Looking for André
Looking for 中国 [what's that incidentally?]
Looking for cat [furry] and cat [computer
command]
Speed of index refresh [days]
Storage and computation
Semantic search
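For the André case, one common approach (an assumption here, not necessarily what any given engine does) is to fold accents and case on both the indexed text and the query, using Python's standard unicodedata. The function name fold is made up:

```python
import unicodedata

def fold(text):
    """Strip accents and case so 'André' and 'andre' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)  # é becomes e + combining accent
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

print(fold("André"))
print(fold("Andre\u0301"))  # same name typed as e + combining acute accent
print(fold("中国"))          # non-Latin text passes through unchanged
```

Folding helps people who cannot type the accent, at the price of losing distinctions that matter in some languages. It does nothing for splitting 中国 into words or for telling the two kinds of cat apart.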
16. Social Difficulties
Right to be forgotten
Security services and data mining
Privacy and doxing, see visual tagging too
Linking the unlinked
Automatic visual tagging [Facebook]
Automatic geolocation [most smartphones]
Any more?
17. Conclusions
It's a central human activity
It's a vital activity for the web
Very simple central idea, but lots of evolution
possible
There's a societal debate to go with the
technical evolution