Surfacing the deep web

Presentation given at Internet Librarian International 2013 - Websearch Academy. 14 October 2013 on deep-web searching

Published in: Business, Technology, Design

  1. WebSearch Academy, Internet Librarian International: Surfacing the Deep Web. Arthur Weiss. Email: a.weiss@aware.co.uk / Twitter: @awareci. 14 October 2013. © AWARE 2013. Tel: +44 20 8954 9121 • Fax: +44 20 8954 2102 • Web: www.marketing-intelligence.co.uk
  2. Not everything can be found with Google. The ‘Invisible Web’ or ‘Deep Web’ consists of web pages and documents that are not indexed by conventional search engines, or are poorly or incompletely indexed.
  3. 5 Types of “Invisibility”: not search engine optimised, so pages fail to appear in “simple” searches; not indexed by search engines; excluded from search index; subscription or proprietary content; encrypted or non-indexable content.
  4. Know your tool kit: standard Google, or multiple approaches & tools.
  5. What do I need to find? What sort of needle? What sort of haystack? (Image: http://www.morguefile.com/archive/display/21091)
  6. Why will the information be available? Where will it be held (who will know it)? Can I obtain it legally and ethically from this source and, if so, how? If not, are there other sources or ways of obtaining the information? After obtaining the information, are any checks needed to verify it? What is the information’s relationship to other information?
  7. Not everything is online or can be found! Try to find: original TV coverage of the storming of the Bastille; a newspaper interview with Christopher Columbus, following his return from discovering America; a recording of Abraham Lincoln delivering the Gettysburg Address; a photo of Jesus in his crib (question from a 9-year-old: “Why didn’t anybody take photos with their phones?”). With thanks to Karen Blakeman of RBA Information (rba.co.uk) for these examples.
  8. “Forty-two! Is that all you’ve got to show for seven and a half million years’ work?” “I checked it very thoroughly and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.” (Douglas Adams, “The Hitchhiker’s Guide to the Galaxy”) If your search approach is wrong, it doesn’t matter which tool you use, or how you use it: your results will be poor or wrong.
  9. Before starting to search, consider sources for the subject / topic of interest. Why is information likely to be available? Consider also file formats and the location of search terms. What search tool / approach is most likely to access or index the information’s location (container)? Are there unique terms or jargon that lead to a specialist tool, e.g. lung cancer (consumer) versus pulmonary carcinoma (medical)? Are there societies, organisations, people, or groups that may have information? (Who / where else could have information?) Would any of the relevant pages be in another language? “cheap hotel in Dubai” OR “فندق اقتصادي في دبي”
  10. Before starting to search, consider search terms for the topic or subject of interest. Are there any synonyms or variant spellings? Tyre or tire; aluminium or aluminum; candy or sweet; Basle or Basel. Are there any other words likely to be in documents on the topic? Are any keywords part of a common phrase? Are any keywords likely to be in irrelevant documents that should be excluded from searches? How might the information be written? “I work for Xcompany” to search for employees of Xcompany; “X is better than” for comparisons.
  11. Research Planning. Information requirements: break down into individual questions that, when answered, will provide the required knowledge. Don’t start searching without knowing what you are looking for, and why.
  12. An example research plan. Copy & fill in a sheet for each key information question / topic. Sheet fields: Research Topic; Research Questions (break down the topic into answerable questions); then, for each source, the Search Approach / Parameters, the Type of information expected, and Comments / Possible problems. Example rows:
      Source: LinkedIn. Search approach / parameters: job title, current employer, etc. Type of information expected: people profiles. Comments / possible problems: may not be accurate or in-date.
      Source: Google Scholar. Search approach / parameters: author name, topic, date, etc. Type of information expected: citations, academic research papers. Comments / possible problems: doesn’t cover everything.
      Source: National statistics. Search approach / parameters: site search engine. Type of information expected: census & demographic data. Comments / possible problems: may be old or incomplete.
  13. Types of “Invisibility”: not search engine optimised, so pages fail to appear in “simple” searches; not indexed by search engines; excluded from search index; subscription or proprietary content; encrypted or non-indexable content.
  14. Advanced Searching: use advanced search operators and options, e.g. filetype: / intitle: / inurl: / .. (numeric range) and * (wildcard).
  15. Search Engines – not just Google
  16. Types of “Invisibility”: not search engine optimised, so pages fail to appear in “simple” searches; not indexed by search engines; excluded from search index; subscription or proprietary content; encrypted or non-indexable content.
  17. Specialist Search / Deep Web Search
  18. Search for Information “Containers”. Knowing a reason for the information to be available can lead to an information source. Who else would want this information? Search for topic + “database”, e.g. coffee database (the slide showed the first two results).
  19. Case Examples – Economics by Country
  20. Case Examples – Trade Statistics
  21. Case Examples – Economic Indicators
  22. Case Examples – Genealogy
  23. Types of “Invisibility”: not search engine optimised, so pages fail to appear in “simple” searches; not indexed by search engines; excluded from search index; subscription or proprietary content; encrypted or non-indexable content.
  24. Proprietary sites / Blocked from Index. Register for password-protected sites. Use site search or the site map, if available. If a robots.txt file exists, it lists the paths the site asks search engines not to crawl, so you may be able to view those hidden pages directly, e.g. nytimes.com/robots.txt
  25. Types of “Invisibility”: not search engine optimised, so pages fail to appear in “simple” searches; not indexed by search engines; excluded from search index; subscription or proprietary content; encrypted or non-indexable content.
  26. Content that can’t / won’t be indexed. Non-textual information, e.g. multimedia / audiovisual: Bing has search operators that can find RSS feeds (hasfeed:) and pages containing specific types of file (e.g. mp3 files: contains:mp3); alternatively, search for related textual information, e.g. descriptions or sources (such as artwork or film titles). Encrypted information / .onion sites: Project Tor (torproject.org) and the TOR browser; access encrypted sites via proxy servers.
  27. Searching TOR. On regular Google: fake passport site:onion.to
  28. TOR / .Onion Sites
  29. Any Questions? Arthur Weiss is the managing director of AWARE - a UK-based consultancy specialising in marketing & competitive intelligence analysis. Contact details: Web site: www.marketing-intelligence.co.uk. E-mail: a.weiss@aware.co.uk. Twitter: @awareci. Telephone: +44 20 8954 9121. Fax: +44 20 8954 2102.
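
Slides 10 and 14 suggest combining variant spellings with advanced operators. The Python sketch below is one possible way to assemble such a query string; it is not part of the original presentation. The helper names `or_group` and `build_query` are hypothetical, and the `https://www.google.com/search?q=` URL form is an assumption about how the query would be submitted.

```python
from urllib.parse import quote_plus


def or_group(terms):
    """Join spelling variants with OR; quote multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    if len(quoted) == 1:
        return quoted[0]
    return "(" + " OR ".join(quoted) + ")"


def build_query(variant_groups, filetype=None, intitle=None, site=None, exclude=()):
    """Assemble a query string from variant groups plus optional operators."""
    parts = [or_group(group) for group in variant_groups]
    if intitle:
        parts.append(f"intitle:{intitle}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    if site:
        parts.append(f"site:{site}")
    # Prefix unwanted keywords with a minus sign to exclude them.
    parts.extend(f"-{word}" for word in exclude)
    return " ".join(parts)


# Variant spellings taken from slide 10, restricted to PDF files.
query = build_query([["Basle", "Basel"], ["tyre", "tire"]], filetype="pdf")
print(query)  # (Basle OR Basel) (tyre OR tire) filetype:pdf

# URL-encode the query so it can be pasted into a browser (assumed URL form).
url = "https://www.google.com/search?q=" + quote_plus(query)
print(url)
```

Writing the variants down as groups before searching makes it harder to forget a spelling, and the same groups can be reused across different search engines.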

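
Slide 24 mentions looking at a site's robots.txt file. As a minimal sketch (not from the original presentation), the Python below extracts the `Disallow` paths from robots.txt text. The function names and the `SAMPLE` content are hypothetical, and `fetch_robots` assumes the file is served over HTTPS at the site root; it is defined but not called here.

```python
from urllib.request import urlopen


def disallowed_paths(robots_text):
    """Return the Disallow paths in robots.txt text, in order, without repeats."""
    paths = []
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return list(dict.fromkeys(paths))  # de-duplicate, keeping order


def fetch_robots(domain):
    """Download a site's robots.txt (assumes HTTPS and the standard location)."""
    with urlopen(f"https://{domain}/robots.txt", timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Hypothetical robots.txt content, for illustration only.
SAMPLE = """\
User-agent: *
Disallow: /archive/private/
Disallow: /search   # internal site search
Allow: /archive/public/
Disallow:
"""

print(disallowed_paths(SAMPLE))  # ['/archive/private/', '/search']
```

The listed paths are only requests to crawlers, so some may still be password-protected or otherwise unavailable; the questions on slide 6 about legal and ethical access apply here too.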