The Ultimate Guide to the Invisible Web
Published on Monday 18th of December, 2006 from OEDb.org

When you use a search engine on the Internet and can't find what you're looking for, what do you do?
Maybe you're seeking to learn something, which means you're probably going to keep trying until you
find it. Or give up in frustration. Don't give up that easily. There's information out there that is actually
not indexed in the big search engines. Such Web pages are part of what's called the Dark, Deep,
Hidden or Invisible Web. Those pages that are actually indexed are known by some as the surface Web.
Fortunately, the invisible Web is getting easier to search, with tools beyond the standard big three
search engines such as Google, Yahoo, and MSN.

In the early days of the Web, computing power and storage space were at such a premium that the few
search engines that were around often indexed only a tiny fraction of Web pages and not even full
pages at that. But eventually space became relatively cheap and engines started indexing pages in full
(full text), as well as more pages. Still, engines miss a lot of pages. Here's a guide to those "invisible"
pages.

Background of the Invisible Web

   1. The term. "Invisible" is purely search engine-centric, indicating any Web page that can be
      accessed by at least one person but which is not indexed in a search engine. Many people
      prefer the term "Deep Web" instead.
   2. Its size. No one knows for sure. Danny Sullivan, a search engine expert and formerly of Search
      Engine Watch, wrote in 2000 that the invisible Web was about 500 times Google's index of one
      billion pages. Newer estimates [NY Times' free registration may be required] of Google's index
      set it at over 8 billion pages at the time of this writing. (Claims by its archrival Yahoo! of 19+
      billion pages were considered questionable.) Search engines are said to crawl only 16-20% of
      the Internet.
   3. Its real size. The most likely entity to be able to make any sort of "accurate" estimate is Google,
      though if they've made a recent estimate of either the current size or growth rate of the
      invisible Web, that information itself appears to be invisible. (They would have a list
      somewhere of never-crawled URLs, which would be a mere starting point, as there would also
      be all those countless URLs even they cannot get to. Without this, how could an estimate be
      calculated?)
   4. A guesstimate. Any astute mathematician with an understanding of Web content management
      systems, content databases, and dynamically-served Web pages would probably say between 1
      and 4 trillion pages, then conclude the near impossibility of an accurate estimate, especially
      because of the rapidly increasing number of invisible sites. It's easier to compare search engine
      index size.
   5. Example of futility of estimating. A library or museum gets gifted with a collection of one
      million digital images and decides to create a Web-accessible database. Each image will have its
      own dynamically-served page, accessible via a query form. Just like that, one million new pages
      have been added to the invisible Web.
   6. How many invisible sites. In the same article by Danny Sullivan (above), he indicates
      BrightPlanet's estimate of 100,000 as being the number of "significant invisible websites" out of
      about 200,000. That was in 2000, so it's a hopelessly outdated estimate. Since then, weblogs
      have been added to the mix, and many of them go uncrawled, increasing the number of
      invisible sites.
   7. Rate of growth of invisible sites. Technorati's David Sifry said that 100,000 new blogs are
      created daily as of October 2006, but he also said 175,000 daily as of July 2006. Even at the
      lower figure, if at least one page on each new blog is never indexed, the size of the invisible
      Web is growing at around 36.5 million new pages per year. That doesn't even include other
      types of invisible content (described elsewhere in this article).
   8. Will this change? Google recently filed a patent application related to searching content
      through Web-based forms. SEO by the SEA speculates that they are planning to index more of
      the invisible Web and goes on to explain a possible methodology. Google's Eric Schmidt (or
      possibly founders Larry Page and Sergey Brin) has said Google is dedicated to indexing the
      world's content, however long it takes. Also, more previously invisible pages are getting
      indexed because of manually-added links to them from visible pages.
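
The growth arithmetic in point 7 can be checked with a few lines of Python. This is only a sketch: the lower daily figure is the one implied by the 36.5 million total (100,000 new blogs a day), the higher one is the 175,000 quoted for July 2006, and the "one never-indexed page per new blog" assumption is the article's own. The function name is illustrative.

```python
DAYS_PER_YEAR = 365


def yearly_invisible_pages(new_blogs_per_day, unindexed_pages_per_blog=1):
    """Pages added to the invisible Web per year, assuming each new blog
    contributes a fixed number of pages that are never indexed."""
    return new_blogs_per_day * unindexed_pages_per_blog * DAYS_PER_YEAR


lower = yearly_invisible_pages(100_000)   # lower daily figure
higher = yearly_invisible_pages(175_000)  # higher daily figure

print(f"lower estimate:  {lower:,} pages per year")
print(f"higher estimate: {higher:,} pages per year")
```

The lower figure reproduces the 36.5 million pages per year quoted above; the higher one comes to roughly 63.9 million. Either way, this counts blogs only and ignores every other type of invisible content.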

9 Reasons a Web Page is Invisible

"Invisible" does not mean that a Web page is necessarily inaccessible. It simply means the page is not
indexed by a search engine and is thus "invisible" to a searcher who does not know of its existence.
There are several reasons why a page may be invisible. Keep in mind that some pages are only
temporarily invisible and may be indexed at a later date. The general rule of thumb is that a search
engine finding no results does not mean the content is not there. The list below also includes
examples of content types gleaned from Internet Tutorials.

   1. Dynamic URLs. Engines have traditionally ignored any Web pages whose URLs have a long
      string of parameters and equal signs and question marks, on the off chance that they'll
      duplicate what's in their database — or worse — the spider will somehow go around in circles.
      Danny Sullivan refers to such pages as part of the "shallow web".
   2. Form-controlled entry, non-passworded. In this case, page content only gets displayed when a
      human applies a set of actions, mostly entering data into a form (specific query information,
      such as job criteria for a job search engine). This typically includes databases that generate
      pages on demand and hence cannot be indexed by a spider. Applicable content includes travel
      industry data (flight info, hotel availability), job listings, product databases, patents,
      publicly-accessible government information, dictionary definitions, laws, stock market data,
      phone books and professional directories.
   3. Passworded access, subscription or non-subscription. This includes VPNs (virtual private
      networks) and any Web site where some pages require username and password information.
      Access may or may not be by paid subscription. However, BrightPlanet found in 2001 that 95%
      of the invisible Web is publicly accessible without fees or subscriptions. Applicable content
      includes academic and corporate databases, newspaper or journal content, and academic
      library subscriptions.
   4.   Time-limited access. On some sites, such as the New York Times or Marketing Profs, content
        becomes inaccessible after a certain time without a password. Search engines retain the URL,
        but the page generates a sign-up form, and the content is moved to a new URL that requires a
        password. Note that the content is sometimes cached by an engine. The NY Times also has
        alternate URLs to some time-dated content that show the original content without a password.
        You just have to know how to get to it.
   5.   Too new. If a site is relatively new, it's likely that few or none of its Web pages will be indexed
        by any engine. This results in the site's pages being mostly invisible for a short period of time
        (2-6 months).
   6.   Robots exclusion. The robots.txt file, which usually lives in the main directory of a Web site,
        tells search robots which files and directories should not be indexed. Hence its name, "robots
        exclusion file." If this file is set up, it will block certain pages from being indexed, and those
        pages will hence be invisible to searchers.
   7.   Flash presentation. Text content in Flash presentations is not indexed, though additional meta-
        information might be.
   8.   Geo-tagged. A site's Web server can check for the supposed geographic location, via the IP
        address, of a visitor's computer. Computers from certain regions can then be blocked. That
        may include blocking some search engines. For example, several American TV broadcasters are
        now showing video online, but the pages are only accessible to visitors who appear to be in the
        US, sometimes only in certain regions or certain states.
   9.   Hidden pages. One of the simplest and most common reasons for invisible Web pages is that
        they are hidden. That is, there is simply no sequence of hyperlink clicks that could take you to
        such a page. The pages are accessible, but only people who know of their existence know how
        to view them.
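
Two of the reasons above, dynamic URLs (#1) and robots exclusion (#6), are mechanical enough to sketch in code. The Python below is a rough illustration only, using the standard library's `urllib.parse` and `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, the example.org URLs and the "more than two query parameters" threshold are all assumptions made for the example; real search engines do not publish the rules they use to decide what to skip.

```python
from urllib.parse import parse_qsl, urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""


def looks_dynamic(url, max_params=2):
    """Heuristic: call a URL 'dynamic' if it has more than max_params
    query parameters. The threshold is an arbitrary assumption."""
    query = urlsplit(url).query
    return len(parse_qsl(query, keep_blank_values=True)) > max_params


def blocked_by_robots(url, robots_txt=ROBOTS_TXT, agent="ExampleBot"):
    """True if the given robots.txt text disallows this URL for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


def invisibility_reasons(url):
    """List which of the two sketched reasons apply to a URL."""
    reasons = []
    if looks_dynamic(url):
        reasons.append("dynamic URL")
    if blocked_by_robots(url):
        reasons.append("robots exclusion")
    return reasons


for example in (
    "https://example.org/about.html",
    "https://example.org/private/report.html",
    "https://example.org/catalog?cat=12&sort=asc&page=3&session=abc",
):
    print(example, "->", invisibility_reasons(example) or "no obvious reason")
```

The other reasons in the list (forms, passwords, time limits, unlinked pages) cannot be detected from the URL alone, which is part of why the invisible Web is so hard to measure.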

10 Ways to Make Invisible Content Visible

We have discussed what type of content is invisible and where we might find such information. Now
imagine if there were some way to make some of that invisible content more visible. That's possible for
some Web pages.

   1. Do a static dump. If you have a small database of content, you may want to simply dump it out
      to one static HTML page, with relevant formatting and necessary hyperlinks, then link to this
      static page from an already "visible" (indexed) page.
   2. Do categorized database publishing. If you have a database of, say, products, you could publish
      select information to static category and overview pages, thereby making content available
      without form-based or query-generated access. Of course, this works best for information that
      does not become outdated. Job listings, for example, may not suit this method.
   3. Convert formats. Word processor documents, spreadsheets, slideshows, PDFs, audio and video all
      used to be part of the invisible Web. However, Google and other text search engines started
      indexing their contents a few years ago, adding to the available pages of the visible Web. The
      benefit to librarians, researchers and others is that it's now easier to find a particular piece of
      text. But if you have a format such as Flash, which isn't indexed, you could publish a static version
      of the text content to supplement the rich media.
   4. Transcribe information. Have audio or video content such as a podcast? Transcribe the
       information and publish it as supplementary text.
   5. Build links. Link to your own pages from other related pages. If you write about, say, trees on
       page A, then write about trees again on page B, link from page B to page A to give A more
       relevance. If page A hasn't been indexed, it is likely to be once B is indexed. Points 6-9 are
       alternate ways to build links, hence helping make content visible.
   6. Publish a sitemap. Not the new XML kind that the Big 3 search engines agreed to a standard on,
       but an HTML page that maps out the main sections of your site. This is essentially a way to build
       links (#5). Each main section will in turn link to specific pages. The result is that a spider has a
       relevance map with which to decide what to index. Then again, you can also use the new type
       of sitemap to achieve deep indexing. Chris Pearson offers a sitemap generator and template.
   7. Build a topic pyramid. This is a specialized form of sitemap that actually spans many pages. The
       apex (top-most) page has general topics and links to the next layer of pages, which have more
       specific topics and links to the next layer. The bottom-most layer of the topic pyramid consists of
       your original Web pages or blog posts, which have the most specific content. This method builds
       page relevance via the serial linking, which encourages spiders to visit and index the pages.
   8. Write about it elsewhere. This is a form of link-building. When someone writes about an
       invisible page and links to it, it becomes visible by proxy, once an engine follows through and
       indexes it.
   9. Socially bookmark it. If you find something you like, say a book at Project Gutenberg,
       bookmark the URL at a social bookmarking site such as Del.icio.us and add a brief description.
   10. Remove access restrictions. Get rid of the need to log in, or don't apply time limits.
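
The "static dump" in point 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a recommended publishing pipeline: the `items` table, its columns, the sample rows and the `/images/view?id=...` URLs are all hypothetical, and an in-memory SQLite database stands in for whatever database actually holds the content.

```python
import sqlite3
from html import escape


def static_dump(conn, title):
    """Render every row of the (hypothetical) items table as one static
    HTML page, with a plain hyperlink to each record's dynamic URL."""
    rows = conn.execute(
        "SELECT name, description, url FROM items ORDER BY name"
    ).fetchall()
    items = "\n".join(
        f'    <li><a href="{escape(url, quote=True)}">{escape(name)}</a>: '
        f"{escape(description)}</li>"
        for name, description, url in rows
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html>\n<head><title>{escape(title)}</title></head>\n"
        f"<body>\n  <h1>{escape(title)}</h1>\n  <ul>\n{items}\n  </ul>\n"
        "</body>\n</html>\n"
    )


# Stand-in database with two sample records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, description TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("Oak", "Photo of an oak leaf", "/images/view?id=1"),
        ("Maple & Birch", "Two leaves side by side", "/images/view?id=2"),
    ],
)

page = static_dump(conn, "Tree image collection")
print(page)
```

Saving `page` to a file and linking to that file from an already indexed page gives a spider an ordinary hyperlink path to every record, which is the whole point of the technique. The same loop, grouped by category, would produce the category and overview pages described in point 2.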

How to Access and Search for Invisible Content

If a site publisher does none of the above to make their content more accessible, there are still ways to
get at the content, if not the actual pages.

Imagine if there were a search engine that could help you access some of the invisible Web. It would
have an advantage over traditional engines. Well, there's more than one such engine, and even
traditional engines are making a move in that direction. The larger engines already index rich media
such as PDF files, word processor documents, spreadsheets, etc.

Invisible Web engines have taken a different approach, collaborating with Web site publishers to index
the otherwise invisible content. But for invisible content that cannot and/or should not be made
visible, there are still a number of ways to get access:

      Be a student, alumnus, or professor to gain access to university records and library journals.
      Be an employee of a company with a VPN over the Internet.
      Request access. This might be as simple as signing up for free.
      Pay for a subscription.
      Request a "dump" page of a database. Sometimes a request to the right person will gain you
      this data.
      Use a Deep Web engine, portal, or directory.

To actually search for effectively invisible content:

   1. Use a site's search engine. These tend not to be as robust for complex query terms, and usually
      are quite literal about the search string, but they are more likely to show you where invisible
      content is than a regular engine.
   2. Use site archive navigation. On weblogs in particular, you can use the archive links to find info,
      albeit through manual searching.
   3. Use the word "database". Using the word "database" in your regular search engine query will
      often find you information that is otherwise nearly impossible to find. For example, if you are
      looking for a database of images, you can type the search string images database into Google
      or one of the other engines. Somewhere down the results list in Google, you'll find Full-Text
      Database Images from the USPTO (US Patent and Trademark Office). You can then use the
      Quick or Advanced search forms to find patents relating to one or more terms. If there are
      images to be seen, there will be links to them.
   4. Use a suitable resource. Use an "invisible Web" directory, portal or specialized search engine
      such as Google Book Search, Google Scholar, Librarian's Internet Index, or BrightPlanet's
      Complete Planet (70,000 searchable databases and specialty search engines).

15 Invisible Web Search Tools

BrightPlanet estimated in 2001 that in excess of 200,000 "Deep Web" sites existed. They found that 60
of the largest of these sites collectively contained 40 times the pages in the surface Web (at the time),
and that, despite being invisible in the engines, these sites received a significant amount of traffic. Here is a small
sampling of invisible Web search tools (directories, portals, engines) to help you find some invisible
content. To see more like these, please look at our Research Beyond Google article.

   1. Deep Web Search Engine — Clusty.
   2. Art — Musée du Louvre.
   3. Books Online — The Online Books Page.
   4. Business — Explorit Now!.
   5. Consumer — US Consumer Products Safety Commission Recalled Products.
   6. Economic and Job Data — FreeLunch.com — A searchable directory of free economic data.
   7. Finance and Investing — Bankrate.com.
   8. General Research — GPO's Catalog of US Government Publications.
   9. Government Data — Copyright Records (LOCIS).
   10. International — International Data Base (IDB).
   11. Law and Politics — THOMAS (Library of Congress).
   12. Library of Congress — Library of Congress.
   13. Medical and Health — PubMed.
   14. Science — ScienceResearch.com.
   15. Transportation — FAA Flight Delay Information.

References and Resources

These are relevant references that are not linked to above, which may be of interest to writers and
researchers. There's a strong leaning to research papers here, some of which have dozens of links to
PDF documents on the technical aspects of accessing, indexing and retrieving deep Web content. A few
references below are to companies offering "internet intelligence" tools and software.

   1. About WebSearch — Christmas 2006 web search guide.
   2. About Websearch — The deep web — find out more about the deep web — deep web search.
   3. ALA — American Library Association.
   4. BrightPlanet — FAQ.
   5. Deep Web Research — A gigantic list of resources.
   6. Deep Web Technologies.
   7. Ellipsis — Metadata, Google, and the Invisible Web.
   8. Envisional.
   9. Google Librarian Center.
   10. Google Library Project.
   11. Lifehacker — How to search the invisible web.
   12. MediaBistro — Some resources for freelancers.
   13. MetaQuerier — Exploring and integrating the deep web.
   14. QProber — Classifying and searching hidden-web text databases.
   15. The Invisible Web Weblog.
   16. University of California, Berkeley — Invisible or deep web.

The ultimate guide to the invisible web

  • 1. The Ultimate Guide to the Invisible Web Published on Monday 18th of December, 2006 from OEDb.org When you use a search engine on the Internet and can't find what you're looking for, what do you do? Maybe you're seeking to learn something, which means you're probably going to keep trying until you find it. Or give up in frustration. Don't give up that easily. There's information out there that is actually not indexed in the big search engines. Such Web pages are part of what's called the Dark, Deep, Hidden or Invisible Web. Those pages that are actually indexed are known by some as the surface Web. Fortunately, the invisible Web is getting easier to search, with tools beyond the standard big three search engines such as Google, Yahoo, and MSN. In the early days of the Web, computing power and storage space was at such a premium that the few search engines that were around often indexed only a tiny fraction of Web pages and not even full pages at that. But eventually space became relatively cheap and engines started indexing pages in full (full text), as well as more pages. Still, engines miss a lot of pages. Here's a guide to those "invisible" pages. Background of the Invisible Web 1. The term. "Invisible" is purely search engine-centric, indicating any Web page that can be accessed by at least one person but which is not indexed in a search engine. Many people prefer the term "Deep Web" instead. 2. Its size. No one knows for sure. Danny Sullivan, a search engine expert and formerly of Search Engine Watch, wrote in 2000 that the invisible Web was about 500 times Google's index of one billion pages. New estimates [NY Times' free registration may be required] of Google's index sets it at over 8 billion at the time of this writing. (Claims by its archival archrival Yahoo! of 19+ billion pages were considered questionable.) Search engines are said to only crawl 16-20% of the Internet 3. Its real size. 
The most likely entity to be able to make any sort of "accurate" estimate is Google, though if they've made a recent estimate of either the current size or growth rate of the invisible Web, that information itself appears to be invisible. (They would have a list somewhere of never-crawled URLs, which would be a mere starting point, as there would also be all those countless URLs even they cannot get to. Without this, how could an estimate be calculated?) 4. A guesstimate. Any astute mathematician with an understanding of Web content management systems, content databases, and dynamically-served Web pages would probably say between 1 and 4 trillion pages, then conclude the near impossibility of an accurate estimate, especially because of the rapidly increasing number of invisible sites. It's easier to compare search engine index size. 5. Example of futility of estimating. A library or museum gets gifted with a collection of one million digital images and decides to create a Web-accessible database. Each image will have its
  • 2. own dynamically-served page, accessible via a query form. Just like that, one million new pages have been added to the invisible Web. 6. How many invisible sites. In the same article by Danny Sullivan (above), he indicates BrightPlanet's estimate of 100,000 as being the number of "significant invisible websites" out of about 200,000. That was in 2000, so it's a hopelessly outdated estimate. Since then, weblogs have been added to the mix, and many of them go uncrawled, increasing the number of invisible sites. 7. Rate of growth of invisible sites. Technorati's David Sifry said that10,0000 new blogs are created daily as of October 2006, but he also said 175,000 daily as of July 2006. Even at the lower figure, if at least one page on each new blog is never indexed, the size of the invisible Web is growing at around 36.5 million new pages per year. That doesn't even include other types of invisible content (described elsewhere in this article). 8. Will this change? Google recently filed a patent application related to searching content through Web-based forms. SEO by the SEA speculates that they are planning to index more of the invisible Web and goes on to explain a possible methodology. Google's Eric Schmidt (or possibly founders Larry Page and Sergey Brin) has said Google is dedicated to indexing the world's content, however long it takes. Also, more previously invisible pages are getting indexed because of manually-added links to them from visible pages. 9 Reasons a Web Page is Invisible By "invisible", this does not mean a Web page is necessarily inaccessible. It simply means it's not indexed by a search engine and is thus "invisible" to a searcher who does not know of its existence. There are several reasons why a page may be invisible. Keep in mind that some pages are only temporarily invisble, possibly being indexed at a later date. The general rule of thumb is that just because a search engine finds no results does not mean it's not there. 
The list below also includes examples of content types gleaned from Internet Tutorials.

   1. Dynamic URLs. Engines have traditionally ignored any Web pages whose URLs have a long string of parameters, equal signs, and question marks, on the off chance that they duplicate what's already in the database or, worse, send the spider around in circles. Danny Sullivan refers to such pages as part of the "shallow web".
   2. Form-controlled entry, non-passworded. In this case, page content only gets displayed when a human performs a set of actions, mostly entering data into a form (specific query information, such as job criteria for a job search engine). This typically includes databases that generate pages on demand and hence cannot be indexed by a spider. Applicable content includes travel industry data (flight info, hotel availability), job listings, product databases, patents, publicly accessible government information, dictionary definitions, laws, stock market data, phone books, and professional directories.
   3. Passworded access, subscription or non-subscription. This includes VPNs (virtual private networks) and any Web site where some pages require a username and password. Access may or may not be by paid subscription. However, BrightPlanet found in 2001 that 95% of the invisible Web is publicly accessible without fees or subscriptions. Applicable content
      includes academic and corporate databases, newspaper or journal content, and academic library subscriptions.
   4. Time-limited access. On some sites, such as the New York Times or Marketing Profs, content becomes inaccessible without a password after a certain time. Search engines retain the URL, but the page generates a sign-up form, and the content is moved to a new URL that requires a password. Note that the content is sometimes cached by an engine. The NY Times also has alternate URLs to some time-dated content that show the original content without a password. You just have to know how to get to them.
   5. Too new. If a site is relatively new, it's likely that few or none of its Web pages will be indexed by any engine. This results in the site's pages being mostly invisible for a short period of time (2-6 months).
   6. Robots exclusion. The robots.txt file, which usually lives in the main directory of a Web site, tells search robots which files and directories should not be indexed. Hence its name, "robots exclusion file." If this file is set up, it will block certain pages from being indexed, and those pages will be invisible to searchers.
   7. Flash presentation. Text content in Flash presentations is not indexed, though additional meta-information might be.
   8. Geo-tagged. A site's Web server can check the supposed geographic location of a visitor's computer via its IP address. Computers from certain regions can be blocked. That may include blocking some search engines. For example, several American TV broadcasters are now showing video online, but the pages are only accessible to visitors located in the US, sometimes only in certain regions or states.
   9. Hidden pages. One of the simplest and most common reasons for invisible Web pages is that they are hidden. That is, there is simply no sequence of hyperlink clicks that could take you to such a page. The pages are accessible, but only people who know of their existence know how to view them.
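
The robots exclusion rule in reason 6 can be checked from the crawler's side. The following is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the helper name is_crawlable are all hypothetical and chosen only for illustration; real search engine spiders may apply additional rules of their own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A page outside the disallowed directories may be indexed.
    print(is_crawlable(ROBOTS_TXT, "http://example.com/index.html"))  # True
    # A page under /private/ is excluded, so it stays invisible to searchers.
    print(is_crawlable(ROBOTS_TXT, "http://example.com/private/report.html"))  # False
```

A site owner could run a check like this against their own robots.txt to see which pages a well-behaved spider would skip.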

10 Ways to Make Invisible Content Visible

We have discussed what types of content are invisible and where we might find such information. Now imagine if there were some way to make some of that invisible content more visible. That's possible for some Web pages.

   1. Do a static dump. If you have a small database of content, you may want to simply dump it out to one static HTML page, with relevant formatting and necessary hyperlinks, then link to this static page from an already "visible" (indexed) page.
   2. Do categorized database publishing. If you have a database of, say, products, you could publish select information to static category and overview pages, thereby making content available without form-based or query-generated access. Of course, this works best for information that does not become outdated. Job listings, for example, may not suit this method.
   3. Convert formats. Word processor documents, spreadsheets, slideshows, PDFs, audio, and video all used to be part of the invisible Web. However, Google and other text search engines started indexing their contents a few years ago, adding to the available pages of the visible Web. The benefit to librarians, researchers, and others is that it's now easier to find a particular piece of text. But if you
      have a format such as Flash, which isn't indexed, you could publish a static version of the text content to supplement the rich media.
   4. Transcribe information. Have audio or video content such as a podcast? Transcribe the information and publish it as supplementary text.
   5. Build links. Link to your own pages from other related pages. If you write about, say, trees on page A, then write about trees again on page B, link from page B to page A to give A more relevance. If page A hasn't been indexed, it will be after B is indexed. Points 6-9 are alternate ways to build links, hence helping make content visible.
   6. Publish a sitemap. Not the new XML kind that the Big 3 search engines agreed on as a standard, but an HTML page that maps out the main sections of your site. This is essentially a way to build links (#5). Each main section will in turn link to specific pages. The result is that a spider has a relevance map with which to decide what to index. Then again, you can also use the new type of sitemap to achieve deep indexing. Chris Pearson offers a sitemap generator and template.
   7. Build a topic pyramid. This is a specialized form of sitemap that actually spans many pages. The apex (top-most) page has general topics and links to the next layer of pages, which have more specific topics and links to the next layer. The bottom-most layer of the topic pyramid consists of your original Web pages or blog posts, which have the most specific content. This method builds page relevance via the serial linking, which induces spiders to visit and index.
   8. Write about it elsewhere. This is a form of link-building. When someone writes about an invisible page and links to it, it becomes visible by proxy, once an engine follows the link and indexes it.
   9. Socially bookmark it. If you find something you like, say a book at Project Gutenberg, bookmark the URL at a social bookmarking site such as Del.icio.us and add a brief description.
  10. Remove access restrictions.
      Get rid of the need to log in, or don't apply time limits.

How to Access and Search for Invisible Content

If a site publisher does none of the above to make their content more accessible, there are still ways to make the content available, if not the actual pages. Imagine if there were a search engine that could help you access some of the invisible Web. It would have an advantage over traditional engines. Well, there's more than one such engine, and even traditional engines are making a move in that direction. The larger engines already index rich media such as PDF files, word processor documents, spreadsheets, etc. Invisible Web engines have taken a different approach, collaborating with Web site publishers to index the otherwise invisible content.

But for invisible content that cannot and/or should not be made visible, there are still a number of ways to get access:

   • Be a student, alumnus, or professor to gain access to university records and library journals.
   • Be an employee of a company with a VPN over the Internet.
   • Request access. This might be as simple as signing up for free.
   • Pay for a subscription.
   • Request a "dump" page of a database. Sometimes a request to the right person will gain you this data.
   • Use a Deep Web engine, portal, or directory.

To actually search for effectively invisible content:

   1. Use a site's search engine. These tend not to be as robust for complex query terms, and are usually quite literal about the search string, but they are more likely than a regular engine to show you where invisible content is.
   2. Use site archive navigation. On weblogs in particular, you can use the archive links to find info, albeit through manual searching.
   3. Use the word "database". Using the word "database" in your regular search engine query will often find you information that is otherwise nearly impossible to find. For example, if you are looking for a database of images, you can type the search string images database into Google or one of the other engines. Somewhere down the results list in Google, you'll find Full-Text Database Images from the USPTO (US Patent and Trademark Office). You can then use the Quick or Advanced search forms to find patents relating to one or more terms. If there are images to be seen, there will be links to them.
   4. Use a suitable resource. Use an "invisible Web" directory, portal, or specialized search engine such as Google Book Search, Google Scholar, Librarian's Internet Index, or BrightPlanet's Complete Planet (70,000 searchable databases and specialty search engines).

15 Invisible Web Search Tools

BrightPlanet estimated in 2001 that in excess of 200,000 "Deep Web" sites existed. They found that 60 of the largest of these sites collectively contained 40 times the pages in the surface Web (at the time), and that, despite being invisible to the engines, these sites received a significant amount of traffic.

Here is a small sampling of invisible Web search tools (directories, portals, engines) to help you find some invisible content. To see more like these, please look at our Research Beyond Google article.

   1.
      Deep Web Search Engine: Clusty.
   2. Art: Musée du Louvre.
   3. Books Online: The Online Books Page.
   4. Business: Explorit Now!.
   5. Consumer: US Consumer Product Safety Commission Recalled Products.
   6. Economic and Job Data: FreeLunch.com, a searchable directory of free economic data.
   7. Finance and Investing: Bankrate.com.
   8. General Research: GPO's Catalog of US Government Publications.
   9. Government Data: Copyright Records (LOCIS).
  10. International: International Data Base (IDB).
  11. Law and Politics: THOMAS (Library of Congress).
  12. Library of Congress: Library of Congress.
  13. Medical and Health: PubMed.
  14. Science: ScienceResearch.com.
  15. Transportation: FAA Flight Delay Information.

References and Resources

These are relevant references that are not linked to above, which may be of interest to writers and researchers. There's a strong leaning toward research papers here, some of which have dozens of links to PDF documents on the technical aspects of accessing, indexing, and retrieving deep Web content. A few references below are to companies offering "internet intelligence" tools and software.

   1. About WebSearch: Christmas 2006 web search guide.
   2. About WebSearch: The deep web, find out more about the deep web, deep web search.
   3. ALA: American Library Association.
   4. BrightPlanet: FAQ.
   5. Deep Web Research: A gigantic list of resources.
   6. Deep Web Technologies.
   7. Ellipsis: Metadata, Google, and the Invisible Web.
   8. Envisional.
   9. Google Librarian Center.
  10. Google Library Project.
  11. Lifehacker: How to search the invisible web.
  12. MediaBistro: Some resources for freelancers.
  13. MetaQuerier: Exploring and integrating the deep web.
  14. QProber: Classifying and searching hidden-web text databases.
  15. The Invisible Web Weblog.
  16. University of California, Berkeley: Invisible or deep web.

Copyright © 2006-2010 OEDb - Accredited Online, Specialty, and Campus-Based Colleges