The Ultimate Guide to the Invisible Web
Published on Monday 18th of December, 2006 from OEDb.org

When you use a search engine on the Internet and can't find what you're looking for, what do you do? Maybe you're seeking to learn something, which means you're probably going to keep trying until you find it. Or give up in frustration. Don't give up that easily. There's information out there that is not indexed in the big search engines. Such Web pages are part of what's called the Dark, Deep, Hidden or Invisible Web. Pages that are indexed are known by some as the surface Web.

Fortunately, the invisible Web is getting easier to search, with tools beyond the standard "big three" search engines: Google, Yahoo, and MSN.

In the early days of the Web, computing power and storage space were at such a premium that the few search engines that were around often indexed only a tiny fraction of Web pages, and not even full pages at that. But eventually space became relatively cheap and engines started indexing pages in full (full text), as well as more pages. Still, engines miss a lot of pages. Here's a guide to those "invisible" pages.

Background of the Invisible Web

1. The term. "Invisible" is purely search engine-centric, indicating any Web page that can be accessed by at least one person but which is not indexed in a search engine. Many people prefer the term "Deep Web" instead.
2. Its size. No one knows for sure. Danny Sullivan, a search engine expert and formerly of Search Engine Watch, wrote in 2000 that the invisible Web was about 500 times the size of Google's index of one billion pages. Newer estimates [NY Times free registration may be required] put Google's index at over 8 billion pages at the time of this writing. (Claims by its archrival Yahoo! of 19+ billion pages were considered questionable.) Search engines are said to crawl only 16-20% of the Internet.
3. Its real size. The entity most likely to be able to make any sort of "accurate" estimate is Google, though if they've made a recent estimate of either the current size or growth rate of the invisible Web, that information itself appears to be invisible. (They would have a list somewhere of never-crawled URLs, which would be a mere starting point, as there would also be all those countless URLs even they cannot get to. Without this, how could an estimate be calculated?)
4. A guesstimate. Any astute mathematician with an understanding of Web content management systems, content databases, and dynamically-served Web pages would probably say between 1 and 4 trillion pages, then conclude that an accurate estimate is nearly impossible, especially because of the rapidly increasing number of invisible sites. It's easier to compare search engine index sizes.
5. Example of the futility of estimating. A library or museum is gifted a collection of one million digital images and decides to create a Web-accessible database. Each image will have its own dynamically-served page, accessible via a query form. Just like that, one million new pages have been added to the invisible Web.
6. How many invisible sites. In the same article (above), Danny Sullivan cites BrightPlanet's estimate of 100,000 "significant invisible websites" out of about 200,000. That was in 2000, so it's a hopelessly outdated estimate. Since then, weblogs have been added to the mix, and many of them go uncrawled, increasing the number of invisible sites.
7. Rate of growth of invisible sites. Technorati's David Sifry said that 100,000 new blogs were being created daily as of October 2006, but he also said 175,000 daily as of July 2006. Even at the lower figure, if at least one page on each new blog is never indexed, the invisible Web is growing by around 36.5 million new pages per year (100,000 × 365). That doesn't even include other types of invisible content (described elsewhere in this article).
8. Will this change? Google recently filed a patent application related to searching content through Web-based forms. SEO by the SEA speculates that they are planning to index more of the invisible Web and goes on to explain a possible methodology. Google's Eric Schmidt (or possibly founders Larry Page and Sergey Brin) has said Google is dedicated to indexing the world's content, however long it takes. Also, more previously invisible pages are getting indexed because of manually-added links to them from visible pages.

9 Reasons a Web Page is Invisible

"Invisible" here does not mean a Web page is necessarily inaccessible. It simply means it's not indexed by a search engine and is thus "invisible" to a searcher who does not know of its existence. There are several reasons why a page may be invisible. Keep in mind that some pages are only temporarily invisible and may be indexed at a later date. The general rule of thumb is that just because a search engine finds no results does not mean the content isn't there.
The list below also includes examples of content types gleaned from Internet Tutorials.

1. Dynamic URLs. Engines have traditionally ignored any Web pages whose URLs have a long string of parameters, equal signs and question marks, on the off chance that they'll duplicate what's in their database or, worse, that the spider will somehow go around in circles. Danny Sullivan refers to such pages as part of the "shallow web".
2. Form-controlled entry, non-passworded. In this case, page content only gets displayed when a human applies a set of actions, mostly entering data into a form (specific query information, such as job criteria for a job search engine). This typically includes databases that generate pages on demand and hence cannot be indexed by a spider. Applicable content includes travel industry data (flight info, hotel availability), job listings, product databases, patents, publicly-accessible government information, dictionary definitions, laws, stock market data, phone books and professional directories.
3. Passworded access, subscription or non-subscription. This includes VPNs (virtual private networks) and any Web site where some pages require username and password information. Access may or may not be by paid subscription. However, BrightPlanet found in 2001 that 95% of the invisible Web is publicly accessible without fees or subscriptions. Applicable content includes academic and corporate databases, newspaper or journal content, and academic library subscriptions.
4. Time-limited access. On some sites, such as the New York Times or Marketing Profs, content becomes inaccessible without a password after a certain time. Search engines retain the URL, but the page generates a sign-up form, and the content is moved to a new URL that requires a password. Note that the content is sometimes cached by an engine. The NY Times also has alternate URLs to some time-dated content that show the original content without a password. You just have to know how to get to it.
5. Too new. If a site is relatively new, it's likely that few or none of its Web pages will be indexed by any engine. This results in the site's pages being mostly invisible for a short period of time (2-6 months).
6. Robots exclusion. The robots.txt file, which usually lives in the main directory of a Web site, tells search robots which files and directories should not be indexed. Hence its name, "robots exclusion file." If this file is set up, it will block certain pages from being indexed, and those pages will hence be invisible to searchers.
7. Flash presentation. Text content in Flash presentations is not indexed, though additional meta-information might be.
8. Geo-tagged. A site's Web server can check the supposed geographic location of a visitor's computer via its IP address. Computers from certain regions can be blocked. That may include blocking some search engines. For example, several American TV broadcasters are now showing video online, but the pages are only accessible to visitors located in the US, sometimes only in certain regions or certain states.
9. Hidden pages. One of the simplest and most common reasons for invisible Web pages is that they are hidden. That is, there is simply no sequence of hyperlink clicks that could take you to such a page.
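
Two of the reasons above, robots exclusion (item 6) and hidden pages (item 9), can be illustrated with a toy link-following spider. This is only a rough sketch under stated assumptions: the six-page site in SITE_LINKS, the ROBOTS_TXT contents, the crawl function and the "toy-spider" agent name are all hypothetical, invented for illustration. Real search engine spiders are far more involved. The only library used is Python's standard urllib.robotparser module.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical site: each path maps to the paths it links to.
SITE_LINKS = {
    "/": ["/about", "/products"],
    "/about": ["/"],
    "/products": ["/products/widget", "/private/report"],
    "/products/widget": [],
    "/private/report": [],   # linked, but excluded by robots.txt below
    "/unlinked-page": [],    # no link leads here: a "hidden" page
}

# Hypothetical robots.txt contents for the same site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def crawl(links, robots_txt, start="/", agent="toy-spider"):
    """Return the set of paths a link-following spider would index."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    indexed = set()
    queue = deque([start])
    while queue:
        path = queue.popleft()
        # Skip pages already seen and pages robots.txt tells us to avoid.
        if path in indexed or not rules.can_fetch(agent, path):
            continue
        indexed.add(path)
        queue.extend(links.get(path, []))
    return indexed


indexed = crawl(SITE_LINKS, ROBOTS_TXT)
invisible = sorted(set(SITE_LINKS) - indexed)
print(invisible)  # ['/private/report', '/unlinked-page']
```

In this sketch, /private/report stays out of the index because robots.txt excludes it, and /unlinked-page stays out because no chain of links starting from the home page reaches it.
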
Hidden pages are accessible, but only people who know of their existence know how to view them.

10 Ways to Make Invisible Content Visible

We have discussed what type of content is invisible and where we might find such information. Now imagine if there were some way to make some of that invisible content more visible. That's possible for some Web pages.

1. Do a static dump. If you have a small database of content, you may want to simply dump it out to one static HTML page, with relevant formatting and necessary hyperlinks, then link to this static page from an already "visible" (indexed) page.
2. Do categorized database publishing. If you have a database of, say, products, you could publish select information to static category and overview pages, thereby making content available without form-based or query-generated access. Of course, this works best for information that does not become outdated. Job listings, for example, may not suit this method.
3. Convert formats. Word processor documents, spreadsheets, slideshows, PDFs, audio and video all used to be part of the invisible Web. However, Google and other text search engines started indexing their contents a few years ago, adding to the available pages of the visible Web. The benefit to librarians, researchers and others is that it's now easier to find a particular piece of text. But if you have a format such as Flash, which isn't indexed, you could publish a static version of the text content to supplement the rich media.
4. Transcribe information. Have audio or video content such as a podcast? Transcribe the information and publish it as supplementary text.
5. Build links. Link to your own pages from other related pages. If you write about, say, trees on page A, then write about trees again on page B, link from page B to page A to give A more relevance. If page A hasn't been indexed, a spider that indexes B can follow the link to A. Points 6-9 are alternate ways to build links, hence helping make content visible.
6. Publish a sitemap. Not the new XML kind that the Big 3 search engines agreed to a standard on, but an HTML page that maps out the main sections of your site. This is essentially a way to build links (#5). Each main section will in turn link to specific pages. The result is that a spider has a relevance map with which to decide what to index. Then again, you can also use the new type of sitemap to achieve deep indexing. Chris Pearson offers a sitemap generator and template.
7. Build a topic pyramid. This is a specialized form of sitemap that actually spans many pages. The apex (top-most) page has general topics and links to the next layer of pages, which have more specific topics and links to the next layer. The bottom-most layer of the topic pyramid is made up of your original Web pages or blog posts, which have the most specific content. This method builds page relevance via the serial linking, which encourages spiders to visit and index.
8. Write about it elsewhere. This is a form of link-building. When someone writes about an invisible page and links to it, it becomes visible by proxy, once an engine follows through and indexes it.
9. Socially bookmark it. If you find something you like, say a book at Project Gutenberg, bookmark the URL at a social bookmarking site such as Del.icio.us and add a brief description.
10. Remove access restrictions.
Get rid of the need to log in, or don't apply time limits.

How to Access and Search for Invisible Content

If a site publisher does none of the above to make their content more accessible, there are still ways to get at the content, if not the actual pages.

Imagine if there were a search engine that could help you access some of the invisible Web. It would have an advantage over traditional engines. Well, there's more than one such engine, and even traditional engines are making a move in that direction. The larger engines already index rich media such as PDF files, word processor documents, spreadsheets, etc.

Invisible Web engines have taken a different approach, collaborating with Web site publishers to index the otherwise invisible content. But for invisible content that cannot and/or should not be made visible, there are still a number of ways to get access:

- Be a student, alumnus, or professor to gain access to university records and library journals.
- Be an employee of a company with a VPN over the Internet.
- Request access. This might be as simple as signing up for free.
- Pay for a subscription.
- Request a "dump" page of a database. Sometimes a request to the right person will get you this data.
- Use a Deep Web engine, portal, or directory.

To actually search for effectively invisible content:

1. Use a site's search engine. These tend not to be as robust for complex query terms, and are usually quite literal about the search string, but they are more likely than a regular engine to show you where invisible content is.
2. Use site archive navigation. On weblogs in particular, you can use the archive links to find info, albeit through manual searching.
3. Use the word "database". Using the word "database" in your regular search engine query will often find you information that is otherwise nearly impossible to find. For example, if you are looking for a database of images, you can type the search string "images database" into Google or one of the other engines. Somewhere down the results list in Google, you'll find Full-Text Database Images from the USPTO (US Patent and Trademark Office). You can then use the Quick or Advanced search forms to find patents relating to one or more terms. If there are images to be seen, there will be links to them.
4. Use a suitable resource. Use an "invisible Web" directory, portal or specialized search engine such as Google Book Search, Google Scholar, Librarians' Internet Index, or BrightPlanet's Complete Planet (70,000 searchable databases and specialty search engines).

15 Invisible Web Search Tools

BrightPlanet estimated in 2001 that in excess of 200,000 "Deep Web" sites existed. They found that 60 of the largest of these sites collectively contained 40 times the pages in the surface Web (at the time), and that, despite being invisible to the engines, these sites received a significant amount of traffic. Here is a small sampling of invisible Web search tools (directories, portals, engines) to help you find some invisible content. To see more like these, please look at our Research Beyond Google article.

1. Deep Web Search Engine: Clusty.
2. Art: Musée du Louvre.
3. Books Online: The Online Books Page.
4. Business: Explorit Now!
5. Consumer: US Consumer Products Safety Commission Recalled Products.
6. Economic and Job Data: FreeLunch.com, a searchable directory of free economic data.
7. Finance and Investing: Bankrate.com.
8. General Research: GPO's Catalog of US Government Publications.
9. Government Data: Copyright Records (LOCIS).
10. International: International Data Base (IDB).
11. Law and Politics: THOMAS (Library of Congress).
12. Library of Congress: Library of Congress.
13. Medical and Health: PubMed.
14. Science: ScienceResearch.com.
15. Transportation: FAA Flight Delay Information.

References and Resources

These are relevant references that are not linked to above, which may be of interest to writers and researchers. There's a strong leaning toward research papers here, some of which have dozens of links to PDF documents on the technical aspects of accessing, indexing and retrieving deep Web content. A few references below are to companies offering "internet intelligence" tools and software.

1. About WebSearch: Christmas 2006 web search guide.
2. About WebSearch: The deep web (find out more about the deep web, deep web search).
3. ALA: American Library Association.
4. BrightPlanet: FAQ.
5. Deep Web Research: A gigantic list of resources.
6. Deep Web Technologies.
7. Ellipsis: Metadata, Google, and the Invisible Web.
8. Envisional.
9. Google Librarian Center.
10. Google Library Project.
11. Lifehacker: How to search the invisible web.
12. MediaBistro: Some resources for freelancers.
13. MetaQuerier: Exploring and integrating the deep web.
14. QProber: Classifying and searching hidden-web text databases.
15. The Invisible Web Weblog.
16. University of California, Berkeley: Invisible or deep web.