Metadata and the web



  1. Richard Sapon-White, March 18, 2013
  2. • The growth of the Web
     • Metadata in the context of the Web
     • Important metadata schemes: XML, HTML, MARC
  3. • From 1996 to 2007:
       ◦ The Web grew from 77,138 Web sites to 125 million Web sites
     • Access is provided through search engines
       ◦ Google is the most used search engine
     • Search engines use Web crawlers (a.k.a. spiders or robots) to collect information on Web sites
       ◦ Crawlers copy Web pages and their locations to build a catalog of indexed pages
  4. • Invisible Web (a.k.a. Deep Web)
     • Web crawlers cannot:
       ◦ submit queries to databases,
       ◦ parse file formats that they do not recognize,
       ◦ click buttons on Web forms, or
       ◦ log in to sites requiring authentication
     • Therefore, much of the information on the Web is invisible!
       ◦ How much is invisible?
       ◦ Thousands of times larger than the indexed/visible Web!
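The crawler behavior described on the two slides above can be sketched in a few lines. This is a toy illustration only, not how any particular search engine works: the `SITE` dictionary is a hypothetical in-memory stand-in for the Web, and `LinkExtractor` and `crawl` are names made up for this example. The crawler follows hyperlinks and copies each page with its location, so the page that exists only as the result of a form query is never indexed.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory "Web": location -> HTML.
# A real crawler would fetch these pages over HTTP.
SITE = {
    "/": '<a href="/about">About</a> <form action="/search"><input name="q"></form>',
    "/about": '<a href="/">Home</a>',
    "/search?q=metadata": "<p>Database results, reachable only by submitting the form</p>",
}


def crawl(start):
    """Breadth-first crawl that follows hyperlinks only and returns the catalog."""
    catalog = {}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in catalog or url not in SITE:
            continue
        catalog[url] = SITE[url]  # copy the page and its location
        parser = LinkExtractor()
        parser.feed(SITE[url])
        queue.extend(parser.links)  # forms are ignored: only <a href> is followed
    return catalog


indexed = crawl("/")
print(sorted(indexed))
```

The search-results page stays out of `indexed` because nothing links to it; it only comes into being when a user submits the form, which is the situation the slide calls the Invisible Web.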
  5. • Topic Databases: subject-specific aggregations of information, such as SEC corporate filings, medical databases, patent records, etc.
     • Internal site: searchable databases for the internal pages of large sites that are dynamically created, such as the knowledge base on the Microsoft site.
     • Publications: searchable databases for current and archived articles.
     • Shopping/Auction.
     • Classifieds.
     • Portals: broader sites that included more than one of these other categories in searchable databases.
     • Library: searchable internal holdings, mostly for university libraries.
     • Yellow and White Pages: people and business finders.
     • Calculators: while not strictly databases, many do include an internal data component for calculating results. Mortgage calculators, dictionary look-ups, and translators between languages are examples.
     • Jobs: job and resume postings.
     • Message or Chat.
     • General Search: searchable databases most often relevant to Internet search topics and information.
     From: Michael K. Bergman, "The Deep Web: Surfacing Hidden Value," Journal of Electronic Publishing 7, no. 1 (August 2001).
  6. • Poor site design results in invisible Web sites
     • To create Web sites for human and machine retrieval:
       ◦ Use hyperlinked hierarchies of categories
       ◦ Contribute Deep Web collections’ metadata to union catalogs (which can then be indexed by search engines)
     • Google’s Sitemap can provide a detailed list of pages on a site
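A Sitemap is an XML file that lists a site's page URLs so that crawlers do not have to discover them by following links. The following is a minimal sketch of generating one, assuming the public Sitemap protocol namespace and its `urlset`/`url`/`loc` elements; the `build_sitemap` helper and the example.org URLs are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the public Sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a Sitemap document (as a string) with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, including one that is not linked from anywhere else.
pages = ["https://example.org/", "https://example.org/collections/maps"]
sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```

Listing a page here gives a crawler its location even when no hyperlinked path leads to it.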
  7. • Create conventional, MARC-based metadata
     • Access via library catalogs, union catalogs
     • Problems:
       ◦ Creating MARC records is labor-intensive, slow, and expensive
       ◦ Web sites are dynamic (content, URLs), so their MARC records must be revised
     • Solutions:
       ◦ Dublin Core
       ◦ META tags
       ◦ Resource Description Framework
  8. Dublin Core PowerPoint
  9. • Embed 2 metadata elements in the HTML <head> section of a Web page
       ◦ Keywords
       ◦ Description
     • Example:
       ◦ <META NAME="KEYWORDS" CONTENT="data standards, metadata, Web resources, World Wide Web, cultural heritage information, digital resources, Dublin Core, RDF, Semantic Web">
         <META NAME="DESCRIPTION" CONTENT="Version 3.0 of the site devoted to metadata: what it is, its types and uses, and how it can improve access to Web resources; includes a crosswalk.">
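To show how a program might read the two META elements from the slide above, here is a small sketch using Python's standard `html.parser` module. The `MetaTagReader` class is a hypothetical name, and the `PAGE` string is an abbreviated version of the slide's example rather than the real page. This only illustrates extraction; it does not claim that any search engine uses these tags in a particular way.

```python
from html.parser import HTMLParser


class MetaTagReader(HTMLParser):
    """Collects name/content pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <META NAME=...> matches.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


# Abbreviated, illustrative page based on the slide's example.
PAGE = """<html><head>
<META NAME="KEYWORDS" CONTENT="data standards, metadata, Dublin Core, RDF">
<META NAME="DESCRIPTION" CONTENT="A site devoted to metadata.">
</head><body></body></html>"""

reader = MetaTagReader()
reader.feed(PAGE)
keywords = [k.strip() for k in reader.meta["keywords"].split(",")]
print(keywords)
print(reader.meta["description"])
```

The keywords arrive as one comma-separated string, so splitting and trimming them is left to whoever consumes the metadata.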