The Mysteries of Metadata


Amit Sheth, "The Mysteries of Metadata,"
Workshop (Tutorial) at Content World 2001, Burlingame, CA. May 15, 2001

Published in: Education

  1. 1. The Mysteries of Metadata
     Workshop at Content World 2001, Burlingame, CA. May 15, 2001
     Amit Sheth
     Founder/CEO, Taalee [Taalee is now Semagix]
     Also Director, Large Scale Distributed Information Systems (LSDIS) Lab, University of Georgia
     Metadata Extraction is a patented technology of Taalee, Inc. Semantic Engine and WorldModel are trademarks of Taalee, Inc.
     Confidential HP
  2. 2. Workshop Agenda
     • What is Metadata?
     • Metadata Descriptions and Standards
     • Metadata Storage/Exchange/Infrastructure
     • (Automated) Metadata Creation/Extraction/Tagging
     • Metadata Usage/Applications
  3. 3. What is Metadata?
     • Data about data
     • Statements, contexts
     • Recursive: data about “data about data”
     • Applications: content management, cataloguing, information retrieval, search, …
     "A Web content repository without metadata is like a library without an index." - Jack Jia, IWOV
  4. 4. Information Interoperability: key metadata objective and benefit
     [Figure: interoperability levels System, Syntax, Structure, Semantics, labeled with Protocols; Metadata; Domain Modeling, Ontologies]
  5. 5. Semantics
     • Meaning, Understanding
     • Facts, Context, Reasoning
     • Related to: exchange, usage, application
  6. 6. A metadata classification
     Layers, from the user down to the data:
     • User
     • Ontologies, Classifications, Domain Models
     • Domain Specific Metadata: area, population (Census); land-cover, relief (GIS); concept descriptions from ontologies
     • Domain Independent (structural) Metadata (C++ class-subclass relationships, HTML/SGML Document Type Definitions, C program structure, ...)
     • Direct Content Based Metadata (inverted lists, document vectors, WAIS, Glimpse, LSI)
     • Content Dependent Metadata (size, max colors, rows, columns, ...)
     • Content Independent Metadata (creation-date, location, type-of-sensor, ...)
     • Data (Heterogeneous Types/Media)
     Callout: Move in this direction to tackle information overload!!
  7. 7. Types of Metadata for digital media
     • Media type-specific metadata, e.g., texture of images, font size, …
     • Media processing-specific metadata, e.g., search, retrieval, personalized filtering
     • Content-specific metadata, e.g., rocket-related video and documents
  8. 8. Metadata for Digital Data
     Metadata | Data Type | Metadata Type
     Q-Features [Jain and Hampapur] | Image, Video | Domain Specific
     R-Features [Jain and Hampapur] | Image, Video | Domain Independent
     Meta-Features [Jain and Hampapur] | Image, Video | Content Independent
     Impression Vector [Kiyoki et al.] | Image | Content Descriptive
     NDVI, Spatial Registration [Anderson and Stonebraker] | Image | Domain Specific
     Speech Feature Index [Glavitsch et al.] | Audio | Direct Content Based
     Topic Change Indices [Chen et al.] | Audio | Direct Content Based
     Document Vectors [Deerwester et al.] | Text | Direct Content Based
     Inverted Indices [Kahle and Medlar] | Text | Direct Content Based
     Content Classification Metadata [Bohm and Rakow] | MultiMedia | Domain Specific
     Document Composition Metadata [Bohm and Rakow] | MultiMedia | Domain Independent
     Metadata Templates [Ordille and Miller] | Media Independent | Domain Specific
     Land Cover, Relief [Sheth and Kashyap] | Media Independent | Domain Specific
     Parent Child Relationships [Shklar et al.] | Text | Domain Independent
     Contexts [Sciore et al., Kashyap and Sheth] | Structured | Domain Specific
     Concepts from Cyc [Collet et al.] | Structured | Domain Specific
     User’s Data Attributes [Shoens et al.] | Text, Structured | Domain Specific
     Domain Specific Ontologies [Mena et al.] | Media Independent | Domain Specific
  9. 9. Types of Specs and Standards (or MetaModels)
     • Domain Independent: (MCF), RDF, MOF, Dublin Core
     • Media Specific: MPEG4, MPEG7, VoiceXML
     • Domain/Industry Specific (metamodels): MARC (Library), FGDC and UDK (Geographic), NewsML (News), PRISM (Publishing)
     • Application Specific: ICE (Syndication)
     • Exchange/Sharing: XCM, XMI
     • Orthogonal/(Other): RDFS, namespaces, ontologies, domain models, (DAML, OIL)
  10. 10. What can RDF do for metadata?
     • Designed to impose structural constraints on syntax to support consistent encoding, exchange, and processing of metadata.
     • A domain-independent metadata standard.
  11. 11. RDF (Resource Description Framework)
     [Figure: Resource --Property--> Value]
     • RDF data consists of nodes and attached attribute/value pairs.
     • Nodes can be any web resources (pages, servers, basically anything for which you can give a URI), even other instances of metadata.
     • Attributes are named properties of the nodes, and their values are either atomic (text strings, numbers, etc.) or other resources or metadata instances.
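The node/property/value model described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Resource` and `Triple` types and the `describe` helper are hypothetical names invented here, not part of any RDF library, and the sample statements reuse the URI:TALK and URI:AMIT labels from the RDF examples in this deck.

```python
from typing import NamedTuple, Union


class Resource(NamedTuple):
    """A node: anything that can be named by a URI (hypothetical helper type)."""
    uri: str


class Triple(NamedTuple):
    """One statement: a resource, a named property, and a value."""
    subject: Resource
    prop: str
    # A value is either atomic (string, number) or another resource.
    value: Union[Resource, str, int, float]


def describe(triples, subject):
    """Collect all property/value pairs attached to one node."""
    return {t.prop: t.value for t in triples if t.subject == subject}


talk = Resource("URI:TALK")
amit = Resource("URI:AMIT")

triples = [
    Triple(talk, "dc:title", "Mysteries of Metadata"),  # atomic value
    Triple(talk, "dc:creator", amit),                   # value is a resource
    Triple(amit, "BIB:Name", "Amit Sheth"),             # the resource has its own properties
]

print(describe(triples, talk))
print(describe(triples, amit))
```

Because a value can itself be a `Resource`, descriptions chain into a graph rather than a flat record, which is the point of the second bullet above.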
  12. 12. RDF Example 1
     [Figure: URI:TALK --dc:title--> Mysteries of Metadata; URI:TALK --dc:creator--> URI:AMIT]
     <?xml version="1.0"?>
     <rdf:RDF xmlns:rdf=""
              xmlns:dc="">
       <rdf:Description rdf:about="URI:TALK">
         <dc:title>Mysteries of Metadata</dc:title>
         <dc:creator rdf:resource="URI:AMIT"/>
       </rdf:Description>
     </rdf:RDF>
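To make the example concrete, here is a hedged Python sketch that reads such a description back into (resource, property, value) statements using only the standard library. The namespace URIs are empty in this transcript, so the sketch assumes the standard W3C RDF syntax namespace and the Dublin Core element set namespace; `extract_triples` is a hypothetical helper, and it handles only the simple `rdf:Description` form shown on the slide, not RDF/XML in general.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces: the transcript leaves xmlns:rdf and xmlns:dc empty.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

RDF_XML = f"""<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="URI:TALK">
    <dc:title>Mysteries of Metadata</dc:title>
    <dc:creator rdf:resource="URI:AMIT"/>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(rdf_xml):
    """Return (subject, property, value) tuples for each rdf:Description."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # A property either points at another resource or holds literal text.
            resource = prop.get(f"{{{RDF_NS}}}resource")
            value = resource if resource is not None else (prop.text or "").strip()
            triples.append((subject, prop.tag, value))
    return triples


for triple in extract_triples(RDF_XML):
    print(triple)
```

ElementTree reports property names in `{namespace}localname` form, so `dc:title` appears as `{http://purl.org/dc/elements/1.1/}title`.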
  13. 13. RDF Example 2
     [Figure: RDF graph. URI:TALK has dc:title "Mysteries of Metadata" and dc:creator URI:AMIT; URI:AMIT has the properties BIB:Aff, BIB:Email, and BIB:Name; remaining node labels: URI:LIB, Amit Sheth]
  14. 14. RDFS (RDF Schema)
     • Enables resource description communities to define (and share) vocabularies (museum, library, e-commerce, …)
     • Vocabulary (in RDFS) = the meaning, characteristics, and relationships of a set of properties.
  15. 15. RDF Based Web
     [Figure: layers labeled RDF Schemas, RDF/XML Descriptions, Resources, HTML]
  16. 16. Dublin Core Metadata Initiative
     • Simple element set designed for resource description
     • International, inter-discipline, W3C community consensus
     • “Semantic” interface among resource description communities (very limited form of semantics)
  17. 17. Dublin Core RDF
     <xml>
     <?namespace href = "" as = "RDF">
     <?namespace href = "" as = "DC">
     <RDF:Abbreviated>
       <RDF:Assertion RDF:HREF = ""
         DC:Title = "I've Never Metadata I've Never Liked"
         DC:Creator = "Mary Crystal"
         DC:Subject = "Metadata, Dublin Core, Stuff"/>
     </RDF:Abbreviated>
     </xml>
  18. 18. MOF (Meta Object Facility) and XMI
     • MOF models metadata using the subset of UML that is relevant to modeling metadata (class models: classes, associations, and subtyping), together with a set of rules for mapping the elements of the MOF Core to CORBA IDL.
     • XML Metadata Interchange (XMI) is an extension of the MOF into the XML space.
  19. 19. NewsML
     • NewsML is a packaging and metadata format for news content.
     • NewsML is developed by the International Press Telecommunications Council (IPTC), a consortium of news providers, mostly in the print or wire-service industries.
     • Since it deals only with packaging and metadata, NewsML is complementary both to news content formats like NITF and to syndication protocols like ICE.
  20. 20. NewsML…
     • It can be used by news providers to combine their pictures, video, text, graphics, and audio files in news output available on web sites, mobile phones, high-end desktops, interactive television, and any other device.
     • It provides an accurate, objective set of description tools, which help qualify the information and make the search more precise.
     • NewsML allows a range of metadata to be attached to a multimedia story, including a detailed computer-readable description of what an item is about.
  21. 21. Example of the end-to-end flow - NewsML
     • The content provider supplies NewsML packaged media content to the operator. The content is categorized as current events, finance, sport, etc. and updated hourly.
     • The operator receives NewsML data from the content provider. The content server automatically pushes updated news articles to all news service subscribers.
     • Consumers sign up for the news service directly on the device. When using the news service, the user browses through the categories and reads the news articles. The news articles are presented in a continuous flow (one after the other) without end-user interaction.
  22. 22. PRISM
     • Publishing Requirements for Industry Standard Metadata
     • Version: 1.0, April 2001
     • Authors: IDEAlliance (Adobe, Vignette, Kinecta et al.)
     • Idea: “a standard for interoperable content description, interchange, and reuse in both traditional and electronic publishing contexts”
  23. 23. PRISM Design
     • Built on existing standards like Dublin Core (DC), RDF, XML
     • Designed to be used in a simple, straightforward way over the Internet
     • Compatible with NewsML
     • Integrates easily with ICE (for syndication)
     • Vocabulary:
       Basic: DC
       Extensions: “Controlled Vocabularies”, e.g., “North American Industry Classification System” (NAICS)
  24. 24. PRISM Example
     <?xml version="1.0" encoding="UTF-8"?>
     <rdf:RDF xmlns:prism=""
              xmlns:rdf=""
              xmlns:dc="">
       <rdf:Description rdf:about="">
         <dc:identifier rdf:resource="" />
         <dc:description>Photograph taken at 6:00 am on Corfu with two models</dc:description>
         <dc:title>Walking on the Beach in Corfu</dc:title>
         <dc:creator>John Peterson</dc:creator>
         <dc:contributor>Sally Smith, lighting</dc:contributor>
         <dc:format>image/jpeg</dc:format>
       </rdf:Description>
     </rdf:RDF>
     (Source: PRISM spec v. 1)
  25. 25. VoiceXML
     • A language for specifying voice dialogs.
     • Voice dialogs use audio prompts and text-to-speech (TTS) for output; touch-tone keys (DTMF) and automatic speech recognition (ASR) for input.
     • Goal is to bring the advantages of web-based development and content delivery to interactive voice response applications.
     • High-level voice-specific language simplifies application development.
  26. 26. Voice Based Internet Applications
     [Figure]
  27. 27. VoiceXML Metadata
     • Voice-specific metadata
     • Supports syntactic interoperability: text data to voice data
     • VoiceXML = XML + Voice Metadata
  28. 28. VoiceXML: Possible Services
     • Information retrieval: news, sports, traffic, stock quotes.
     • e-Transactions (e-commerce, e-tailing, etc.)
       Financial: banking, stock trading.
       Catalog browsing (generally as an adjunct to paper).
     • Telephone services: personal voice dialing, one-number find-me services.
     • Intranet: inventory, HR services, corporate portals.
     • Unification ("My Whatever"): personal portals, personal agents, unified messaging.
  29. 29. MPEG7
     • A set of description schemes and descriptors to describe the content of multimedia data.
     • Provides a language to specify description schemes.
     • A scheme for coding the description.
  30. 30. Application Examples for MPEG7
     A few application examples are:
     • Digital libraries (image catalog, musical dictionary, ...)
     • Multimedia directory services (e.g. yellow pages)
     • Broadcast media selection (radio channel, TV channel, ...)
  31. 31. Information and Content Exchange (ICE)
     • Main Goal: efficient and extensible Content Syndication protocol for the Internet, using XML syntax
     • Authors: Adobe, Kinecta, MS, Sun, Vignette et al.
     • Status: latest spec version 1.1, May 2000; submitted to W3C for review
     • Implementations: Vignette Syndication Server, MS BizTalk, Kinecta Interact, …
  32. 32. What is the ICE Protocol?
     • Syndication Protocol for communication between Syndicators and Subscribers
     • Metadata to define:
       roles and responsibilities of involved parties: Subscriber vs. Syndicator, Requestor vs. Responder, Sender vs. Receiver
       format and method of content exchange (e.g., sequenced packages, pull vs. push model)
  33. 33. ICE Applications
     • ICE vocabulary + domain vocabulary = complete application
     • ICE establishes and manages the syndication, delivers data, and logs events => content-independent metadata
     • Industry-specific vocabulary defines the content => domain-specific metadata
  34. 34. ICE Explained
     • ICE: Information and Content Exchange protocol
     • Syndicator: A content aggregator and distributor
     • Subscriber: A content consumer
     • Subscription: An agreement between a subscriber and a syndicator for the delivery of content according to the delivery policy and other parameters in the agreement
     • Collection: The current content of a subscription
     • ICE Package: A delivery of commands to update a collection, such as the addition of content items
     • ICE Payload: The XML document used by ICE to carry protocol information. Examples include requests for packages, catalogs of subscription offers, usage logs, and other management information
     Sources: InternetWeek; "ICE Cookbook, version 1.0"
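The relationship between a subscription, its collection, and the packages that update it can be sketched as a small Python model. This is not the ICE wire protocol: the class names, the `add`/`remove` fields, and the use of integer sequence numbers are simplifying assumptions made here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """A delivery of commands that update a collection (hypothetical shape)."""
    sequence: int                               # assumed integer ordering
    add: dict = field(default_factory=dict)     # item-id -> content
    remove: list = field(default_factory=list)  # item-ids to drop


@dataclass
class Subscription:
    """An agreement between one subscriber and one syndicator."""
    subscriber: str
    syndicator: str
    collection: dict = field(default_factory=dict)  # current content
    last_sequence: int = 0

    def apply(self, package):
        """Apply a package only if it is the next one in sequence."""
        if package.sequence != self.last_sequence + 1:
            raise ValueError(
                f"expected package {self.last_sequence + 1}, got {package.sequence}"
            )
        for item_id in package.remove:
            self.collection.pop(item_id, None)
        self.collection.update(package.add)
        self.last_sequence = package.sequence


sub = Subscription(subscriber="subscriber-1", syndicator="syndicator-1")
sub.apply(Package(sequence=1, add={"4321": "demo.gif"}))
print(sub.collection)
```

Rejecting out-of-order packages is one simple way to keep the subscriber's collection consistent with the syndicator's view; a real implementation would follow whatever recovery rules the ICE specification defines.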
  35. 35. [ICE payload example; callout on the comic-strip element: Content (domain-specific metadata)]
     <?xml version="1.0"?>
     <!DOCTYPE ice-payload SYSTEM "http://.../ice.dtd">
     <ice-payload payload-id="ipl-80a56cfe"
                  timestamp="05-15-2001T11:00:01"
                  ice.version="1.0">
       <ice-response response-id="irp-20010515181600">
         <ice-item-group group-id="grp-8610">
           <ice-item item-id="4321"
                     subscription-element="4321"
                     name="Cartoon"
                     filename="demo.gif"
                     content-type="application/xml">
             <comic-strip title="Looney City"
                          author="Amito Pateru"
                          copyright="Taalee Makeups"
                          pubdate="20010515">
               PdXIWZQ8IiPLhHrQcrjxAQ8VquFJS8vDC … (ASCII-encoded image)
             </comic-strip>
           </ice-item>
         </ice-item-group>
       </ice-response>
     </ice-payload>
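A short Python sketch can separate the two kinds of metadata carried by such a payload: the attributes on `ice-item` that ICE uses for delivery, and the attributes of the wrapped domain element (`comic-strip` here). The sample below copies the element and attribute names from the slide but omits the DOCTYPE line and the encoded image body; `split_metadata` is a hypothetical helper, not part of any ICE toolkit.

```python
import xml.etree.ElementTree as ET

PAYLOAD = """<ice-payload payload-id="ipl-80a56cfe"
             timestamp="05-15-2001T11:00:01" ice.version="1.0">
  <ice-response response-id="irp-20010515181600">
    <ice-item-group group-id="grp-8610">
      <ice-item item-id="4321" subscription-element="4321" name="Cartoon"
                filename="demo.gif" content-type="application/xml">
        <comic-strip title="Looney City" author="Amito Pateru"
                     copyright="Taalee Makeups" pubdate="20010515">(ASCII-encoded image)</comic-strip>
      </ice-item>
    </ice-item-group>
  </ice-response>
</ice-payload>"""


def split_metadata(payload_xml):
    """For each ice-item, return delivery metadata and domain metadata separately."""
    root = ET.fromstring(payload_xml)
    items = []
    for item in root.iter("ice-item"):
        delivery = dict(item.attrib)  # ICE vocabulary: ids, filename, type
        # Child elements come from the domain vocabulary, not from ICE.
        domain = [{"element": child.tag, **child.attrib} for child in item]
        items.append({"delivery": delivery, "domain": domain})
    return items


for entry in split_metadata(PAYLOAD):
    print(entry["delivery"]["filename"], entry["domain"][0]["title"])
```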
  36. 36. XCM (eXtended Content Management)
     A framework that allows customers to classify content management offerings according to the business problems they address. The segments of XCM are:
     • Content Development: developing static content and managing the process of its subsequent approval, versioning, storage, and retrieval.
     • Application Content Management (Vignette): deploying content dynamically to a Web site and managing that content throughout its online lifecycle.
     • Content Delivery: delivering content through multiple channels to minimize customer waiting time and improve Web site stability and scalability.
     Source: ,2097,1-1-30-1458-1146-1743,00.html
  37. 37. XCM eXtended Content Management
     • Content Development: Content Authoring, Digital Asset Management, Software Configuration Management, Document Process Management
     • Application Content Management: Metadata Management, Recombination, Personalization, Caching
     • Content Delivery: Edge Network Delivery, Streaming Media Delivery
  38. 38. Multiple heterogeneous metadata models with different tag names for the same data in the same GIS domain
     Kansas State FGDC Metadata Model | UDK Metadata Model
     Theme keywords: digital line graph, hydrography, transportation... | Search terms: digital line graph, hydrography, transportation...
     Title: Dakota Aquifer | Topic: Dakota Aquifer
     Online linkage: | Adress Id:
     Spatial Reference Method: Vector | Measuring Techniques: Vector
     Horizontal Coordinate System Definition: Universal Transverse Mercator | Co-ordinate System: Universal Transverse Mercator
     … | …
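One common way to bridge two models that name the same data differently is a tag crosswalk. The Python sketch below pairs the FGDC and UDK tag names row by row, which is an assumption made for illustration rather than an official FGDC/UDK mapping; `translate` is a hypothetical helper.

```python
# Assumed pairing of tag names, read off the two columns of the slide.
FGDC_TO_UDK = {
    "Theme keywords": "Search terms",
    "Title": "Topic",
    "Online linkage": "Adress Id",  # spelled as on the slide
    "Spatial Reference Method": "Measuring Techniques",
    "Horizontal Coordinate System Definition": "Co-ordinate System",
}


def translate(record, crosswalk):
    """Rename tags via the crosswalk; report tags that have no counterpart."""
    translated, unmapped = {}, {}
    for tag, value in record.items():
        if tag in crosswalk:
            translated[crosswalk[tag]] = value
        else:
            unmapped[tag] = value
    return translated, unmapped


fgdc_record = {
    "Theme keywords": ["digital line graph", "hydrography", "transportation"],
    "Title": "Dakota Aquifer",
    "Spatial Reference Method": "Vector",
    "Horizontal Coordinate System Definition": "Universal Transverse Mercator",
}

udk_record, leftovers = translate(fgdc_record, FGDC_TO_UDK)
print(udk_record)
print(leftovers)
```

A crosswalk only renames tags. It does not reconcile differences in meaning between the models, which is where ontologies and domain models come in.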
  39. 39. Different views of Metadata
     • Domain Independent Specifications (RDF)
     • Frameworks/Infrastructures (XCM)
     • Application Specific Metadata: ICE
     • Media Specific Metadata: MPEG7, VoiceXML
     • Domain Specific Metadata: NewsML, FGDC/UDK
  40. 40. Creating and Serving Metadata to Power the Life-cycle of Content
     [Figure: Taalee Infrastructure Services and Taalee Content Applications, built on the Taalee Semantic MetaBase]
     • Stages: Produce/Aggregate, Catalog/Index, Integrate/Syndicate, Personalize, Interactive Marketing
     • Questions answered along the way:
       Where is the content? Whose is it?
       What other content is it related to?
       What is the right content for this user?
       What is the best way to monetize this interaction?
     • Delivery: Broadcast, Wireline, Wireless, Interactive TV
  41. 41. Taalee’s Intelligent Content Process HP 41
  42. 42. Metadata Creation and Semanticization
     • Automatic Content Classification/Categorization
     • Metadata Creation/Extraction: Types of metadata created
  43. 43. Forms/Types/Ingest of Content
     • Sources: Web Sites, Content Feeds, and Private Repositories
     • Types: Text, Graphics, Audio, Video, Multimedia
     • Forms: Unstructured text, Semi-structured text, Structured text (+Media); Static or Dynamic
     • Ingest: Feed (push), Web (pull), Repository/Database (usually pull)
  44. 44. Content Handling/Ingest Infrastructure/Exchange
     • Feed Handlers
     • Crawlers/Screen Scrapers/Bots
     • Software Agents
     • Centralized, Distributed, Mobile/Migratory
  45. 45. Information Extraction for Metadata Creation
     [Figure: content sources (Nexis, UPI, AP, ...; Documents, Digital Videos, Data Stores, Digital Maps, Digital Images, Digital Audios, ...) in Global/Enterprise Web Repositories feed EXTRACTORS, which produce METADATA]
  46. 46. Extracting a Text Document: Syntactic approach
     Callouts: LAYOUT; Date => day month int ‘,’ int
     Sample document:
     INCIDENT MANAGEMENT SITUATION REPORT
     Friday August 1, 1997 - 0530 MDT
     NATIONAL PREPAREDNESS LEVEL II
     CURRENT SITUATION: Alaska continues to experience large fire activity. Additional fires have been staffed for structure protection.
     SIMELS, Galena District, BLM. This fire is on the east side of the Innoko Flats, between Galena and McGr The fire is active on the southern perimeter, which is burning into a continuous stand of black spruce. The fire has increased in size, but was not mapped due to thick smoke. The slopover on the eastern perimeter is 35% contained, while protection of the historic cabin continues.
     CHINIKLIK MOUNTAIN, Galena District, BLM. A Type II Incident Management Team (Wehking) is assigned to the Chiniklik fire. The fire is contained. Major areas of heat have been mopped up. The fire is contained. Major areas of heat have been mopped-up. All crews and overhead will mop-up where the fire burned beyond the meadows. No flare-ups occurred today. Demobilization is planned for this weekend,
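The slide's rule `Date => day month int ',' int` is a purely syntactic pattern, so a regular expression is enough to apply it. The Python sketch below is one possible reading of that rule, assuming `day` is a weekday name and the two integers are day of month and four-digit year; the names `DATE_RULE` and `extract_dates` are invented for this illustration.

```python
import re

DAYS = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Date => day month int ',' int
DATE_RULE = re.compile(
    rf"\b(?P<day>{DAYS})\s+(?P<month>{MONTHS})\s+"
    rf"(?P<date>\d{{1,2}}),\s*(?P<year>\d{{4}})\b"
)


def extract_dates(text):
    """Return one metadata record per date that matches the rule."""
    return [
        {
            "day": m["day"],
            "month": m["month"],
            "date": int(m["date"]),
            "year": int(m["year"]),
        }
        for m in DATE_RULE.finditer(text)
    ]


REPORT = ("INCIDENT MANAGEMENT SITUATION REPORT Friday August 1, 1997 - 0530 MDT "
          "NATIONAL PREPAREDNESS LEVEL II")

print(extract_dates(REPORT))
```

The rule knows nothing about what the date means (report date versus event date); that distinction needs the layout or semantic knowledge discussed on the following slides.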
  47. 47. Traditional Text Categorization
     [Figure: a customer training set drives Statistical/AI Techniques; each article in the customer feed is classified, placed in a taxonomy, and passed to Routing/Distribution]
     Standard metadata for Article 4715 (classification of the article):
     • Feed Source: iSyndicate
     • Posted Date: 11/20/2000
  48. 48. Taalee’s Categorization & Automatic Metadata Creation
     [Figure: Knowledge-base & Statistical/AI Techniques, with Taalee and customer training sets, classify each article from the customer feed, place it in a taxonomy, and apply Automated Content Enrichment (ACE) to produce catalog metadata for the Taalee Enterprise Content Manager and Customization Suite]
     Taxonomy nodes shown: Company Analysis, Conference Calls, Earnings, Stock Analysis; NYSE Member Companies, Market News, IPOs
     Article 4715 metadata:
     • Standard metadata: Feed Source: iSyndicate; Posted Date: 11/20/2000
     • Semantic metadata: Company Name: France Telecom, Equant; Ticker Symbol: FTE, ENT; Exchange: NYSE; Topic: Company News
     Uses: precise syndication/filtering, routing/distribution, mapping to another taxonomy
  50. 50. Automatic Categorization & Metadata Tagging (Web page)
     [Figure: video with editorialized text on the Web, annotated with Auto Categorization and Semantic Metadata]
  51. 51. Automatic Categorization & Metadata Tagging (Feed)
     [Figure: text from Bloomberg, annotated with Auto Categorization and Semantic Metadata]
  52. 52. Taalee Extraction and Knowledgebase Enhancement
     [Figure: Web Page, Extraction Agent, Enhanced Metadata Asset]
  53. 53. Basis for Semantics
     A. Facts/Concepts/Terms/Entities
        • Dictionary, Thesaurus, Reference Data, Vocabulary
     B. Facts with Relationships
        • Taxonomy/(Categories), Ontology
        • Domain Modeling (e.g., Golf = golfer, tournament name, golf course, event)
        • Knowledge Base
  54. 54. Basis for Semantics
     C. Reasoning/Inference
        • (Statistical) (Information Retrieval)
        • Statistical Learning/AI (Bayesian, Neural Networks, HMM, …)
        • Logic Based (Description Logic)
        • Natural Language/Grammar (part of speech, ...)
  55. 55. Alternatives for Metadata Extraction
     [Figure: alternatives ordered toward deeper understanding]
     • Granularity: Word or Phrase; By topic/industry/subject/domain; By Entities and Relationships
     • Techniques: Statistical methods/Cluster Analysis; Learning/AI and Collab. Filtering; Reference data/Concept-terms/Dictionary/Thesaurus; Ontologies/Domain Models; KnowledgeBase
  56. 56. Open Directory Project (ODP): Classification/Taxonomy & Directory
  57. 57. Ontology
     • Standardize meaning, description, representation of involved attributes
     • Capture the semantics involved via domain characteristics
     • Allow knowledge sharing and reuse (Ontological Commitment)
  58. 58. Ontology Description includes
     • Attributes
     • Domain Rules
     • Functional Dependencies
  59. 59. An Ontology HP 59
  61. 61. Large Vocabularies/Taxonomies/Ontologies
     • WordNet
     • The Medical Subject Headings (MeSH): NLM's controlled vocabulary used for indexing articles, for cataloging books and other holdings, and for searching MeSH-indexed databases, including MEDLINE. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts. Year 2000 MeSH includes more than 19,000 main headings, 110,000 Supplementary Concept Records (formerly Supplementary Chemical Records), and an entry vocabulary of over 300,000 terms.
  62. 62. Metadata-enabled Applications
  63. 63. Metadata Usage: Impact on Search & Query processing
     • Traditional queries based on keywords
     • Attribute-based queries
     • Content-based queries
  64. 64. Oingo.com
     • Oingo Ontology: ODP based(?), the database of millions of concepts and relationships that powers Oingo's semantic technology
     • Oingo Seek: the database of millions of concepts and relationships that powers Oingo's semantic technology
     • Oingo Sense: the knowledge extraction tool that uncovers the essential meaning of information by sensing concepts and context
     • Oingo Lingua: the language of meaning used to state intent. The basis for intelligent interaction
     • Assets catalogued are Web sites or Web pages.
  65. 65. Use of Categories for Search After 3 or 4 clicks HP 65
  66. 66. Metadata is the basis of making Content Intelligent
     • Precisely what the user asked for
     • Closely-related, high-value information beyond what was requested
     • Ability to explore any dimension around the immediate point of interest
     • Intelligent content helps the user “think” about and fulfill their information needs with less effort.
     • Intelligent content can be more effectively managed, packaged, and distributed
  67. 67. Metadata and Intelligent Content
     Taalee makes content more “intelligent” through automatic analysis of every individual asset to generate a catalog containing:
     • Context of the Content
     • Semantic Metadata describing entities (i.e., Company, Industry, etc.), and
     • Relationships (semantic associations) among all entities
     Based on a “Semantic” or “domain” model describing how the user thinks about the subject matter, supported by a knowledgebase.
     [Figure: content + metadata and relationships = Intelligent Content]
     “Normal” content can only be “found” if the user enters a keyword that exists within it. Adding related metadata and relationships dramatically increases the ability to automatically access needed content via multiple dimensions.
  68. 68. More than metadata
     Taalee makes content more “intelligent” through automatic analysis of every individual content item to create:
     • Context of the Content
     • Semantic Metadata describing entities (i.e., Company, Industry, etc.), and
     • Relationships (semantic associations) among all entities
     Based on a “Semantic” or “domain” model describing how the user thinks about the subject matter, supported by a knowledgebase.
  69. 69. Metadata & Search
     • Metadata can improve search significantly, but metadata enables much more than search
     • Alternatives for improving search: clustering, link and other analysis (e.g., Google’s Link Flux analysis), classification as context, ontologies, metadata, knowledgebases, …
  70. 70. Metadata Usage: Keyword, Attribute and Content Based Access
  71. 71. Keyword Search vs Attribute Search with Semantic metadata
     Metadata from Typical Cataloging of Football Assets (Virage search on "football touchdown"):
     • Brian Griese Interview Part Four. Brian Griese talks about the first touchdown he ever threw. URL: http://cbs.sportsline...
     • Jimmy Smith Interview Part Seven. Jimmy Smith explains his philosophy on showboating. URL: http://cbs.sportsline...
     Taalee Metadata on Football Assets (Rich Media Reference Page):
     • Baltimore 31, Pit 24. Quandry Ismail and Tony Banks hook up for their third long touchdown, this time on a 76-yarder to extend the Raven’s lead to 31-24 in the third quarter.
       League: Professional
       Teams: Ravens, Steelers
       Score: Bal 31, Pit 24
       Players: Quandry Ismail, Tony Banks
       Event: Touchdown
       Produced by:
       Posted date: 2/02/2000
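The contrast on this slide can be sketched in Python: a keyword search matches any asset whose text mentions a word, while an attribute search matches on structured metadata fields. The asset records below are trimmed from the slide; the record layout and the `keyword_search` and `attribute_search` helpers are assumptions made for illustration and do not represent the Virage or Taalee products.

```python
ASSETS = [
    {
        "title": "Brian Griese Interview Part Four",
        "text": "Brian Griese talks about the first touchdown he ever threw.",
        "metadata": {},  # typical cataloging: no semantic attributes
    },
    {
        "title": "Baltimore 31, Pit 24",
        "text": "Quandry Ismail and Tony Banks hook up for their third long touchdown.",
        "metadata": {
            "League": "Professional",
            "Teams": ["Ravens", "Steelers"],
            "Players": ["Quandry Ismail", "Tony Banks"],
            "Event": "Touchdown",
        },
    },
]


def keyword_search(assets, keyword):
    """Match any asset whose text contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [a["title"] for a in assets if needle in a["text"].lower()]


def attribute_search(assets, **criteria):
    """Match assets whose metadata satisfies every attribute=value criterion."""
    def satisfies(asset):
        for attr, wanted in criteria.items():
            value = asset["metadata"].get(attr)
            values = value if isinstance(value, list) else [value]
            if wanted not in values:
                return False
        return True
    return [a["title"] for a in assets if satisfies(a)]


# The keyword matches an interview that merely mentions a touchdown.
print(keyword_search(ASSETS, "touchdown"))
# The attribute query returns only the clip that actually is a touchdown event.
print(attribute_search(ASSETS, Event="Touchdown", Teams="Ravens"))
```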
  72. 72. Taalee’s Semantic Search
     • Highly customizable, precise, and freshest A/V search
     • Delightful, relevant information, exceptional targeting opportunity
     • Context and Domain Specific Attributes
     • Uniform Metadata for Content from Multiple Sources; can be sorted by any field
  73. 73. What can a context do?
     Creating a Web of related information
  74. 74. Taalee Directory
     [Screenshot: query "Georgia Bulldogs"; the system recognizes ENTITY & CATEGORY]
  75. 75. Taalee Directory
     [Screenshot: query "Careless whisper"]
  76. 76. Semantic Relationships HP 76
  77. 77. Metadata Application Example
     Semantic Applications for highly relevant and fresh content: Personalization and Targeting/interactive marketing
     Please contact Taalee for live demonstrations
  78. 78. Personalized Directory
     [Screenshot: Change Context]
     Obtain a whole universe of information (that you may not even have thought of) about some entities that have always been of interest to you. Please enter such semantic keywords below.
  79. 79. Personalized Queries & Hot Topics Personalized Queries 1. My Stock Portfolio Microsoft suffers serious hack attack Cisco Systems Inc PERSONALIZATION Analyst Safa Rashtchy on Yahoo! PeopleSoft, Inc AT&T Corp. more… 2. My Football Fantasy Team Gators Spurrier ready for big game Techs Vick looks to become complete QB Bucs excited about Hamilton HOT Topics!!! Jasper Sanks rumbles into the end zone… Edwards explains reasons for leaving BYU 1. Election 2000 more… Video: Explaining the electoral map 3. Julia Roberts Collection Race for White House hots up Movie Trailer: "Notting Hill" Gore Florida Edge Seniors Give more… Trailer - Runaway Bride 2. Middle East Peace Conflict Patrick Movie Trailer: "Stepmom" Israel steps up security More die as Israel braces for suicide bombs Conspiracy Theory more… Pentagon probes Coles security more… 4. Pink Floyd Collection 3. Napster Controversy Set the Controls for the Heart of the Sun… Wish You Were Here Brain Behind Napster The Napster Lawsuit Round And Around Keep Talking Creative Nomad II more… The Post War Dream more…
  80. 80. Metadata: Targeting HP 80
  81. 81. Semantic/Interactive Targeting
     • Buy Al Pacino Videos
     • Buy Russell Crowe Videos
     • Buy Christopher Plummer Videos
     • Buy Diane Venora Videos
     • Buy Philip Baker Hall Videos
     • Buy The Insider Video
     Precisely targeted through the use of Structured Metadata and integration from multiple sources
  82. 82. Web: Extreme Personalization
     [Figure: realtime feeds, Web sites and pages, and databases flow through a Content Aggregator into the Semantic Engine(TM) and a Structured, Hi-Quality Semantic Metabase; interests, preferences, and time-shifted content drive delivery of Personalized Content]
  83. 83. Application of Semantic Metadata and Automatic Content Enrichment
     [Screen: MyMedia menu with $ MyStocks, News, Sports, Music]
     • User has already completed Web-based registration and personalization at Voquette’s Enterprise Customer site.
     • User’s “Wireless Home page” shows the categories for his interests.
     • There is an alert (new content) for his stock and sports categories.
  84. 84. Application of Semantic Metadata and Automatic Content Enrichment
     [Screen: My Stocks list with CSCO, NT, IBM, Market]
     • Clicking on MyStocks brings down user’s Personal Portfolio list. The user wants to see news items about Cisco (see next slide).
     • Search at the bottom is a semantic search that understands the financial domain, and the knowledge of user’s portfolio.
     • Typically search can be done by typing one word or selecting from a dynamic, personalized menu.
  85. 85. Application of Semantic Metadata and Automatic Content Enrichment
     [Screen: CSCO menu with Analyst Call, Conf Call, Earnings]
     • Different types of recent audio content about Cisco are available.
     • The user clicks to see a listing of Analyst Calls on Cisco (next slide).
     • Icons at the bottom of the screen enable contextually relevant functions: listen, set alert on story, add to playlist.
  86. 86. Application of Semantic Metadata and Automatic Content Enrichment
     [Screen: CSCO Analysis listing: 11/08 ON24 Payne; 11/07 ON24 H&Q; 11/06 CBS Langlesis]
     • Clicking on the link for Cisco Analyst Calls displays a listing sorted by date.
     • Semantic filtering uses just the right metadata to meet screen and other constraints. E.g., Analyst Call focuses on the source and analyst name or company.
     • The icon denotes additional metadata, such as “Strong Buy” by H&Q Analyst.
  87. 87. iTV: Taalee’s Extreme Personalization
     [Figure: immediate interests and preferences, together with a Content Provider (DBS, DISH, Wink, AOL-TV), drive Personalized Content Capsules, Redirects and “Programs”; Meta-Data Tagged Content and Programming come from the Semantic Engine(TM) and the Structured, Hi-Quality Semantic Metabase]
  88. 88. Metadata for Automatic Content Enrichment: Interactive Television
     • This screen is customizable with interactivity feature using metadata such as whether there is a new Conference Call video on CSCO.
     • Part of the screen can be automatically customized to show conference call specific information, including transcript, participation, etc., all of which are relevant metadata.
     • Conference Call itself can have embedded metadata to support personalization and interactivity.
     • This segment has embedded or referenced metadata that is used by personalization application to show only the stocks that user is interested in.
  89. 89. Metadata in Enterprise Apps
     [Figure: pipeline with stages Collection, Processing, Production, Support]
     • Collection: Sony Network Content, Affiliate Feeds, Public Sources
     • Processing: Categorize, Catalog, Integrate into a Rich Data Metabase
     • Production and Support: Filter, Search, Consolidate, Personalize, Archive, Licensing, Syndication
  90. 90. [Screenshot: customizable news page (Customize: Page Settings | Content | Layout | Color) showing a video story, "Sixty Die In Nigeria Blast", with its article text and a Breaking News list for 11/30/2000: Gore Demands That Recount Restart (9:40 PM); Gore Says Fla. Cant Name Electors (4:50 PM); Bush Meets Colin Powell at Ranch (1:22 PM); Market Tumbles on Earnings Warning (9:27 AM); Barak Outlines His Peace Plan (6:30 AM)]
     Metadata shown on the page:
       Produced by: Euronews
       Posted Date: 11/30/2000
       Event: Election 2000
       Location: Tallahassee, Florida, USA
       People: Al Gore, George W. Bush
     • Greatly enhances news-room productivity and time-to-market
     • Value-add for production, broadcast & syndication
     • Taalee’s semantic metadata enables powerful access to content used by Enterprise’s customers
  91. 91. [Screenshot: story page with a CNN article on the Florida recount (TALLAHASSEE, Florida), a Breaking News list, and related video clips listed by duration, date (12/06/00), and source (ABC, CBS, FOX, NBC, NNS)]
     Description metadata shown on the page:
       Produced by: CNN
       Posted Date: 12/07/2000
       Reporter: David Lewis
       Event: Election 2000
       Location: Tallahassee, Florida, USA
       People: Al Gore
  92. 92. Metadata’s role in emerging iTV infrastructure. [Diagram labels] Video; MPEG Encoder; Enhanced Digital Cable, MPEG-2/4/7; MPEG Decoder; ☺☺☺ GREAT USER EXPERIENCE. Create Scene Description Tree (Node = AVO Object); Retrieve Scene Description Track. Channel sales through Video Server Vendors, Video App Servers, and Broadcasters. License metadata decoder and semantic applications to device makers. Scene Description Tree; Taalee Semantic Engine; Enhanced XML Description; “Cisco Systems”; “Cisco Systems” Node; Metadata-rich Value-added Node; Object Content Information (OCI). Produced by: Fox Sports. Creation Date: 12/05/2000. League: NFL. Teams: Seattle Seahawks, Atlanta Falcons. Players: John Kitna. Coaches: Mike Holmgren, Dan Reeves. Location: Atlanta. HP 92
  93. 93. Intelligent Metadata Creation / Usage: Metadata for Intelligent Content. Content which does contain the words the user asked for (Extractor Agents) + Content which does not contain the words the user asked for, but is about what he asked for (Value-added Metadata) + Content the user did not think to ask for, but which he needs to know (Semantic Associations). HP 93
  94. 94. Intelligent Content via Value-Added Metadata HP 94
  95. 95. Value-added Metadata Traditional methods rely solely on (syntactic) indexing of keywords to enable users to access content • If a keyword is not in the content, the content cannot be found. • The burden is on the user to think of and ask for the “right” keyword. For example: If a story is about “Roger Clemens” but does not contain the words “New York Yankees”, that story cannot and will not be found if the user searches for “New York Yankees” or “Yankees”. Understanding of the content is needed to create new metadata. Taalee understands Roger Clemens is a PERSON who Plays a SPORT called Baseball for a TEAM from New York called the Yankees. Taalee uses such Semantic Associations (another example: COMPANY participates in INDUSTRY) to add missing metadata to describe content more completely. HP 95
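The enrichment step described on this slide can be illustrated with a minimal Python sketch. It is not Taalee's implementation: the dictionaries `PLAYS_FOR` and `TEAM_LEAGUE` and the function `add_value_added_metadata` are hypothetical names, and the toy knowledge base holds only the player, team and league values that appear in this deck's examples.

```python
from typing import Dict, List

# Toy knowledge base (assumption: a real system would hold a far larger
# model; the entries below are the examples used in these slides).
PLAYS_FOR: Dict[str, str] = {
    "Jamal Anderson": "Atlanta Falcons",
    "Gary Sheffield": "Los Angeles Dodgers",
    "Roger Clemens": "New York Yankees",
}
TEAM_LEAGUE: Dict[str, str] = {
    "Atlanta Falcons": "NFL",
    "Los Angeles Dodgers": "National League",
}


def add_value_added_metadata(metadata: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return a copy of `metadata` with Teams and League inferred from Players."""
    enriched = {key: list(values) for key, values in metadata.items()}
    teams = enriched.setdefault("Teams", [])
    leagues = enriched.setdefault("League", [])
    for player in enriched.get("Players", []):
        team = PLAYS_FOR.get(player)
        if team is None:
            continue  # unknown player: nothing can be inferred
        if team not in teams:
            teams.append(team)
        league = TEAM_LEAGUE.get(team)
        if league is not None and league not in leagues:
            leagues.append(league)
    return enriched


def matches(metadata: Dict[str, List[str]], field: str, value: str) -> bool:
    """Keyword-style lookup: true only if `value` is stored under `field`."""
    return value in metadata.get(field, [])


# Metadata extracted explicitly from the asset: only the player is mentioned.
extracted = {"Players": ["Jamal Anderson"]}
enriched = add_value_added_metadata(extracted)
```

With only the extracted metadata, a search on Teams = "Atlanta Falcons" misses the story; after enrichment the same search finds it, which is the effect the guided demos on the following slides ask the reader to verify.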
  96. 96. Guided Demo for Value Added Metadata – Example One • Go to & search for Player = Jamal Anderson. • Click on the first result (titled “Week 3 Top10: Anderson TD Run”) and view the metadata on the following RMR page. • Here is what you see: Produced by: Posted Date: 9/20/2000 League : NFL Teams : Atlanta Falcons Players : Jamal Anderson • Now click on the button to play the asset (button marked “REAL”). • View the source HTML page that has the original story, and locate this story with the heading “Week 3 top 10: Anderson TD run”. • Verify that neither Team=Atlanta Falcons nor League=NFL was present in the source content. • Taalee attached this value-added metadata to this asset’s existing metadata so that a user searching for Atlanta Falcons will find this story on Jamal Anderson, who plays for the Atlanta Falcons. HP 96
  97. 97. Guided Demo for Value Added Metadata – Example Two • Go to & search for Player = Gary Sheffield. • Click on the first result (titled “I want out!”) and view the metadata on the following RMR page. • Here is what you see: Produced by: ESPN Posted Date: 3/03/2001 League : National League Teams : Los Angeles Dodgers Players : Gary Sheffield • Now click on the button to play the asset (button marked “REAL”). • View the source HTML page that has the original story, and locate this story with the heading “I want out!”. • Verify that neither Team=Los Angeles Dodgers nor League=National League was present in the source content. • Taalee attached this value-added metadata to this asset’s existing metadata so that a user searching for Los Angeles Dodgers will find this story on Gary Sheffield, who plays for the Los Angeles Dodgers. HP 97
  98. 98. Example 1 – Snapshots (“Jamal Anderson”) • Search for ‘Jamal Anderson’ in ‘Football’. • Click on first result for Jamal Anderson. • View metadata. Note that Team name and League name are also included in the metadata. • View the original source HTML page. Verify that the source page contains no mention of Team name and League name. They were Taalee’s value-additions to the metadata to facilitate easier search. HP 98
  99. 99. Example 2 – Snapshots (“Gary Sheffield”) • Search for ‘Gary Sheffield’ in ‘Baseball’. • Click on first result for Gary Sheffield. • View metadata. Note that Team name and League name are also included in the metadata. • View the original source HTML page. Verify that the source page contains no mention of Team name and League name. They were Taalee’s value-additions to the metadata to facilitate easier search. HP 99
  100. 100. Intelligent Content – Value-Added Metadata Some metadata are obtained explicitly from the asset. Others (not present in the asset) are added by Taalee using its semantic relationships. The asset is richly, fully described in the many ways the users chose to interact. [Diagram: Rich Media Sports Asset and its metadata] • Producer Name: Name of content provider that produced the asset • Posted Date: Date of asset posting – Extracted automatically • Sport Name: Name of sport • Player Names: Name of players mentioned explicitly in the asset – Extracted automatically • Team Name: Name of team for which player plays – Not mentioned explicitly in asset – Value-added using Taalee’s semantic relationships • League Name: Name of league to which the player’s team belongs – Not mentioned explicitly in asset – Value-added by Taalee’s processing based on semantic associations. Legend: an arrow from X to Y means Taalee uses X to add Y as value-added metadata to the asset. HP 100
  101. 101. Intelligent Content via Semantic Associations HP 101
  102. 102. Semantic Associations • Traditional search engines rely solely on (syntactic) keywords to find content. • They do not understand the meaning, context, or relationships of keywords. For example: a search engine may see that the phrase “Commerce One” occurs, but it does not know that Commerce One is a COMPANY which Participates in the Corporate, Professional & Financial Software INDUSTRY and COMPETES WITH Ariba. As a result, search engines cannot go beyond returning a list (or directory view) of what the user has asked for. Their ability to provide associated information is extremely limited, static, and difficult to scale. Taalee’s Semantic Content Model goes beyond indexing keywords and classifying assets to Understand and Associate all content it catalogs. HP 102
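A small Python sketch can make the difference between keyword lookup and association-aware lookup concrete. Everything here is illustrative: the `KnowledgeBase` class, the relation names `participates_in` and `competes_with`, and the story titles are assumptions made for this sketch, not part of Taalee's Semantic Content Model. Only the Commerce One / Ariba facts come from the slide.

```python
from collections import defaultdict
from typing import Dict, List


class KnowledgeBase:
    """Typed entities plus named relationships between them (toy model)."""

    def __init__(self) -> None:
        self._types: Dict[str, str] = {}
        self._edges = defaultdict(set)

    def add_entity(self, name: str, entity_type: str) -> None:
        self._types[name] = entity_type

    def relate(self, subject: str, relation: str, obj: str) -> None:
        self._edges[(subject, relation)].add(obj)

    def related(self, subject: str, relation: str) -> List[str]:
        return sorted(self._edges.get((subject, relation), ()))


kb = KnowledgeBase()
kb.add_entity("Commerce One", "COMPANY")
kb.add_entity("Ariba", "COMPANY")
kb.relate("Commerce One", "participates_in",
          "Corporate, Professional & Financial Software")
kb.relate("Commerce One", "competes_with", "Ariba")

# Hypothetical catalog; titles are placeholders, not real headlines.
stories = [
    {"title": "Story A (about Commerce One)", "companies": ["Commerce One"]},
    {"title": "Story B (about Ariba)", "companies": ["Ariba"]},
    {"title": "Story C (about an unrelated company)", "companies": ["Other Co"]},
]


def search_with_associations(kb: KnowledgeBase, stories: List[dict],
                             company: str) -> Dict[str, List[str]]:
    """Return what was asked for plus news on the company's competitors."""
    asked_for = [s["title"] for s in stories if company in s["companies"]]
    competitors = kb.related(company, "competes_with")
    competitor_news = [
        s["title"] for s in stories
        if any(c in s["companies"] for c in competitors)
    ]
    return {"asked_for": asked_for, "competitor_news": competitor_news}


result = search_with_associations(kb, stories, "Commerce One")
```

A keyword engine would return only the `asked_for` list; the `competitor_news` list is reachable only because the COMPETES WITH relationship is stored explicitly.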
  103. 103. Example (test on ) • Search for company ‘Commerce One’. • Links to news on companies that compete against Commerce One. • Links to news on companies Commerce One competes against (to view news on Ariba, click on the link for Ariba). • Crucial news on Commerce One’s competitors (Ariba) can be accessed easily and automatically. HP 103
  104. 104. Metadata centric Content Management Architecture (ASP/Enterprise hosted). [Diagram labels] Internal Source 1; Internal Source 2; External feeds/Web (e.g. Reuters); Extractor Agent 1; Extractor Agent 2; Extractor Agent 3; Research; World Model; Knowledge Base; Semantic Engine; Semantic Application; Dashboard; Story on Cisco; Story on Lucent; Taalee Metabase; Third-party Content Mgmt And Syndication; XCM-compliant metadata, XML or other format. Flow: (1) Cisco story from PW Source 1 passed on to add semantic associations. (2) Consults Knowledge Base for Cisco’s competition. (3) Returns result: Lucent is a competitor of Cisco. (4) Lucent story from external feeds picked for publishing as “semantically related” to Cisco story – passed on to Dashboard. HP 104
  105. 105. Semantic Associations supported by Taalee Semantic Engine. Intelligent Content = What You Asked for + What you need to know! [Diagram labels] Related Stock COMPANY Competition COMPANIES in News INDUSTRY with COMPANIES in Same or Competing PRODUCTS Related INDUSTRY Regulations Technology Impacting INDUSTRY Products EPA EPA or Filed By COMPANY Important to INDUSTRY Industry SEC or COMPANY News HP 105
  106. 106. Semantic Web Application Example: Financial Advisor Research Dashboard • Automatic Collation of semantically related digital media information from Multiple Sources • Research Inferred Automatically • Semantically Related News Not Specifically Asked For • Semantic Search/Personalization, etc. HP 106
  107. 107. A vision for future: Semantic Web, Complex Relationships and Knowledge Discovery, e.g., InfoQuilt project at LSDIS Lab, Univ. of Georgia
  108. 108. Beyond RDF – one proposal (cf: Ora Lassila) • Structural modeling obviously not enough: we need a “logic layer” on top of RDF; some type of description logic is a possibility • Exposing a wide variety of data sources as RDF is useful, particularly if we have logic/rules which allow us to draw inference from this data • RDF + DL = “Frame System for WWW” Source : HP 108
  109. 109. Semantic Web - next step in Web evolution • “A Web in which machine reasoning will be ubiquitous and devastatingly powerful.” [Berners-Lee] • “A place where the whim of a human being and the reasoning of a machine coexist in an ideal, powerful mixture.” [Berners-Lee] • “A semantic Web would permit more accurate and efficient Web searches, which are among the most important Web-based activities.” [Berners-Lee] • A personal definition: Semantic Web: The concept that Web-accessible content can be organized semantically, rather than through syntactic and structural methods. HP 109
  110. 110. What is DAML (DARPA Agent Markup Language)? • A proposal to create technologies that will enable software agents to dynamically identify and understand information sources, and to provide interoperability between agents in a semantic manner. • Based on RDF+XML • Agent-readable tags
  111. 111. DAML Example Source: ,4270,2432946,00.html
  112. 112. Three-layered Architecture of Semantic Web • Logical Layer: Formal Semantics and Reasoning Support – OIL, DAML-O • Schema Layer: Definition of Vocabulary – RDF Schema • Data Layer: Simple data model and syntax for metadata – RDF
  113. 113. OIL – as RDF Extension <rdfs:Class rdf:ID="herbivore"> <rdf:type rdf:resource=""/> <rdfs:subClassOf rdf:resource="#animal"/> <rdfs:subClassOf> <oil:NOT> <oil:hasOperand rdf:resource="#carnivore"/> </oil:NOT> </rdfs:subClassOf> </rdfs:Class>
  114. 114. DAML and OIL – Evolving towards Semantic Web. OIL Mission: OIL is a Web-based representation and inference layer for ontologies, which combines the widely used modeling primitives from frame-based languages with the formal semantics and reasoning services provided by description logics
  115. 115. Knowledge Discovery - Example • Earthquake Sources (USGS, NEIC) • Nuclear Test Sources (Oklahoma Observatory, etc.) • Nuclear Test May Cause Earthquakes: Is it really true?
  116. 116. Complex Relationships A nuclear test could have caused an earthquake if the earthquake occurred some time after the nuclear test was conducted and in a nearby region. NuclearTest Causes Earthquake <= dateDifference( NuclearTest.eventDate, Earthquake.eventDate ) < 30 AND distance( NuclearTest.latitude, NuclearTest.longitude, Earthquake.latitude, Earthquake.longitude ) < 10000
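The rule above can be read as a predicate over pairs of events. The following Python sketch is one possible reading, under stated assumptions rather than the InfoQuilt implementation: `dateDifference` is taken to be the number of days from the test to the earthquake (and required to be non-negative, following the "some time after" wording), `distance` is taken to be great-circle distance in miles computed with the haversine formula, and the `Event` class and function names are hypothetical.

```python
import math
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


@dataclass(frozen=True)
class Event:
    """A nuclear test or an earthquake: when and where it happened."""
    event_date: date
    latitude: float
    longitude: float


def distance_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp guards against rounding pushing sqrt(a) slightly above 1.
    return 2 * EARTH_RADIUS_MILES * math.asin(min(1.0, math.sqrt(a)))


def could_have_caused(test: Event, quake: Event,
                      max_days: int = 30, max_miles: float = 10000) -> bool:
    """NuclearTest Causes Earthquake, with the thresholds from the slide."""
    days_after = (quake.event_date - test.event_date).days
    if not 0 <= days_after < max_days:  # assumption: quake must not precede test
        return False
    return distance_miles(test.latitude, test.longitude,
                          quake.latitude, quake.longitude) < max_miles


def candidate_pairs(tests: List[Event], quakes: List[Event]) -> List[Tuple[Event, Event]]:
    """All (test, earthquake) pairs that satisfy the rule."""
    return [(t, q) for t in tests for q in quakes if could_have_caused(t, q)]
```

Note that the rule only selects candidate pairs for further study; satisfying it says nothing about actual causation, which is the question the knowledge-discovery slides go on to probe.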
  117. 117. Knowledge Discovery - Example • When was the first recorded nuclear test conducted? 1950 • Find the total number of earthquakes with a magnitude 5.8 or higher on the Richter scale per year starting from 1900. Result: Increase in number of earthquakes since 1945
  118. 118. Knowledge Discovery - Example… • For each group of earthquakes with magnitudes in the ranges 5.8-6, 6-7, 7-8, 8-9, and >9 on the Richter scale per year starting from 1900, find average number of earthquakes. Result: Number of earthquakes with magnitude > 7 almost constant. So nuclear tests probably only cause earthquakes with magnitude < 7
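The grouping query on this slide might be computed along the following lines. This is a sketch under assumptions, not the query the demo ran: the bins are treated as half-open intervals (so a magnitude of exactly 6.0 falls in "6-7" and exactly 9.0 falls in ">9"), the average is taken over every year in the requested range including years with no events, and the input format of (year, magnitude) tuples is invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

# (label, lower bound inclusive, upper bound exclusive)
BINS = [
    ("5.8-6", 5.8, 6.0),
    ("6-7", 6.0, 7.0),
    ("7-8", 7.0, 8.0),
    ("8-9", 8.0, 9.0),
    (">9", 9.0, float("inf")),
]


def magnitude_bin(magnitude: float) -> Optional[str]:
    """Return the label of the bin containing `magnitude`, or None if below 5.8."""
    for label, low, high in BINS:
        if low <= magnitude < high:
            return label
    return None


def average_per_year(quakes: Iterable[Tuple[int, float]],
                     start_year: int, end_year: int) -> Dict[str, float]:
    """Average yearly number of earthquakes in each magnitude bin."""
    counts: Counter = Counter()
    for year, magnitude in quakes:
        label = magnitude_bin(magnitude)
        if label is not None and start_year <= year <= end_year:
            counts[label] += 1
    n_years = end_year - start_year + 1
    return {label: counts[label] / n_years for label, _, _ in BINS}
```

Running the same function over the years before and after the first recorded test, and comparing the two results bin by bin, is one way to reproduce the kind of comparison the slide reports.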
  119. 119. Knowledge Discovery - Example… • Find pairs of nuclear tests and earthquakes such that the earthquake occurred within 30 days after the test was conducted and the test site lies within a radius of 10000 miles of the epicenter of the earthquake. Demo
  120. 120. Resources/ www.icestandard.org Meta Object Facility (MOF) Specification, Version 1.3, September 27, 1999: Metadata Interchange (XMI) Specification, Version 1.1, October 25, 1999: www.daml.org NEWSML: newsshowcase.reuters.com PRISM: www.vignette.com OIL: www.semanticweb.org VOICEXML: www.voicexml.org MPEG7: www.taalee.com Oingo:
  121. 121. Multimedia Data Management: Using Metadata to Integrate and Apply Digital Media, Amit Sheth and Wolfgang Klas, Eds., McGraw Hill, ISBN: 0-07-057735-8, 1998.