Schema 101, Why the New Metadata Matters. "From a Search Engine Perspective"
SMX East 2012

Published in: Technology


  1. Schema 101: Why Metadata Matters, From a Search Engine Perspective. By: Barbara Starr Twitter: @BarbaraStarr Email:
  2. Meta Information: ME • Pursued a doctorate in Artificial Intelligence from South Africa in the 80s. • Recruited to build intelligent/predictive trading systems on Wall Street • Migrated to government-based contracts, several of which turned into real world products like – SIRI (PAL from DARPA) – WATSON (Acquaint - IBM Watson Labs was a team member) • From the vantage of a semantic technologist, I keenly watched the evolution of the Semantic Web. • “Shocked into the real world” when working as a consultant @ Overstock • Today - Educator, Consultant, Developer. My favorite author: Isaac Asimov. Favorite book: I, Robot. Favorite character: MULTIVAC.
  3. Additional Metainformation: For the purpose of this talk, MY ROBOT same-as Artificially Intelligent Entity or Search Engine
  4. SEARCH ENGINE POINT OF VIEW How can I exploit metadata or “semantic search”?
  5. SEARCH ENGINE POINT OF VIEW I can directly extract information to enhance SERP displays (tiles). Searchmonkey 2008, RICH SNIPPETS 2009
  6. SEARCH ENGINE POINT OF VIEW I can search directly on consumed metadata!
  7. SEARCH ENGINE POINT OF VIEW I can provide direct answers to queries by searching on consumed, verified and validated information
  8. SEARCH ENGINE POINT OF VIEW I can even aggregate answers or deduce them (like a timeline of events)
  9. SEARCH ENGINE POINT OF VIEW I can detect relevancy signals: i.e. what content to show to what audience. I can use it to assist in interpreting a user query. I can even use it in conjunction with machine learning techniques, e.g. to train other components. (Penn Treebank tagset)
  10. SEARCH ENGINE POINT OF VIEW Really interesting in terms of exposing long tail content too. It makes things findable for me when pages are published with structured markup! I meant the beer brewer in Arizona
  11. SEARCH ENGINE POINT OF VIEW I could really use this stuff. And it is like the Tower of Babel out there! Multiple conflicting vocabularies that I will have to align internally and multiple syntax formats as well. Microdata, Microformats, RDFa, Goodrelations for e-commerce. I’m a Search Engine Robot. Prior to
  12. SEARCH ENGINE POINT OF VIEW Time to get Serious!
  13. What has been the history? RDFa exploded in 2012 – Source: Peter Mika - Yahoo. Five-fold increase between March 2009 and October 2010. Another five-fold increase between October 2010 and January 2012. [Chart: percentage of URLs with embedded metadata in various formats]
  14. Current state of metadata on the Web: • 31% of webpages, 5% of domains contain some metadata – Analysis of the Bing Crawl (US crawl, January, 2012) – RDFa is most common format • By URL: 25% RDFa, 7% microdata, 9% microformat • By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat – Adoption is stronger among large publishers • Especially for RDFa and microdata • See also – P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012 – H. Mühleisen, C. Bizer. Web Data Commons - Extracting Structured Data from Two Large Web Corpora, LDOW 2012
  15. What’s been the History: Linked Open Data exploded from 2007 thru 2010 [Figures: LOD Cloud, Oct 2007 and Nov 2007]
  16. What’s been the History: Linked Open Data exploded from 2007 thru 2010 [Figures: LOD Cloud, Sept 2008 and March 2009]
  17. What’s been the History: Linked Open Data exploded from 2007 thru 2010 [Figure: LOD Cloud as of September 2010; dataset categories in the legend: Media, Geographic, Publications, User-generated content, Government, Cross-domain, Life sciences]
  18. Timeline of RDFa and Semantic Web Adoption. As of Semtech 2011: Inevitable passage of Semantic Web adoption – culminating in
  19. SEARCH ENGINE POINT OF VIEW A Search Engine alliance has the power to MANDATE vocabulary and syntax! Align and consume many vocabularies that may not be of interest to search engines? Rather mandate vocabulary and syntax - microdata
  20. Sample portion
  21. SEARCH ENGINE POINT OF VIEW On the other hand – not wise to ignore standards bodies like W3C. No mandate on syntax
  22. SEARCH ENGINE POINT OF VIEW Did I tell you I don’t like spam?
  23. SEARCH ENGINE POINT OF VIEW Ensure your data feeds match information with the structured markup or “metadata” on your web pages. Make sure you are not cloaking by feeding one set of information to me and another to human users!
  24. SEARCH ENGINE POINT OF VIEW Serving RELEVANT ANSWERS is IMPERATIVE! & central to my very being!
  27. SEARCH ENGINE POINT OF VIEW Adding context in search verticals really helps me serve up relevant information (Seriously increases my recall), as does geospatial information. Google’s “SearchVerticals”: Notice any correlations? I would advise you to! Consumed information - Structured Data Dashboard
  28. SEARCH ENGINE POINT OF VIEW “Amazing fact: same amount of computing to answer one Google Search query as all the computing done -- in flight and on the ground -- for the entire Apollo program!” I also have a pretty good understanding of big data and web intelligence so I can leverage them! SIRI. OH! And be sure to check out Moore’s law
  29. SEARCH ENGINE POINT OF VIEW I can combine it with computer vision techniques. I can leverage metadata for better image search. SIRI. I can enhance the user’s shopping experience.
  30. SEARCH ENGINE POINT OF VIEW Symbolic reasoning vs stochastic reasoning (Latter is more like NLP or page rank). Know rather than Recognize? INTRODUCING THE KNOWLEDGE GRAPH
  31. SEARCH ENGINE POINT OF VIEW And if you thought the knowledge graph was cool, check out the knowledge carousel! Talk of increase in screen real estate and CTR?
  32. SEARCH ENGINE POINT OF VIEW Thank you for your time! And just as a by-the-by, this technology is still in its nascent stages. Can you imagine what I will be able to do soon? Resources to help you! Make sure to use them wisely!
  33. Resources at this point in time. Caveat: Some training may be required for some of the tools. Programming Languages: JavaScript: Microdatajs, Live microdata; Php: Microdataphp; Ruby: RDF Microdata; RDF Lib plugin; Perl; Ruby: RDF Microdata Gem, Mida; Java: Sindice any23 library. Form Based tools: Schema Creator, Microdata generator. Publishing Platforms: Drupal, Joomla, Wordpress (about 7 of them). Standalone tools: Virtuoso, Web.instadata, Topbraid Composer. Editors: Topbraid Composer, Protege. Validators, Testers and More: Check.rdfa.info, Sindice Inspector, Rich Snippets Testing Tool, Bing Validator, Structured data Linter, Online Parser?viewer and RSS generator, Validator.nu, Google Structured Data Tester
  34. Resources at this point in time. Goodrelations: Resources, generators, validators, more, ….
  35. Resources From the mouth of
  36. Franz new tool: Soon to be released for SEO
  37. Other Semantic Web Resources. Caveat: Some training may be required for some of the tools. OpenCalais – Can extract information about people, places and things. AlchemyAPI – named entity extraction, topic recognition, keyword tagging, more …. Cogito – Expert System. Franz Inc. – Gruff. Many More…. For more info contact: Barbara Starr Twitter: @BarbaraStarr Email: bstarr@Ontologica.us Linkedin:
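
Slides 5, 10 and 23 describe a search engine extracting structured markup from a page and expecting data feeds to agree with it. The sketch below is one rough illustration of that idea and is not taken from the talk. The page fragment, the business name, the feed record and the helper names (ItempropCollector, extract_props, feed_mismatches) are all hypothetical, and the parser is a deliberate simplification of the real microdata extraction algorithm: it flattens nested items and reads only element text.

```python
from html.parser import HTMLParser

# Hypothetical page fragment marked up with schema.org microdata.
PAGE = """
<div itemscope itemtype="http://schema.org/Brewery">
  <span itemprop="name">Example Desert Brewing</span>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="addressLocality">Tucson</span>,
    <span itemprop="addressRegion">AZ</span>
  </div>
</div>
"""

# Elements with no closing tag; they are skipped so the stack stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ItempropCollector(HTMLParser):
    """Collect the text of elements that carry an itemprop attribute.

    Simplification (assumption): nested items are flattened into one dict and
    only text content is read, not href/src/content attribute values.
    """

    def __init__(self):
        super().__init__()
        self.props = {}
        self._stack = []  # one entry per open element: itemprop name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        # An element that opens a nested item has no text value of its own.
        name = None if "itemscope" in attrs else attrs.get("itemprop")
        self._stack.append(name)
        if name:
            self.props[name] = ""

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and self._stack[-1]:
            self.props[self._stack[-1]] += data


def extract_props(html):
    """Return a flat {itemprop: text} dict for the given HTML."""
    collector = ItempropCollector()
    collector.feed(html)
    collector.close()
    return {key: value.strip() for key, value in collector.props.items()}


def feed_mismatches(feed_record, page_props):
    """Return feed fields whose value differs from the on-page markup."""
    return {
        key: (value, page_props.get(key))
        for key, value in feed_record.items()
        if page_props.get(key) != value
    }


page_props = extract_props(PAGE)
# Hypothetical feed record that disagrees with the page on the region.
feed_record = {"name": "Example Desert Brewing", "addressRegion": "CA"}
mismatches = feed_mismatches(feed_record, page_props)
print(page_props)
print(mismatches)
```

A mismatch such as the region above is the kind of feed-versus-page disagreement slide 23 warns about. For real pages, the validators and libraries listed on slide 33 are a better fit than a hand-rolled parser.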