Schema 101: Why Metadata Matters, From a Search Engine Perspective. By: Barbara Starr Twitter: @BarbaraStarr Email: bstarr@Ontologica.us
Meta Information: ME
• Pursued a doctorate in Artificial Intelligence from South Africa in the 80s.
• Recruited to build intelligent/predictive trading systems on Wall Street
• Migrated to government-based contracts, several of which turned into real world products like
  – SIRI (PAL from DARPA)
  – WATSON (Acquaint - IBM Watson Labs was a team member)
• From the vantage of a semantic technologist, I keenly watched the evolution of the Semantic Web.
• “Shocked into the real world” when working as a consultant @ Overstock
• Today - Educator, Consultant, Developer.
My favorite author: Isaac Asimov. Favorite book: I, Robot. Favorite character: MULTIVAC.
Linkedin: http://www.linkedin.com/in/barbarastarr
Additional Metainformation: For the purpose of this talk, MY ROBOT same-as Artificially Intelligent Entity or Search Engine
SEARCH ENGINE POINT OF VIEW How can I exploit metadata or “semantic search”?
SEARCH ENGINE POINT OF VIEW I can directly extract information to enhance SERP display tiles. Searchmonkey 2008. RICH SNIPPETS 2009
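To make "directly extract information" concrete, here is a minimal sketch of how a crawler might pull microdata properties out of a page. It is an illustration only, not how any real search engine does it: the class `ItempropExtractor`, the function `extract_itemprops`, and the sample business are hypothetical, and the parser handles only flat items (no nested `itemscope`).

```python
from html.parser import HTMLParser


class ItempropExtractor(HTMLParser):
    """Collect itemprop name/value pairs from a flat microdata snippet.

    Hypothetical helper for illustration; nested items are not supported.
    """

    def __init__(self):
        super().__init__()
        self.properties = {}  # itemprop name -> extracted value
        self._name = None     # itemprop currently being read, if any
        self._tag = None      # tag that opened the current itemprop
        self._text = []       # text chunks seen inside that tag

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        name = attr.get("itemprop")
        if name is None:
            return
        if "content" in attr:
            # e.g. <meta itemprop="..." content="...">: value is the attribute
            self.properties[name] = attr["content"]
        else:
            # value is the text content; start buffering it
            self._name, self._tag, self._text = name, tag, []

    def handle_data(self, data):
        if self._name is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._name is not None and tag == self._tag:
            self.properties[self._name] = "".join(self._text).strip()
            self._name = self._tag = None


def extract_itemprops(html):
    """Return a dict of itemprop values found in the given HTML string."""
    parser = ItempropExtractor()
    parser.feed(html)
    parser.close()
    return parser.properties


# Made-up sample page fragment using the schema.org Brewery type.
SAMPLE = """
<div itemscope itemtype="http://schema.org/Brewery">
  <span itemprop="name">Example Brewing Co.</span>
  <meta itemprop="priceRange" content="$$">
  <span itemprop="telephone">555-0100</span>
</div>
"""

print(extract_itemprops(SAMPLE))
```

The extracted dictionary is the kind of raw material a results page could use to decorate a listing with a name, phone number, or price range.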
SEARCH ENGINE POINT OF VIEW I can search directly on consumed metadata!
SEARCH ENGINE POINT OF VIEW I can provide direct answers to queries by searching on consumed, verified and validated information
SEARCH ENGINE POINT OF VIEW I can even aggregate answers or deduce them (like a timeline of events)
SEARCH ENGINE POINT OF VIEW I can detect relevancy signals, i.e. what content to show to what audience. I can use it to assist in interpreting a user query. I can even use it in conjunction with machine learning techniques, e.g. to train other components. Penn Treebank tagset
SEARCH ENGINE POINT OF VIEW Really interesting in terms of exposing long tail content too. It makes things findable for me when pages are published with structured markup! I meant the beer brewer in Arizona
SEARCH ENGINE POINT OF VIEW I could really use this stuff. And it is like the Tower of Babel out there! Multiple conflicting vocabularies that I will have to align internally, and multiple syntax formats as well: Microdata, Microformats, RDFa, GoodRelations for e-commerce. I’m a Search Engine Robot. Prior to Schema.org
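The cost of aligning vocabularies internally can be sketched as a lookup table that maps each (vocabulary, property) pair to one canonical name. This is a toy illustration under stated assumptions: the `CANONICAL` table, the `align` function, and the choice of canonical names are hypothetical, and the table covers only a few well-known properties (hCard `fn` and `tel`, FOAF `name` and `phone`, GoodRelations `name`).

```python
# Hypothetical alignment table: (vocabulary, property) -> canonical property.
CANONICAL = {
    ("hcard", "fn"): "name",
    ("hcard", "tel"): "telephone",
    ("foaf", "name"): "name",
    ("foaf", "phone"): "telephone",
    ("goodrelations", "name"): "name",
}


def align(vocabulary, record):
    """Split a record into canonically named properties and unmapped leftovers."""
    aligned, unmapped = {}, {}
    for prop, value in record.items():
        canonical = CANONICAL.get((vocabulary, prop))
        if canonical is None:
            unmapped[prop] = value  # no known mapping; needs manual alignment
        else:
            aligned[canonical] = value
    return aligned, unmapped


aligned, unmapped = align(
    "hcard", {"fn": "Example Brewing Co.", "tel": "555-0100", "nickname": "EBC"}
)
print(aligned)
print(unmapped)
```

Every new vocabulary adds rows to the table and more leftovers to resolve by hand, which is the maintenance burden a single shared vocabulary avoids.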
SEARCH ENGINE POINT OF VIEW Time to get Serious!
What has been the history? RDFa exploded in 2012 (Source: Peter Mika, Yahoo). Five-fold increase between March 2009 and October 2010. Another five-fold increase between October 2010 and January 2012. [Chart: Percentage of URLs with embedded metadata in various formats]
Current state of metadata on the Web
• 31% of webpages, 5% of domains contain some metadata
  – Analysis of the Bing Crawl (US crawl, January, 2012)
  – RDFa is most common format
    • By URL: 25% RDFa, 7% microdata, 9% microformat
    • By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
  – Adoption is stronger among large publishers
    • Especially for RDFa and microdata
• See also
  – P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012
  – H. Mühleisen, C. Bizer. Web Data Commons - Extracting Structured Data from Two Large Web Corpora, LDOW 2012
What’s been the History: Linked Open Data exploded from 2007 thru 2010. [Figures: LOD Cloud, Oct 2007 and Nov 2007]
What’s been the History: Linked Open Data exploded from 2007 thru 2010. [Figures: LOD Cloud, Sept 2008 and March 2009]
What’s been the History: Linked Open Data exploded from 2007 thru 2010. [Figure: LOD Cloud as of September 2010. Legend: Media, Geographic, Publications, User-generated content, Government, Cross-domain, Life sciences]
Timeline of RDFa and Semantic Web Adoption (as of Semtech 2011): Inevitable passage of Semantic Web adoption, culminating in schema.org
SEARCH ENGINE POINT OF VIEW A Search Engine alliance has the power to MANDATE vocabulary and syntax! Align and consume many vocabularies that may not be of interest to search engines? Rather mandate vocabulary and syntax: microdata
SEARCH ENGINE POINT OF VIEW On the other hand, it is not wise to ignore standards bodies like W3C. No mandate on syntax
SEARCH ENGINE POINT OF VIEW Did I tell you I don’t like spam?
SEARCH ENGINE POINT OF VIEW Ensure your data feeds match information with the structured markup or “metadata” on your web pages. Make sure you are not cloaking by feeding one set of information to me and another to human users!
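One way to act on this advice is a small consistency check that compares a data feed record with the properties extracted from the matching web page. The sketch below is an assumption-laden illustration: `find_mismatches`, the field names, and the sample values are all hypothetical, and the comparison is a simple case- and whitespace-insensitive string match.

```python
def _norm(value):
    """Normalize a value for comparison; None stays None (property missing)."""
    if value is None:
        return None
    return str(value).strip().casefold()


def find_mismatches(feed_record, page_markup):
    """Return {property: (feed value, page value)} for properties that differ."""
    mismatches = {}
    for key in sorted(set(feed_record) | set(page_markup)):
        feed_value = feed_record.get(key)
        page_value = page_markup.get(key)
        if _norm(feed_value) != _norm(page_value):
            mismatches[key] = (feed_value, page_value)
    return mismatches


# Made-up example: the feed and the page disagree on price,
# and the page markup omits availability.
feed = {"name": "Arizona Pale Ale", "price": "9.99", "availability": "InStock"}
page = {"name": "arizona pale ale ", "price": "12.99"}

for prop, (in_feed, on_page) in find_mismatches(feed, page).items():
    print(f"{prop}: feed={in_feed!r} page={on_page!r}")
```

Running a check like this before publishing a feed catches accidental divergence between what the robot is fed and what human users see.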
SEARCH ENGINE POINT OF VIEW Serving RELEVANT ANSWERS is IMPERATIVE & central to my very being!
SEARCH ENGINE POINT OF VIEW Adding context in search verticals really helps me serve up relevant information (seriously increases my recall), as does geospatial information. Google’s “Search Verticals”: Notice any correlations? I would advise you to! Consumed information - Structured Data Dashboard
SEARCH ENGINE POINT OF VIEW “Amazing fact: same amount of computing to answer one Google Search query as all the computing done -- in flight and on the ground -- for the entire Apollo program!” I also have a pretty good understanding of big data and web intelligence so I can leverage them! Oh, and be sure to check out Moore’s law. [Image: SIRI]
SEARCH ENGINE POINT OF VIEW I can combine it with computer vision techniques. I can leverage metadata for better image search. I can enhance the user’s shopping experience. [Image: SIRI]
SEARCH ENGINE POINT OF VIEW Symbolic reasoning vs stochastic reasoning (the latter is more like NLP or PageRank). Know rather than recognize? INTRODUCING THE KNOWLEDGE GRAPH
SEARCH ENGINE POINT OF VIEW And if you thought the knowledge graph was cool, check out the knowledge carousel! Talk of increase in screen real estate and CTR?
SEARCH ENGINE POINT OF VIEW Thank you for your time! And just a by-the-by, this technology is still in its nascent stages. Can you imagine what I will be able to do soon? Resources to help you! Make sure to use them wisely!
Resources at this point in time: GoodRelations: resources, generators, validators, more, ….
Other Semantic Web Resources
Caveat: Some training may be required for some of the tools
• OpenCalais – Can extract information about people, places and things
• AlchemyAPI – named entity extraction, topic recognition, keyword tagging, more ….
• Cogito – Expert System
• Franz Inc. – Gruff
• Many More….
For more info contact: Barbara Starr Twitter: @BarbaraStarr Email: bstarr@Ontologica.us Linkedin: http://www.linkedin.com/in/barbarastarr