Making things findable

Making things findablePeter Mika Researcher and Data ArchitectYahoo! Inc.

This was planned to be a presentation (only) about Semantic Search…But we are celebrating!

Scientific American articleSemantic Web Working Symposium at StanfordSemantic Web standardization beginsEU funding! DAML funding!The Semantic Web starts a career… and so am I…Anno 2001: The birth

Anno 2004-2006: Reality sets in Agents? Semantic Web Services? Common sense? Engineers are not logiciansHumans will have to do (most of) the workTwo separate visionsNo one actually invests… other than the EUWhere are the ontologies?Bad reputation for SW researchThe Semantic Web is looking for a job… and so am I 

Anno 2007: A second chanceData first, schema second…logic thirdLinked Data…. billions of triplesA sense of communitySome ontologiesWe get the standards we needStartups, tool vendors appearSemTechThe Semantic Web is in business… and so am I!

Search is really fast, without necessarily being intelligent

Why Semantic Search? Part IImprovements in IR are harder and harder to come byMachine learning using hundreds of featuresText-based features for matchingGraph-based features provide authorityHeavy investment in computational power, e.g. real-time indexing and instant searchRemaining challenges are not computational, but in modeling user cognitionNeed a deeper understanding of the query, the content and/or the world at largeCould Watson explain why the answer is Toronto?

Poorly solved information needsMultiple interpretationsparishiltonLong tail queriesgeorge bush (and I mean the beer brewer in Arizona)Multimedia searchparishilton sexyImprecise or overly precise searches jimhendlerpictures of strong adventures peopleSearches for descriptionscountries in africa32 year old computer scientist living in barcelonareliable digital camera under 300 dollarsMany of these queries would not be asked by users, who learned over time what search technology can and can not do.

Dealing with sparse collectionsNote: don’t solve the sparsity problem where it doesn’t exist

Contextual Search: content-based recommendationsHovering over an underlined phrase triggers a search for related news items.

Contextual Search: personalizationMachine Learning based ‘search’ algorithm selects the main story and the three alternate stories based on the users demographics (age, gender etc.) and previous behavior. Display advertizing is a similar top-1 search problem on the collection of advertisements.

Contextual Search: new devicesShow related contentConnect to friends watching the same

Aggregation across different dimensionsHyperlocal: showing content from across Yahoo that is relevant to a particular neighbourhood.

Why Semantic Search? Part IIThe Semantic Web is now a realityBillions of triplesThousands of schemasVarying qualityEnd users Keyword queries, not SPARQLSearching data instead or in addition to searching documentsProviding direct answers (possibly with explanations)Support for novel search tasks

Information box with content from and links to Yahoo! TravelDirect answers in searchPoints of interest in Vienna, AustriaProducts from Yahoo! ShoppingSince Aug, 2010, ‘regular’ search results are ‘Powered by Bing’

Novel search tasksAggregation of search resultse.g. price comparison across websitesAnalysis and predictione.g. world temperature by 2020Semantic profilingrecommendations based on particular interestsSemantic log analysisunderstanding user behavior in terms of objects Support for complex tasks (search apps)e.g. booking a vacation using a combination of services

Why Semantic Search? Part IIIThere is a modelPublishers are (increasingly) interested in making their content searchable, linkable and easier to aggregate and reuseSo that social media sites and search engines can expose their content to the right users, in a rich and attractive formThis is a about creating an ecosystem… More advanced in some domains In others, we still live in the tyranny of rate-limited APIs, licenses etc.

Example: rNewsRDFa vocabulary for news articlesEasier to implement than NewsMLEasier to consume for news search and other readers, aggregatorsUnder development at the IPTCMarch: v0.1 approvedFinal version by Sept

Example: Facebook’s Like and the Open Graph ProtocolThe ‘Like’ button provides publishers with a way to promote their content on Facebook and build communities Shows up in profiles and news feedSite owners can later reach users who have liked an objectFacebook Graph API allows 3rd party developers to access the data Open Graph Protocol is an RDFa-based format that allows to describe the object that the user ‘Likes’

Example: Facebook’s Open Graph ProtocolRDF vocabulary to be used in conjunction with RDFaSimplify the work of developers by restricting the freedom in RDFaActivities, Businesses, Groups, Organizations, People, Places, Products and EntertainmentOnly HTML <head> acceptedhttp://opengraphprotocol.org/<html xmlns:og="http://opengraphprotocol.org/schema/"> <head> <title>The Rock (1996)</title> <meta property="og:title" content="The Rock" /> <meta property="og:type" content="movie" /> <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" /> <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> …</head> ...

Semantic Search: a definition Semantic search is a retrieval paradigm thatMakes use of the structure of the data or explicit schemas to understand user intent and the meaning of contentExploits this understanding at some part of the search processCombination of document and data retrievalDocuments with metadataMetadata may be embeddedinside the documentI’m looking for documents that mention countries in Africa.Data retrievalStructured data, but searchable text fieldsI’m looking for directors, who have directed movies where the synopsis mentions dinosaurs.

Semantics at every step of the IR processθ(q,d)“bla”blablablablablablaThe IR engineThe Webbla bla bla?Query interpretationq=“bla” * 3Document processingIndexingRankingblablaResult presentationbla

Data on the WebData on the Web as a complement to professional data providersMost web pages are generated from databases, but the data in not always directly accessibleAPIs offer limited views over dataThe structure and semantics (meaning) of the data is not directly accessible to search enginesTwo solutionsExtraction using Information Extraction (IE) techniques (implicit metadata)Relying on publishers to expose structured data using standard Semantic Web formats (explicit metadata)

Information Extraction methodsNatural Language ProcessingExtraction of triplesSuchanek et al. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW, 2007.Wu and Weld. Autonomously Semantifying Wikipedia, CIKM 2007.Filling web forms automatically (form-filling)Madhavan et al. Google's Deep-Web Crawl. VLDB 2008Extraction from HTML tablesCafarella et al. WebTables: Exploring the Power of Tables on the Web. VLDB 2008Wrapper inductionKushmerick et al. Wrapper Induction for Information ExtractionText extraction. IJCAI 2007

Semantic WebSharing data across the WebPublish information in standard formats (RDF, RDFa)Share the meaning using powerful, logic-based languages (OWL, RIF)Query using standard languages and protocols (HTTP, SPARQL)Two main forms of publishingLinked DataData published as RDF documents linked to other RDF documents and/or using SPARQL end-pointsCommunity effort to re-publish large public datasets (e.g. Dbpedia, open government data)RDFaData embedded inside HTML pagesRecommended for site owners by Yahoo, Google, Facebook

Linked Data: interlinked RDF documentsFriend-of-a-Friend ontologyRoi’s homepagetypeexample:roifoaf:Personname“Roi Blanco”sameAsYahootypeworksWithexample:roi2example:peteremail“pmika@yahoo-inc.com”

…<p typeof=”foaf:Person" about="http://example.org/roi"> <span property=”foaf:name”>Roi Blanco</span>. <a rel=”owl:sameAs" href="http://research.yahoo.com/roi"> Roi Blanco </a>. You can contact him at <a rel=”foaf:mbox" href="mailto:roi@yahoo-inc.com"> via email </a>. </p> ... RDFa: metadata embedded in HTMLRoi’s homepage

Crawling the Semantic WebLinked DataSimilar to HTML crawling, but the the crawler needs to parse RDF/XML (and others) to extract URIs to be crawledSemantic Sitemap/VOID descriptionsRDFaSame as HTML crawling, but data is extracted after crawlingMika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010.SPARQL endpointsEndpoints are not linked, need to be discovered by other meansSemantic Sitemap/VOID descriptions

Data fusionOntology matchingWidely studied in Semantic Web research, see e.g. list of publications at ontologymatching.orgUnfortunately, not much of it is applicable in a Web context due to the quality of ontologiesEntity resolutionLogic-based approaches in the Semantic WebStudied as record linkage in the database literature Machine learning based approaches, focusing on attributesGraph-based approaches, see e.g. the work of Lisa Getoor are applicable to RDF dataImprovements over only attribute based matchingBlendingMerging objects that represent the same real world entity and reconciling information from multiple sources

Data quality assessment and curationHeterogeneity, quality of data is an even larger issueQuality ranges from well-curated data sets (e.g. Freebase) to microformats In the worst of cases, the data becomes a graph of wordsShort amounts of text: prone to mistakes in data entry or extractionExample: mistake in a phone number or state codeQuality assessment and data curationQuality varies from data created by experts to user-generated contentAutomated data validationAgainst known-good data or using triangulationValidation against the ontology or using probabilistic modelsData validation by trained professionals or crowdsourcingSampling data for evaluationCuration based on user feedback

Query interpretation in search Provide a higher level representation of queries in some conceptual space Ideally, the same space in which documents are representedLimited user involvement in the case document retrievalExamples: search assist, facetsInterpretation happens before the query is executedFederation: determine where to send the queryExample: show business listings from Yahoo! Local for local queriesRanking featureBlend multiple possible interpretations of the same queryDeals with the sparseness of query streams88% of unique queries are singleton queries (Baeza et al)Spell correction (“we have also included results for…”), stemming

Query interpretation in Semantic Search“Snap to grid”Using the ontologies, a summary of the data or the whole data we find the most likely structured query matching the user inputExample: “starbucks nyc” -> company name:”Starbucks” in location:”New York City”Larger user involvementGuiding the user in constructing the query Example: Freebase SuggestDisplaying back the interpretation of the queryExample: TrueKnowledge

IndexingSearch requires matching and rankingMatching selects a subset of the elements to be scoredThe goal of indexing is to speed up matchingRetrieval needs to be performed in millisecondsWithout an index, retrieval would require streaming through the collectionThe type of index depends on the query language to supportSPARQL is a highly-expressive SQL-like query language for expertsDB-style indexingEnd-users are accustomed to keyword queries with very limited structure (see Pound et al. WWW2010)IR-style indexing

IR-style indexingIndex data as textCreate virtual documents from dataOne virtual document per subgraph, resource or tripletypically: resourceKey differences to Text RetrievalRDF data is structuredMinimally, queries on property values are requiredMapReduce is an ideal model for building inverted indicesMap creates (term, {doc1}) pairsReduce collects all docs for the same term: (term, {doc1, doc2…}Sub-indices are merged separatelyTerm-partitioned indicesPeter Mika. Distributed Indexing for Semantic Search, SemSearch 2010.

Horizontal index structureTwo fields (indices): one for terms, one for propertiesFor each term, store the property on the same position in the property indexPositions are required even without phrase queriesQuery engine needs to support the alignment operator✓ Dictionary is number of unique terms + number of propertiesOccurrences is number of tokens * 2Vertical index structureOne field (index) per propertyPositions are not requiredBut useful for phrase queriesQuery engine needs to support fieldsDictionary is number of unique terms

Occurrences is number of tokens✗ Number of fields is a problem for merging, query performance

RankingPreviously, expert applications using specialized datasetsQueries given as logical formulas or highly selective DB-style queriesExpert users Small, high quality datasetsPossible to give a precise answer (question-answering)Increasingly, end-user applications using Web dataUsing keyword queries at least as a starting pointNon-expert usersLarge datasets with potential mistakesNot possible to give precise answers, only to provide relevant answers

Ranking methodsThe unit of retrieval is either an object or a sub-graphThe sub-graph induced by matching keywords in the query Ranking methods from Information RetrievalTF-IDF, BM25, probabilistic methods (language models)Methods such as BM25F allow weighting by predicateMachine learning is used to tune parameters and incorporate query-independent (static) featuresExample: authority scores of datasets computed using PageRank (Harth et al. ISWC 2009)Additional topicsRelevance feedback and personalizationClick-based rankingDe-duplication Diversification

EvaluationHarry Halpin, Daniel Herzig, Peter Mika, Jeff Pound, Henry Thompson, Roi Blanco, Thanh Tran Duc

Semantic Search evaluation at SemSearch 2010 and 2011Evaluation is a critical component in developing IR systemsKeyword search over RDF dataEntity SearchQueries where the user is looking for a single real world objectPound et al. Ad-hoc Object Retrieval in the Web of Data, WWW 2010.List search (new in 2011)Queries where the user is looking for a class of objectsFocus on relevance, not efficiencyReal queries and real dataYahoo! and Microsoft query logsBillion Triples Challenge 2009 datasetTREC style evaluation Focusing on ranking, not question answeringEvaluated using Amazon’s Mechanical Turk

Hosted by Yahoo! Labs at semsearch.yahoo.com

Assessment with Amazon Mechanical TurkEvaluation using non-expert judgesPaid $0.2 per 12 resultsTypically done in 1-2 minutes$6-$12 an hourSponsored by the European SEALS projectEach result is evaluated by 5 workersWorkers are free to choose how many tasks they doMakes agreement difficult to computeNumber of tasks completed per worker (2010)

Catching the bad guysPayment can be rejected for workers who try to game the systemAn explanation is commonly expected, though cheaters rarely complainWe opted to mix control questions into the real resultsGold-win cases that are known to be perfectGold-loose cases that are known to be badMetricsAvg. and std. dev on gold-win and gold-loose resultsTime to complete

Results5-6 teams participated in both 2010 and 2011Web queries and web data… difficult task!The Semantic Web is not (necessarily) all DBpediaThough a search engine on Dbpedia only would have done wellComputing agreement is difficultEach judge evaluates a different number of itemsFollow up experiments validated the Mechanical Turk approachBlanco et al. Repeatable and Reliable Search System Evaluation using Crowd-Sourcing, SIGIR2011Result rendering is importantInfluences the perception of relevanceNo ‘canonical’ result rendering in Semantic Web searchAssessments are made publicMeasure your own system against the best of the field

Next stepsNew tasksMore complex queries (but finding them is difficult)Retrieval on RDFa dataAggregated searchRanking properties of objectsAchieve repeatabilitySimplify our process and publish our toolsAutomate as much as possible… except the Turks ;)Positioning compared to other evaluation campaignsTREC Entity TrackQuestion Answering over Linked DataSEALS campaignsJoin the discussion at semsearcheval@yahoogroups.com

Search InterfaceSemantic Search brings improvements inSnippet generationAdaptive and interactive presentationPresentation adapts to the kind of query and results presentedObject results can be actionable, e.g. buy this productUser can provide feedback on particular objects or data sourcesAggregated searchGrouping similar items, summarizing results in various waysFiltering (facets), possibly across different dimensionsQuery and task templatesHelp the user to fulfill the task by placing the query in a task context

Snippet generation using metadataYahoo displays enriched search results for pages that contain microformat or RDFa markup using recognized ontologiesDisplaying data, images, videoExample: GoodRelations for productsEnhanced results also appear for sites from which we extract information ourselvesAlso used for generating facets that can be used to restrict search results by object typeExample: “Shopping sites” facet for productsDocumentation and validator for developershttp://developer.search.yahoo.comFormerly: SearchMonkey allowed developers to customize the result presentation and create new ones for any object typeHaas et al. Enhanced Results in Web Search. SIGIR 2011

Example: Yahoo! Enhanced ResultsEnhanced result with deep links, rating, address.

Example: Yahoo! Vertical Intent SearchRelated actors and movies

Making things findable

More Related Content

What's hot

Similar to Making things findable

More from Peter Mika

Recently uploaded

Making things findable

Editor's Notes