View stunning SlideShares in full-screen with the new iOS app!Introducing SlideShare for AndroidExplore all your favorite topics in the SlideShare appGet the SlideShare app to Save for Later — even offline
View stunning SlideShares in full-screen with the new Android app!View stunning SlideShares in full-screen with the new iOS app!
Related Entity Findingon the WebPeter MikaSenior Research ScientistYahoo! ResearchJoint work with B. Barla Cambazoglu and Roi Blanco
- 2 -Search is really fast, without necessarily being intelligent
- 3 -Why Semantic Search? Part I• Improvements in IR are harder and harder to come by– Machine learning using hundreds of features• Text-based features for matching• Graph-based features provide authority– Heavy investment in computational power, e.g. real-timeindexing and instant search• Remaining challenges are not computational, but inmodeling user cognition– Need a deeper understanding of the query, the content and/orthe world at large– Could Watson explain why the answer is Toronto?
- 8 -Why Semantic Search? Part II• The Semantic Web is here– Data• Large amounts of RDF data• Heterogeneous schemas• Diverse quality– End users• Not skilled in writing complexqueries (e.g. SPARQL)• Not familiar with the data• Novel applications– Complementing document search• Rich Snippets, related entities,direct answers– Other novel search tasks
- 9 -Semantic Web data• Linked Data– Data published as RDF documentslinked to other RDF documents and/orusing SPARQL end-points– Community effort to re-publish largepublic datasets (e.g. Dbpedia, opengovernment data)• RDFa and microdata– Data embedded inside HTML pages– Schema.org collaboration among Bing,Google, Yahoo and Yandex– Facebook Open Graph Protocol (OGP)
- 10 -Other novel applications• Aggregation of search results– e.g. price comparison across websites• Analysis and prediction– e.g. world temperature by 2020• Semantic profiling– Ontology-based modeling of user interests• Semantic log analysis– Linking query and navigation logs to ontologies• Task completion– e.g. booking a vacation using a combination of services• Conversational search– e.g. PARLANCE EU FP7 projectWeb usage miningwith Semantic AnalysisFri 3pmWeb usage miningwith Semantic AnalysisFri 3pm
- 13 -Semantic Search: a definition• Semantic search is a retrieval paradigm that– Makes use of the structure of the data or explicit schemas tounderstand user intent and the meaning of content– Exploits this understanding at some part of the search process• Emerging field of research– Exploiting Semantic Annotations in Information Retrieval (2008-2012)– Semantic Search (SemSearch) workshop series (2008-2011)– Entity-oriented search workshop (2010-2011)– Joint Intl. Workshop on Semantic and Entity-oriented Search (2012)– SIGIR 2012 tracks on Structured Data and Entities• Related fields:– XML retrieval, Keyword search in databases, NL retrieval
- 14 -Search is required in the presence of ambiguityQueryDataKeywordsKeywordsNLQuestionsNLQuestionsForm- / facet-based InputsForm- / facet-based InputsStructured Queries(SPARQL)Structured Queries(SPARQL)OWL ontologies withrich, formalsemanticsOWL ontologies withrich, formalsemanticsStructuredRDF dataStructuredRDF dataSemi-StructuredRDF dataSemi-StructuredRDF dataRDF dataembedded intext (RDFa)RDF dataembedded intext (RDFa)Ambiguities: interpretationAmbiguities: interpretation, extraction errors, data quality, confidence/trust
- 17 -Motivation• Some users are short on time– Need for direct answers– Query expansion, question-answering, information boxes, richresults…• Other users have time at their hand– Long term interests such as sports, celebrities, movies andmusic– Long running tasks such as travel planning
- 19 -Spark: related entity recommendations in web search• A search assistance tool for exploration• Recommend related entities given the user’s current query– Cf. Entity Search at SemSearch, TREC Entity Track• Ranking explicit relations in a Knowledge Base– Cf. TREC Related Entity Finding in LOD (REF-LOD) task• A previous version of the system live since 2010• van Zwol et al.: Faceted exploration of image search results. WWW2010: 961-970
- 27 -Entity graph challenges• Coverage of the query volume– New entities and entity types– Additional inference– International data– Aliases, e.g. jlo, big apple, thomas cruise mapother iv• Freshness– People query for a movie long before it’s released• Irrelevant entity and relation types– E.g. voice actors who co-acted in a movie, cities in a continent• Data quality– United States Senate career of Barack Obama is not a person– Andy Lau has never acted in Iron Man 3
- 29 -Feature extraction from text• Text sources– Query terms– Query sessions– Flickr tags– Tweets• Common representationInput tweet:Brad Pitt married to Angelina Jolie in Las VegasOutput event:Brad Pitt + Angelina JolieBrad Pitt + Las VegasAngelina Jolie + Las Vegas
- 30 -Features• Unary– Popularity features from text: probability, entropy, wiki idpopularity …– Graph features: PageRank on the entity graph, wikipedia, webgraph– Type features: entity type• Binary– Co-occurrence features from text: conditional probability, jointprobability …– Graph features: common neighbors …– Type features: relation type
- 31 -Feature extraction challenges• Efficiency of text tagging– Hadoop Map/Reduce• More features are not always better– Can lead to over-fitting without sufficient training data
- 33 -Model Learning• Training data created by editors (five grades)400 Brandi adriana lima Brad Pitt person Embarassing1397 David H. andy garcia Brad Pitt person Mostly Related3037 Jennifer benicio del toro Brad Pitt person Somewhat Related4615 Sarah burn after reading Brad Pitt person Excellent9853 Jennifer fight club movie Brad Pitt person Perfect• Join between the editorial data and the feature file• Trained a regression model using GBDT–Gradient Boosted Decision Trees• 10-fold cross validation optimizing NDCG and tuning•number of trees•number of nodes per tree
- 35 -Impact of training dataNumber of training instances (judged relations)
- 36 -Performance by query-entity type•High overall performance but some types are more difficult•Locations– Editors downgrade popular entities such as businessesNDCG by type of the query entity
- 37 -Model Learning challenges• Editorial preferences not necessarily coincide with usage– Users click a lot more on people than expected– Image bias?• Alternative: optimize for usage data– Clicks turned into labels or preferences– Size of the data is not a concern– Gains are computed from normalized CTR/COEC– See van Zwol et al. Ranking Entity Facets Based on User ClickFeedback. ICSC 2010: 192-199.couple of hundred entities and their facets we ﬁnd thatlinear combination of the conditional probabilities givest performance on the collected judgements using wqt = 2,= 0.5, and wf t = 1. However, the editorial data was notstantial enough to learn a ranking with GBDT.Click-through Rate versus Click over Expected ClickFrom the image search query logs, we collect the user clicka that is related to the facets. This allows us to compute theck-through rate (CTR) on a facet for a given entity that isected in a user query and for which the facets were shownhe user. Let clickse,f be the number of clicks on a facetty f show in relation to entity e, and viewse,f the numbertimes the facet f is shown to a user for a related entity e,n the probability of a click on a facet entity f for a giventy e can be modelled as ctre,f :ctre,f =clickse,fviewse,f(2)n Figure 3 the conditional click-through rate is shown forﬁrst ten positions. It shows the CTR per position for everyge view where one of the facets is clicked, aggregated overcoece,f =clPPp=1 viZhang and Jones  refer toexpected clicks, based on the deexpected clicks given the positioC. Gradient Boosted Decision TrStochastic gradient boosted decthe most widely used learning algtoday. Gradient tree boosting cosion model, utilizing decision trOne advantage over other learntrees in general is that the featare highly interpretable. GBDTdifferent loss functions can be uresearch presented here we usedour loss function. In related work,pairwise and ranking speciﬁc lowell at improving search relevancshallow decision trees, trees in son a randomly selected subset ofprone to over-ﬁtting . For theshown in the search enginehe ground truth for creatingset used by the gradientonal Probabilitiesof the facets search expe-unction rank(e, f) that isonal probabilities extracted⇥Pqs(f|e)+wf t ⇥Pf t (f, e)(1)e) are the conditional prob-he weights for the differentqt), query session (qs) andl judgements collected fortheir facets we ﬁnd thatditional probabilities givesudgements using wqt = 2,the editorial data was notng with GBDT.k over Expected Clickgs, we collect the user clickall entities. Observe that the CTR declines when the positionat which a facet is shown increases.We introduce a second click model, based on the notionof clicks over expected clicks (COEC). To allows us to dealwith the so called position bias – where facets appearing inlower positions are less likely to be clicked even if they arerelevant . This phenomenon isoften observed in Web searchand we adopt the COEC model proposed by Chapelle andZhang . In that model, we estimate ctrp as the aggregatedctr – over all queries and sessions – in position p for allpositions P. Let then clickse,f be the number of clicks ona facet entity f show in relation to entity e, and viewse,f pthenumber of times the facet f is shown to a user for a relatedentity e at position p. The probability of a click over expectedclick on a facet entity f for a given entity e can then bemodelled as coece,f :coece,f =clickse,fPPp=1 viewse,f p ⇥ ctrp(3)Zhang and Jones  refer to this method as clicks overexpected clicks, based on the denominator that includes theexpected clicks given the positions that the url appeared in.C. Gradient Boosted Decision TreesStochastic gradient boosted decision trees (GBDT) is one of
- 38 -EntitygraphDatapreprocessingFeatureextractionModellearningFeaturesourcesEditorialjudgementsDatapackRankingmodelRanking anddisambiguationEntitydataFeaturesRanking and Disambiguation
- 39 -Ranking and Disambiguation• We apply the ranking function offline to the data• Disambiguation– How many times a given wiki id was retrieved for queries containing the entityname?Brad Pitt Brad_Pitt 21158Brad Pitt Brad_Pitt_(boxer) 247XXX XXX_(movie) 1775XXX XXX_(Asia_album) 89XXX XXX_(ZZ_Top_album) 87XXX XXX_(Danny_Brown_album) 67– PageRank for disambiguating locations (wiki ids are not available)• Expansion to query patterns– Entity name + context, e.g. brad pitt actor
- 40 -Ranking and Disambiguation challenges• Disambiguation cases that are too close to call– Fargo Fargo_(film) 3969– Fargo Fargo,_North_Dakota 4578• Disambiguation across Wikipedia and other sources
- 41 -Evaluation #2: Side-by-side testing• Comparing two systems– A/B comparison, e.g. current system under development and productionsystem– Scale: A is better, B is better• Separate tests for relevance and image quality– Image quality can significantly influence user perceptions– Images can violate safe search rules• Classification of errors– Results: missing important results/contains irrelevant results, too few results,entities are not fresh, more/less diverse, should not have triggered– Images: bad photo choice, blurry, group shots, nude/racy etc.• Notes– Borderline, set one entities relate to the movie Psy but the query is most likelyabout Gangnam style– Blondie and Mickey Gilley are 70’s performers and do not belong on a list of60’s musicians.– There is absolutely no relation between Finland and California.
- 42 -Evaluation #3: Bucket testing• Also called online evaluation– Comparing against baseline version of the system– Baseline does not change during the test• Small % of search traffic redirected to test system, anothersmall % to the baseline system• Data collection over at least a week, looking for stat.significant differences that are also stable over time• Metrics in web search– Coverage and Click-through Rate (CTR)– Searches per browser-cookie (SPBC)– Other key metrics should not impacted negatively, e.g.Abandonment and retry rate, Daily Active Users (DAU),Revenue Per Search (RPS), etc.
- 43 -Coverage before and after the new system0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80DaysCoverageCoverage before SparkTrend before SparkCoverage after SparkTrend after SparkSpark is deployedin productionBefore release:Flat, lowerAfter release:Flat, higher
- 44 -Click-through rate (CTR) before and after the new systemBefore release:Graduallydegrading performancedue to lack of fresh dataAfter release:Learning effect:users are starting touse the tool again0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80DaysCTRCTR before SparkTrend before SparkCTR after SparkTrend after SparkSpark is deployedin production
- 45 -Summary• Spark– System for related entity recommendations• Knowledge base• Extraction of signals from query logs and other user-generatedcontent• Machine learned ranking• Evaluation• Other applications– Recommendations on topic-entity pages
- 46 -Future work• New query types– Queries with multiple entities• adele skyfall– Question-answering on keyword queries• brad pitt movies• brad pitt movies 2010• Extending coverage– Spark now live in CA, UK, AU, NZ, TW, HK, ES• Even fresher data– Stream processing of query log data• Data quality improvements• Online ranking with post-retrieval features
- 47 -The End• Many thanks to– Barla Cambazoglu and Roi Blanco (Barcelona)– Nicolas Torzec (US)– Libby Lin (Product Manager, US)– Search engineering (Taiwan)• Contact– email@example.com– @pmika