Related Entity Finding on the Web
Keynote at the Web of Linked Entities (WoLE) 2013 workshop.


  • This is how a human sees the world.
  • This is how a machine sees the world… Machines are not ‘intelligent’ and cannot ‘read’: they see only a string of symbols and try to match the user's input against that stream.
  • In fact, some of these searches are so hard that users don't even try them anymore.
  • With ads, the situation is even worse due to the sparsity problem. Note how poor the ads are…
  • Semantic search can be seen as a retrieval paradigm centered on the use of semantics: a system that incorporates the semantics entailed by the query and/or the resources into the matching process essentially performs semantic search.

Related Entity Finding on the Web: Presentation Transcript

  • Related Entity Finding on the Web. Peter Mika, Senior Research Scientist, Yahoo! Research. Joint work with B. Barla Cambazoglu and Roi Blanco.
  • Search is really fast, without necessarily being intelligent.
  • Why Semantic Search? Part I
    – Improvements in IR are harder and harder to come by: machine learning using hundreds of features (text-based features for matching, graph-based features providing authority) and heavy investment in computational power, e.g. real-time indexing and instant search.
    – The remaining challenges are not computational, but in modeling user cognition: we need a deeper understanding of the query, the content and/or the world at large. Could Watson explain why the answer is Toronto?
  • What it’s like to be a machine? (Roi Blanco)
  • - 5 -What it’s like to be a machine?✜Θ♬♬ţğ✜Θ♬♬ţğ√∞®ÇĤĪ✜★♬☐✓✓ţğ★✜✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫Γ≠=⅚©§★✓♪ΒΓΕ℠✖Γ ±♫⅜ ⏎↵⏏☐ģğğğμλκσςτ⏎⌥°¶§ΥΦΦΦ✗✕☐
  • Ambiguity
  • Why Semantic Search? Part II
    – The Semantic Web is here.
    – Data: large amounts of RDF data, heterogeneous schemas, diverse quality.
    – End users: not skilled in writing complex queries (e.g. SPARQL), not familiar with the data.
    – Novel applications: complementing document search (Rich Snippets, related entities, direct answers) and other novel search tasks.
  • Semantic Web data
    – Linked Data: data published as RDF documents linked to other RDF documents and/or exposed via SPARQL end-points; a community effort to re-publish large public datasets (e.g. DBpedia, open government data).
    – RDFa and microdata: data embedded inside HTML pages; the Schema.org collaboration among Bing, Google, Yahoo and Yandex; the Facebook Open Graph Protocol (OGP).
  • Other novel applications
    – Aggregation of search results, e.g. price comparison across websites.
    – Analysis and prediction, e.g. world temperature by 2020.
    – Semantic profiling: ontology-based modeling of user interests.
    – Semantic log analysis: linking query and navigation logs to ontologies.
    – Task completion, e.g. booking a vacation using a combination of services.
    – Conversational search, e.g. the PARLANCE EU FP7 project.
    – See also: "Web usage mining with Semantic Analysis", Fri 3pm.
  • Interactive search and task completion
  • Semantic Search
  • Semantic Search: a definition
    – Semantic search is a retrieval paradigm that makes use of the structure of the data or explicit schemas to understand user intent and the meaning of content, and exploits this understanding at some part of the search process.
    – An emerging field of research: Exploiting Semantic Annotations in Information Retrieval (2008-2012), the Semantic Search (SemSearch) workshop series (2008-2011), the Entity-oriented search workshop (2010-2011), the Joint Intl. Workshop on Semantic and Entity-oriented Search (2012), and the SIGIR 2012 tracks on Structured Data and Entities.
    – Related fields: XML retrieval, keyword search in databases, NL retrieval.
  • Search is required in the presence of ambiguity
    – On the query side, inputs range from keywords through NL questions and form-/facet-based inputs to structured queries (SPARQL).
    – On the data side, resources range from RDF data embedded in text (RDFa) through semi-structured and structured RDF data to OWL ontologies with rich, formal semantics.
    – Ambiguities enter through interpretation, extraction errors, data quality, and confidence/trust.
  • Common tasks in Semantic Search
    – Tasks: entity search, list search, list completion, related entity finding, semantic search.
    – Venues: SemSearch 2010/11, SemSearch 2011, the TREC ELC task, the TREC REF-LOD task.
  • Related entity ranking in web search
  • Motivation
    – Some users are short on time: they need direct answers, via query expansion, question answering, information boxes, rich results…
    – Other users have time on their hands: long-term interests such as sports, celebrities, movies and music; long-running tasks such as travel planning.
  • Example user sessions
  • Spark: related entity recommendations in web search
    – A search assistance tool for exploration.
    – Recommends related entities given the user's current query (cf. Entity Search at SemSearch, the TREC Entity Track).
    – Ranks explicit relations in a Knowledge Base (cf. the TREC Related Entity Finding in LOD (REF-LOD) task).
    – A previous version of the system has been live since 2010; see van Zwol et al.: Faceted exploration of image search results. WWW 2010: 961-970.
  • Spark example I.
  • Spark example II.
  • How does it work?
    [Pipeline diagram: a chain of Hadoop Map/Reduce jobs reading and writing under $VIS_GRID_HOME. The recoverable step titles:]
    – Step 1: feed parsing pipeline (Yalinda, Y! OMG, Y! Geo, Y! TV and editorial feeds).
    – Step 2: preprocessing pipeline, run before feature extraction and ranking (query logs, query sessions, query terms, Flickr tags, Twitter firehose; distinct-user and event counts).
    – Steps 3a-3d: feature extraction pipelines per text source (query terms, query sessions, Flickr tags, tweets), computing probability, conditional/joint probability, joint/conditional user probability, entropy, PMI, KL divergence and cosine similarity.
    – Steps 4a-4d: feature merging pipelines per source (unary, symmetric, asymmetric and reverse-asymmetric feature mergers).
    – Steps 5a-5c: feature extraction and merging for combined features, graph features (shared connections, popularity rank) and popularity features (web citation counts, coverage, Wikipedia result counts, entity/relation types).
    – Step 6: ranking pipeline: MLR scoring with a GBDT model, disambiguation, group-max and ranking formatting.
    – Step 7: datapack generation pipeline: merging rankings with entity data and dumping the final datapack.
  • High-Level Architecture View
    [Diagram: feature sources and the entity graph feed data preprocessing; feature extraction produces features from entity data; model learning combines the features with editorial judgements into a ranking model; ranking and disambiguation produce the final datapack.]
  • Spark Architecture
    [Same architecture diagram.]
  • Data Preprocessing
    [Same architecture diagram, with the data preprocessing stage highlighted.]
  • Entity graph
    – 3.4 million entities, 160 million relations.
    – Locations: Internet Locality, Wikipedia, Yahoo! Travel. Athletes and teams: Yahoo! Sports. People, characters, movies, TV shows, albums: DBpedia.
    – Example entities:
      Dbpedia Brad_Pitt Brad Pitt Movie_Actor
      Dbpedia Brad_Pitt Brad Pitt Movie_Producer
      Dbpedia Brad_Pitt Brad Pitt Person
      Dbpedia Brad_Pitt Brad Pitt TV_Actor
      Dbpedia Brad_Pitt_(boxer) Brad Pitt Person
    – Example relations:
      Dbpedia Dbpedia Brad_Pitt Angelina_Jolie Person_IsPartnerOf_Person
      Dbpedia Dbpedia Brad_Pitt Angelina_Jolie MovieActor_CoCastsWith_MovieActor
      Dbpedia Dbpedia Brad_Pitt Angelina_Jolie MovieProducer_ProducesMovieCastedBy_MovieActor
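    The entity graph above is, in essence, a set of typed nodes plus typed edges between them. As a rough in-memory sketch (illustrative only, not the actual Spark data format; the entity and relation names follow the slide's examples):

    ```python
    from collections import defaultdict

    # Hypothetical miniature of the entity graph: each entity carries a set of
    # types, and relations are typed edges between entities.
    entities = {
        "Brad_Pitt": {"Movie_Actor", "Movie_Producer", "Person", "TV_Actor"},
        "Angelina_Jolie": {"Movie_Actor", "Person"},
    }

    relations = defaultdict(list)  # entity -> [(related entity, relation type)]

    def add_relation(src, dst, rel_type):
        # Treat relations as symmetric for this sketch.
        relations[src].append((dst, rel_type))
        relations[dst].append((src, rel_type))

    add_relation("Brad_Pitt", "Angelina_Jolie", "Person_IsPartnerOf_Person")
    add_relation("Brad_Pitt", "Angelina_Jolie", "MovieActor_CoCastsWith_MovieActor")

    # Candidate related entities for a query entity are its direct neighbors.
    def related(entity):
        return sorted({dst for dst, _ in relations[entity]})

    print(related("Brad_Pitt"))  # ['Angelina_Jolie']
    ```

    The point of the structure is that ranking then operates over these explicit, typed edges rather than mining relations from text at query time.
    
    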
  • Entity graph challenges
    – Coverage of the query volume: new entities and entity types, additional inference, international data, and aliases, e.g. jlo, big apple, thomas cruise mapother iv.
    – Freshness: people query for a movie long before it’s released.
    – Irrelevant entity and relation types, e.g. voice actors who co-acted in a movie, or cities in a continent.
    – Data quality: "United States Senate career of Barack Obama" is not a person; Andy Lau has never acted in Iron Man 3.
  • Feature extraction
    [Architecture diagram, with the feature extraction stage highlighted.]
  • Feature extraction from text
    – Text sources: query terms, query sessions, Flickr tags, tweets.
    – Common representation. Input tweet: "Brad Pitt married to Angelina Jolie in Las Vegas". Output events: Brad Pitt + Angelina Jolie, Brad Pitt + Las Vegas, Angelina Jolie + Las Vegas.
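    The common representation described above, turning a piece of text into pairwise entity co-occurrence events, can be sketched as follows. The entity tagger is stubbed out with a simple alias dictionary for illustration; the real system runs a full entity-tagging pipeline on Hadoop.

    ```python
    from itertools import combinations

    # Hypothetical alias dictionary standing in for a real entity tagger.
    ALIASES = {
        "brad pitt": "Brad_Pitt",
        "angelina jolie": "Angelina_Jolie",
        "las vegas": "Las_Vegas",
    }

    def tag_entities(text):
        """Return entities whose aliases occur in the text (naive substring match)."""
        lower = text.lower()
        return [entity for alias, entity in ALIASES.items() if alias in lower]

    def cooccurrence_events(text):
        """Emit one event per unordered pair of entities seen together."""
        ents = sorted(set(tag_entities(text)))
        return list(combinations(ents, 2))

    tweet = "Brad Pitt married to Angelina Jolie in Las Vegas"
    for pair in cooccurrence_events(tweet):
        print(pair)
    # Three events, matching the slide: Brad Pitt + Angelina Jolie,
    # Brad Pitt + Las Vegas, Angelina Jolie + Las Vegas.
    ```

    Every text source (query terms, sessions, Flickr tags, tweets) is reduced to the same event stream, so the downstream feature jobs are source-agnostic.
    
    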
  • Features
    – Unary. Popularity features from text: probability, entropy, wiki id popularity… Graph features: PageRank on the entity graph, Wikipedia and the web graph. Type features: entity type.
    – Binary. Co-occurrence features from text: conditional probability, joint probability… Graph features: common neighbors… Type features: relation type.
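    Given counts over the event stream, the binary text features named above fall out directly. A minimal sketch of joint probability, conditional probability, and PMI over pair counts (the counts and function names are made up for illustration; the production system computes these as Map/Reduce jobs):

    ```python
    import math
    from collections import Counter

    def key(a, b):
        """Canonical key for an unordered entity pair."""
        return tuple(sorted((a, b)))

    # Toy event stream, as produced by the common representation step.
    events = ([("Brad_Pitt", "Angelina_Jolie")] * 6 +
              [("Brad_Pitt", "Las_Vegas")] * 2 +
              [("Angelina_Jolie", "Las_Vegas")] * 2)

    n = len(events)
    pair_counts = Counter(key(a, b) for a, b in events)
    entity_counts = Counter(e for pair in events for e in pair)

    def p(a):
        return entity_counts[a] / n

    def joint_probability(a, b):
        return pair_counts[key(a, b)] / n

    def conditional_probability(b, a):
        """P(b | a): fraction of a's events that also mention b."""
        return pair_counts[key(a, b)] / entity_counts[a]

    def pmi(a, b):
        return math.log(joint_probability(a, b) / (p(a) * p(b)))

    print(joint_probability("Brad_Pitt", "Angelina_Jolie"))        # 0.6
    print(conditional_probability("Angelina_Jolie", "Brad_Pitt"))  # 0.75
    ```

    Note the asymmetry: conditional probability is directional while joint probability and PMI are symmetric, which is why the merging pipelines distinguish symmetric from asymmetric (and reverse-asymmetric) features.
    
    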
  • Feature extraction challenges
    – Efficiency of text tagging: addressed with Hadoop Map/Reduce.
    – More features are not always better: they can lead to over-fitting without sufficient training data.
  • Model Learning
    [Architecture diagram, with the model learning stage highlighted.]
  • Model Learning
    – Training data created by editors (five grades):
      400   Brandi    adriana lima        Brad Pitt  person  Embarrassing
      1397  David H.  andy garcia         Brad Pitt  person  Mostly Related
      3037  Jennifer  benicio del toro    Brad Pitt  person  Somewhat Related
      4615  Sarah     burn after reading  Brad Pitt  person  Excellent
      9853  Jennifer  fight club movie    Brad Pitt  person  Perfect
    – Join between the editorial data and the feature file.
    – Trained a regression model using GBDT (Gradient Boosted Decision Trees).
    – 10-fold cross validation optimizing NDCG, tuning the number of trees and the number of nodes per tree.
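    The learning setup above can be reproduced in miniature with scikit-learn. This is a sketch under stated assumptions: the data is synthetic, the parameter grid is illustrative, and squared error stands in for the NDCG objective the slide's system actually optimized with its own GBDT implementation.

    ```python
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import GridSearchCV

    rng = np.random.RandomState(0)
    X = rng.rand(200, 5)                                # stand-in for the joined feature file
    y = 3 * X[:, 0] + X[:, 1] + 0.1 * rng.rand(200)     # stand-in for editorial grades

    # Tune the two knobs mentioned on the slide: number of trees and tree size.
    search = GridSearchCV(
        GradientBoostingRegressor(random_state=0),
        param_grid={"n_estimators": [50, 100], "max_leaf_nodes": [4, 8]},
        cv=10,                                          # 10-fold cross validation
    )
    search.fit(X, y)
    print(search.best_params_)

    # As on the feature-importance slide, GBDT exposes interpretable
    # per-feature importances of the fitted ensemble.
    print(search.best_estimator_.feature_importances_)
    ```

    With this synthetic target, the first feature dominates the importances, mirroring how the real model surfaces relation type and PageRank at the top of its ranking.
    
    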
  • Feature importance
      RANK  FEATURE                        IMPORTANCE
      1     Relation type                  100
      2     PageRank (related entity)      99.6075
      3     Entropy – Flickr               94.7832
      4     Probability – Flickr           82.6172
      5     Probability – Query terms      78.9377
      6     Shared connections             68.296
      7     Cond. probability – Flickr     68.0496
      8     PageRank (entity)              57.6078
      9     KL divergence – Flickr         55.4604
      10    KL divergence – Query terms    55.0662
  • Impact of training data
    [Plot: performance as a function of the number of training instances (judged relations).]
  • Performance by query-entity type
    – High overall performance, but some types are more difficult.
    – Locations: editors downgrade popular entities such as businesses.
    [Plot: NDCG by type of the query entity.]
- 37 - Model learning challenges
• Editorial preferences do not necessarily coincide with usage
  – Users click a lot more on people than expected
  – Image bias?
• Alternative: optimize for usage data
  – Clicks turned into labels or preferences
  – Size of the data is not a concern
  – Gains are computed from normalized CTR/COEC
  – See van Zwol et al. Ranking Entity Facets Based on User Click Feedback. ICSC 2010: 192–199.
(The slide embeds an excerpt from the cited paper defining the two click models. The click-through rate on a facet entity f shown for entity e is

  ctr(e,f) = clicks(e,f) / views(e,f)                          (2)

and clicks over expected clicks, which corrects for position bias following the COEC model of Chapelle and Zhang, is

  coec(e,f) = clicks(e,f) / Σ_{p=1..P} views(e,f,p) × ctr(p)   (3)

where views(e,f,p) is the number of times facet f is shown for entity e at position p, and ctr(p) is the aggregate CTR at position p. The excerpt also notes that GBDT with shallow decision trees is used as the learner, but that the editorial data was not substantial enough to learn a ranking with GBDT.)
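The CTR and COEC click models from the excerpt can be sketched over a simple impression log. This is an illustrative reconstruction, not the paper's code; the log format (entity, facet, position, clicked) is an assumption:

```python
from collections import defaultdict

def ctr_coec(log):
    # log: iterable of (entity, facet, position, clicked) impression records.
    views = defaultdict(int)        # (entity, facet) -> impressions
    clicks = defaultdict(int)       # (entity, facet) -> clicks
    pos_views = defaultdict(int)    # position -> impressions
    pos_clicks = defaultdict(int)   # position -> clicks

    for entity, facet, pos, clicked in log:
        views[(entity, facet)] += 1
        clicks[(entity, facet)] += int(clicked)
        pos_views[pos] += 1
        pos_clicks[pos] += int(clicked)

    # Position prior: aggregate CTR at each rank, used to discount position bias.
    pos_ctr = {p: pos_clicks[p] / pos_views[p] for p in pos_views}

    # Expected clicks: sum of the position prior over every impression.
    exp_clicks = defaultdict(float)
    for entity, facet, pos, _ in log:
        exp_clicks[(entity, facet)] += pos_ctr[pos]

    ctr = {k: clicks[k] / views[k] for k in views}
    coec = {k: clicks[k] / exp_clicks[k] for k in views if exp_clicks[k] > 0}
    return ctr, coec
```

COEC values near 1.0 mean a facet is clicked about as often as its display positions predict; values above 1.0 indicate it outperforms its position.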
- 38 - Ranking and Disambiguation
(Figure: system pipeline — feature sources, the entity graph, and entity data feed data preprocessing and feature extraction; the extracted features and editorial judgements drive model learning; the resulting ranking model and data pack feed the ranking and disambiguation stage)
- 39 - Ranking and Disambiguation
• We apply the ranking function offline to the data
• Disambiguation
  – How many times was a given wiki id retrieved for queries containing the entity name?
      Brad Pitt  →  Brad_Pitt                  21158
      Brad Pitt  →  Brad_Pitt_(boxer)            247
      XXX        →  XXX_(movie)                 1775
      XXX        →  XXX_(Asia_album)              89
      XXX        →  XXX_(ZZ_Top_album)            87
      XXX        →  XXX_(Danny_Brown_album)       67
  – PageRank for disambiguating locations (wiki ids are not available)
• Expansion to query patterns
  – Entity name + context, e.g. "brad pitt actor"
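The offline disambiguation step described above — picking the wiki id retrieved most often for queries containing the entity name — can be sketched as follows; the counts are the illustrative figures from the slide, and the function name is hypothetical:

```python
# Counts of how often each wiki id was retrieved for queries containing
# the entity name (figures taken from the slide).
retrievals = {
    ("Brad Pitt", "Brad_Pitt"): 21158,
    ("Brad Pitt", "Brad_Pitt_(boxer)"): 247,
}

def disambiguate(name, retrievals):
    # Return the wiki id most frequently retrieved for this entity name.
    candidates = {wid: n for (nm, wid), n in retrievals.items() if nm == name}
    return max(candidates, key=candidates.get) if candidates else None
```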
- 40 - Ranking and Disambiguation challenges
• Disambiguation cases that are too close to call
  – Fargo  →  Fargo_(film)          3969
  – Fargo  →  Fargo,_North_Dakota   4578
• Disambiguation across Wikipedia and other sources
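One simple way to detect such too-close-to-call cases is to flag a name as ambiguous when the top candidate's count does not dominate the runner-up by some margin. This is a hedged sketch of that heuristic, not the system's actual rule; the 1.5× margin is an assumption:

```python
def is_ambiguous(counts, margin=1.5):
    # counts: wiki id -> retrieval count for a single entity name.
    # Flag the name as ambiguous when the top candidate does not beat
    # the runner-up by at least `margin` (e.g. the two Fargo senses).
    top = sorted(counts.values(), reverse=True)
    if len(top) < 2 or top[1] == 0:
        return False
    return top[0] / top[1] < margin
```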
- 41 - Evaluation #2: Side-by-side testing
• Comparing two systems
  – A/B comparison, e.g. the system under development vs. the production system
  – Scale: A is better, B is better
• Separate tests for relevance and image quality
  – Image quality can significantly influence user perceptions
  – Images can violate safe-search rules
• Classification of errors
  – Results: missing important results / contains irrelevant results, too few results, entities are not fresh, more/less diverse, should not have triggered
  – Images: bad photo choice, blurry, group shots, nude/racy, etc.
• Example editorial notes
  – Borderline: the set-one entities relate to the movie Psy, but the query is most likely about Gangnam Style
  – Blondie and Mickey Gilley are 70s performers and do not belong on a list of 60s musicians
  – There is absolutely no relation between Finland and California
- 42 - Evaluation #3: Bucket testing
• Also called online evaluation
  – Comparing against a baseline version of the system
  – The baseline does not change during the test
• A small % of search traffic is redirected to the test system, another small % to the baseline system
• Data collection over at least a week, looking for statistically significant differences that are also stable over time
• Metrics in web search
  – Coverage and click-through rate (CTR)
  – Searches per browser-cookie (SPBC)
  – Other key metrics should not be impacted negatively, e.g. abandonment and retry rate, Daily Active Users (DAU), Revenue Per Search (RPS), etc.
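Checking whether a CTR difference between the test and baseline buckets is statistically significant is commonly done with a two-proportion z-test. The slide does not specify which test Spark's evaluation used, so this is an illustrative sketch of the standard approach:

```python
import math

def ctr_z_test(clicks_a, views_a, clicks_b, views_b):
    # Two-proportion z-test for a CTR difference between bucket A (test)
    # and bucket B (baseline). |z| > 1.96 is significant at the 95% level.
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    p = (clicks_a + clicks_b) / (views_a + views_b)  # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / views_a + 1 / views_b))
    return (p_a - p_b) / se if se > 0 else 0.0
```

Because daily traffic fluctuates, the slide's advice to collect at least a week of data and require stability over time guards against a significant-looking but transient difference.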
- 43 - Coverage before and after the new system
(Figure: coverage and trend lines over roughly 80 days; before Spark is deployed in production the trend is flat and lower, after release it is flat and higher)
- 44 - Click-through rate (CTR) before and after the new system
(Figure: CTR and trend lines over roughly 80 days; before release, gradually degrading performance due to lack of fresh data; after Spark is deployed in production, a learning effect as users start to use the tool again)
- 45 - Summary
• Spark
  – System for related entity recommendations
    • Knowledge base
    • Extraction of signals from query logs and other user-generated content
    • Machine-learned ranking
    • Evaluation
• Other applications
  – Recommendations on topic-entity pages
- 46 - Future work
• New query types
  – Queries with multiple entities, e.g. "adele skyfall"
  – Question answering on keyword queries, e.g. "brad pitt movies", "brad pitt movies 2010"
• Extending coverage
  – Spark is now live in CA, UK, AU, NZ, TW, HK, ES
• Even fresher data
  – Stream processing of query log data
• Data quality improvements
• Online ranking with post-retrieval features
- 47 - The End
• Many thanks to
  – Barla Cambazoglu and Roi Blanco (Barcelona)
  – Nicolas Torzec (US)
  – Libby Lin (Product Manager, US)
  – Search engineering (Taiwan)
• Contact
  – pmika@yahoo-inc.com
  – @pmika