Aardvark shalini
 

Paper Presentation

    Aardvark shalini: Presentation Transcript

    • Aardvark: The Anatomy of a Large-Scale Social Search Engine, by Damon Horowitz and Sepandar D. Kamvar. Presented by Shalini Sahoo, 11/21/2011
    • Introduction
      • Library vs. village paradigm
      • Traditional IR approaches follow the library paradigm
      • In a village, information is passed from person to person
      • The retrieval task consists of finding the right person (an expert in that field)
      • Example queries (Pg 1):
        - “Do you have any good baby-sitter recommendations in Palo-Alto for my 6-year-old twins? I’m looking for somebody that won’t let them watch TV.”
        - “Is it safe for me to take a cab alone at 3 am from SFO airport to my home in Berkeley?”
    • Differences
      Library | Village
      Keywords are used to search | Natural language is used to ask questions
      Short queries (~2.93 words) | Verbose, highly contextualized and subjective queries (~18.6 words)
      Knowledge base created by content publishers | Community forms the knowledge base
      Trust is based on authority | Trust is based on intimacy
      Retrieval involves finding the right document | Retrieval involves finding the right person
    • Aardvark
      • Is* a social search engine based on the village paradigm
      • It connects users live with friends or friends-of-friends who are able to answer their questions
      • Users submit questions via Aardvark’s website, email, instant messenger or an app on mobile devices
      • It identifies and facilitates a live chat or email conversation with one or more topic experts in the user’s extended social network
      • It was mainly used for asking subjective questions for which human judgement or recommendation was desired
      * was: Aardvark was shut down on September 30th 2011
    • Aardvark
      • It was originally developed by The Mechanical Zoo, a San Francisco-based startup founded in 2007 by Max Ventilla, Damon Horowitz
      • Timeline:
        - Early 2008: A prototype version was launched
        - March 2009: It was released to the public
        - February 2010: Google acquired it for $50 million
        - September 2011: Google shut down Aardvark*
      * A fall spring-clean: http://googleblog.blogspot.com/2011/09/fall-spring-clean.html
    • Outline
      • Overview
      • Anatomy
      • Examples
      • Analysis
      • Evaluation
      • Discussion
    • Outline
      ➡ Overview
        ‣ Main Components
        ‣ The Initiation of a User
        ‣ The Life of a Query
      • Anatomy
      • Examples
      • Analysis
      • Evaluation
      • Discussion
    • Main Components
      • Crawler and Indexer: to find and label resources that contain information
      • Query Analyzer: to understand the user’s information need
      • Ranking Function: to select the best resources to provide the information
      • User Interface: to present the information to the user in an accessible and interactive form
    • The Initiation of a User
      • The first step involves forming the “Social Graph”
      • Users can import contacts or invite friends to join:
        - from social networking sites like Facebook or LinkedIn
        - from webmail programs like Gmail or Yahoo Mail
        - by inviting friends directly
      • Users in a common group or community (e.g. studied at UT Austin, Google summer interns 2011) are added to the social graph
      • Each user’s topical expertise is indexed from these sources:
        - Users can indicate the topics in which they have expertise
        - A user’s friends can select topics for which they trust the user’s opinion
        - Users can indicate their personal webpages or blogs
        - The user’s status updates from Facebook or Twitter (if available)
    • The Initiation of a User
      • Forward Index: stores the userId, a scored list of topics, and further scores about user behavior
      • From this forward index, an inverted index is constructed
      • Inverted Index: stores each topicId and a scored list of userIds (with expertise in that topic)
      • The inverted index also stores scored lists of userIds for features like answer quality and response time
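The step from forward index to inverted index can be sketched as follows. This is a minimal illustration, not Aardvark's implementation: the dictionary layout, the user and topic names, and the function `build_inverted_index` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical forward index: userId -> {topicId: expertise score}.
forward_index = {
    "u1": {"restaurants": 0.7, "travel": 0.3},
    "u2": {"restaurants": 0.2, "programming": 0.8},
    "u3": {"travel": 0.9, "programming": 0.1},
}


def build_inverted_index(forward):
    """Invert userId -> topics into topicId -> [(userId, score)], best score first."""
    inverted = defaultdict(list)
    for user_id, topics in forward.items():
        for topic_id, score in topics.items():
            inverted[topic_id].append((user_id, score))
    # Keep each posting list sorted so the strongest experts come first.
    for postings in inverted.values():
        postings.sort(key=lambda pair: pair[1], reverse=True)
    return dict(inverted)


inverted_index = build_inverted_index(forward_index)
```

Looking up `inverted_index["restaurants"]` then yields the candidate answerers for that topic without scanning every user profile, which is the point of building the inverted index offline.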
    • The Life of a Query
      [Figure: system architecture diagram. Labeled components: Gateways (IM, Email, Twitter, iPhone, Web, SMS); Transport Layer; Conversation Manager; Question Analyzer (classifiers); Routing Engine; Index; Database (Users, Graph, Topics, Content); Importers (Web Content, Facebook, LinkedIn, etc.). Edge labels: Msgs, RSR, {RS}*]
    • Outline
      ✓ Overview
      ➡ Anatomy
        ‣ The Model
        ‣ Social Crawling
        ‣ Indexing People
        ‣ Analyzing Questions
        ‣ Ranking Algorithm
        ‣ User Interface
      • Examples
      • Analysis
      • Evaluation
      • Discussion
    • The Model
      • A network variant of an aspect model is used
      • It associates an unobserved class variable t ∈ T with each observation (i.e., the successful answer of question q by user ui):
        $$p(u_i \mid q) = \sum_{t \in T} p(u_i \mid t)\, p(t \mid q) \qquad (1)$$
      • It defines a query-independent probability of success p(ui|uj) for each potential asker/answerer pair (ui, uj), based upon their degree of social connectedness and profile similarity
      • The scoring function s(ui, uj, q) is defined as a composition of the two probabilities:
        $$s(u_i, u_j, q) = p(u_i \mid u_j) \cdot p(u_i \mid q) = p(u_i \mid u_j) \sum_{t \in T} p(u_i \mid t)\, p(t \mid q) \qquad (2)$$
      • Note from the paper: Equation 1 is a simplification of what Aardvark actually uses to match queries to answerers, presented this way for clarity and conciseness
    • The Model
      • The goal of the ranking algorithm is: given a question q from user uj, return a ranked list of users ui ∊ U that maximizes the scoring function s(ui, uj, q)
      • The scoring function allows real-time routing because much of the computation is done offline
      • The only term which needs to be computed at query time is p(t|q)
      • The distribution p(ui|t) assigns users to topics, and the distribution p(ui|uj) defines the Aardvark social graph; both are computed by the Indexer at signup time
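A minimal sketch of how such a score could be evaluated at query time, assuming s(ui, uj, q) = p(ui|uj) multiplied by the sum over topics of p(ui|t) p(t|q). The table layouts, the sample numbers, and the function names are hypothetical; only p(t|q) is treated as a query-time input, while the other two tables stand in for what the Indexer precomputes.

```python
def p_user_given_question(user, p_user_given_topic, p_topic_given_q):
    """Sum over topics of p(u_i|t) * p(t|q)."""
    return sum(
        p_user_given_topic.get(topic, {}).get(user, 0.0) * weight
        for topic, weight in p_topic_given_q.items()
    )


def rank_answerers(asker, candidates, p_user_given_topic, p_connect, p_topic_given_q):
    """Return (user, score) pairs sorted by s(u_i, u_j, q), highest first."""
    scored = []
    for user in candidates:
        if user == asker:
            continue  # never route a question back to its asker
        expertise = p_user_given_question(user, p_user_given_topic, p_topic_given_q)
        connectedness = p_connect.get((user, asker), 0.0)
        scored.append((user, connectedness * expertise))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical precomputed tables (offline): p(u_i|t) and p(u_i|u_j).
p_user_given_topic = {
    "restaurants": {"u1": 0.6, "u2": 0.4},
    "travel": {"u1": 0.2, "u2": 0.8},
}
p_connect = {("u1", "asker"): 0.5, ("u2", "asker"): 0.25}

# Hypothetical query-time output of the Question Analyzer: p(t|q).
p_topic_given_q = {"restaurants": 0.75, "travel": 0.25}

ranking = rank_answerers(
    "asker", ["asker", "u1", "u2"], p_user_given_topic, p_connect, p_topic_given_q
)
```

In this toy data both candidates have the same topical expertise for the question, so the closer social connection decides the order.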
    • Social Crawling
      • In Aardvark, people form the knowledge base rather than documents
      • The more active users there are, the more potential answerers
      • So it is important for Aardvark to create a good experience for users so that they remain active and inclined to invite their friends
      • The breadth of the Aardvark knowledge base depends upon designing interfaces and algorithms to update the topic lists for each user over time
    • Indexing People
      • The distribution p(t|ui) of topics known by user ui is computed from the following sources:
        - Users can indicate the topics in which they have expertise
        - A user’s friends can select topics for which they trust the user’s opinion
        - Users can indicate their personal webpages or blogs
        - The user’s status updates from Facebook or Twitter (if available)
      • Over time, Aardvark learns which topics not to send to a particular user by keeping track of when the user:
        - explicitly “mutes” a topic
        - declines to answer questions about a topic when given the chance
        - receives negative feedback on an answer from the asker
    • Indexing People
      • Periodically, a topic strengthening algorithm is used
      • For a user ui and his group of friends U, and for some topic t, if p(t|ui) ≠ 0, then
        $$s(t \mid u_i) = p(t \mid u_i) + \gamma \sum_{u \in U} p(t \mid u)$$
        where γ is a small constant; the values are then renormalized to form probabilities
      • Other smoothing techniques are used to record the possibility that a user might be able to answer questions about additional topics not explicitly mentioned in their profile
      • p(ui|t) is computed using Bayes’ law, using a uniform distribution for p(ui):
        $$p(u_i \mid t) = \frac{p(t \mid u_i)\, p(u_i)}{p(t)} \qquad (3)$$
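The two steps on this slide can be sketched in a few lines: boost a user's score for each topic they already have by a small constant times the friends' probabilities for that topic, renormalize, and then apply Bayes' law to obtain p(ui|t). The value of `GAMMA`, the sample profiles, and the choice to estimate p(t) as the marginal over the stored profiles are assumptions made for illustration only.

```python
from collections import defaultdict

GAMMA = 0.1  # hypothetical value for the "small constant"


def strengthen(user, friends, p_topic_given_user, gamma=GAMMA):
    """Boost topics the user already has (p != 0) by the friends' mass, then renormalize."""
    scores = {}
    for topic, p in p_topic_given_user[user].items():
        if p == 0:
            continue  # the rule only applies when p(t|u_i) != 0
        boost = sum(p_topic_given_user.get(f, {}).get(topic, 0.0) for f in friends)
        scores[topic] = p + gamma * boost
    total = sum(scores.values())
    return {topic: s / total for topic, s in scores.items()}


def users_given_topic(p_topic_given_user):
    """Bayes' law with uniform p(u); p(t) is taken as the marginal (an assumption)."""
    p_user = 1.0 / len(p_topic_given_user)
    p_topic = defaultdict(float)
    for dist in p_topic_given_user.values():
        for topic, p in dist.items():
            p_topic[topic] += p * p_user
    return {
        topic: {
            user: dist[topic] * p_user / p_topic[topic]
            for user, dist in p_topic_given_user.items()
            if dist.get(topic, 0.0) > 0
        }
        for topic in p_topic
    }


# Hypothetical p(t|u) profiles.
profiles = {
    "alice": {"wine": 0.5, "travel": 0.5},
    "bob": {"wine": 1.0},
    "carol": {"wine": 0.5, "hiking": 0.5},
}

alice_strengthened = strengthen("alice", ["bob", "carol"], profiles)
p_user_given_topic = users_given_topic(profiles)
```

Because both of Alice's friends know about wine, her renormalized wine score rises above its original 0.5 while her travel score falls correspondingly.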
    • Connectedness
      • Connectedness between users p(ui|uj) is computed using a weighted cosine similarity over the following feature set:
        - Social connection
        - Demographic similarity
        - Profile similarity
        - Vocabulary match
        - Chattiness match
        - Verbosity match
        - Politeness match
        - Speed match
      • p(ui|uj) is stored in the social graph
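A weighted cosine similarity can be sketched as below. The slides do not say how each feature is encoded or weighted, so the per-user numeric vectors, the feature names used as keys, and the weights are all hypothetical; the sketch only shows the similarity computation itself.

```python
import math


def weighted_cosine(a, b, weights):
    """Cosine similarity where each feature k contributes with weight weights[k]."""
    dot = sum(w * a[k] * b[k] for k, w in weights.items())
    norm_a = math.sqrt(sum(w * a[k] ** 2 for k, w in weights.items()))
    norm_b = math.sqrt(sum(w * b[k] ** 2 for k, w in weights.items()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no signal for at least one user
    return dot / (norm_a * norm_b)


# Hypothetical feature weights and per-user style vectors.
weights = {"chattiness": 1.0, "verbosity": 0.5, "politeness": 2.0, "speed": 1.0}
user_i = {"chattiness": 0.8, "verbosity": 0.4, "politeness": 0.9, "speed": 0.5}
user_j = {"chattiness": 0.7, "verbosity": 0.5, "politeness": 0.8, "speed": 0.6}
quiet = {"chattiness": 0.0, "verbosity": 0.0, "politeness": 0.0, "speed": 0.0}

similarity = weighted_cosine(user_i, user_j, weights)
```

With non-negative features the result lies between 0 and 1, which makes it convenient to store as an edge weight in a social graph.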
    • Analyzing Questions
      • The main goal of the Question Analyzer is: given a question q, determine a scored list of topics p(t|q)
      • The following classifiers are run on a question:
        - NonQuestion Classifier
        - InappropriateQuestion Classifier
        - TrivialQuestion Classifier
        - LocationSensitive Classifier
      • Next, the list of relevant topics is produced by merging outputs from several TopicMapper algorithms:
        - KeywordMatchTopicMapper
        - TaxonomyTopicMapper
        - SalientTermTopicMapper
        - UserTagTopicMapper
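The slides do not say how the TopicMapper outputs are merged. One simple possibility, shown here purely as an assumption, is to sum the per-mapper scores for each topic and normalize the result so it can be read as p(t|q). The mapper outputs below are made-up stand-ins for what the real mappers would return.

```python
from collections import defaultdict


def merge_topic_scores(mapper_outputs):
    """Sum per-topic scores across mappers and normalize them to sum to 1."""
    merged = defaultdict(float)
    for output in mapper_outputs:
        for topic, score in output.items():
            merged[topic] += score
    total = sum(merged.values())
    if total == 0:
        return {}  # no mapper produced any topic
    return {topic: score / total for topic, score in merged.items()}


# Hypothetical outputs of two mappers for a restaurant question.
keyword_output = {"restaurants": 1.0}
taxonomy_output = {"restaurants": 0.5, "san francisco": 0.5}

p_topic_given_q = merge_topic_scores([keyword_output, taxonomy_output])
```

A topic found by several mappers ends up with more weight than one found by a single mapper, which is one plausible reason to combine them at all.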
    • Analyzing Questions
      • The TopicMapper algorithms are continuously evaluated
      • For a given question, all the returned topics (those used to select an answerer), together with a much larger list of candidate relevant topics, are assigned scores by two human judges
      • Result: 89% precision and 84% recall of relevant topics
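Precision and recall for a single question can be computed as sketched below, treating the judges' scores as a simple relevant/not-relevant set. That binary reduction and the sample topics are assumptions; the slides do not describe the judges' scoring scale.

```python
def precision_recall(returned, relevant):
    """Precision = hits / returned topics; recall = hits / relevant topics."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical mapper output and judged-relevant topics for one question.
returned_topics = ["restaurants", "lunch", "san francisco", "weddings"]
relevant_topics = ["restaurants", "lunch", "san francisco", "fine dining", "events"]

precision, recall = precision_recall(returned_topics, relevant_topics)
```

Averaging these two numbers over a sample of questions gives aggregate figures of the kind reported on the slide.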
    • Ranking Algorithm
      • The topic list generated by the Question Analyzer is sent to the Routing Engine, which then determines the top answerers for the given question
      • The main factors that determine the ranking of users are:
        - Topic expertise p(ui|q)
        - Connectedness p(ui|uj)
        - Availability
      • From this ordered list of users, the Routing Engine then filters out users who should not be contacted:
        - based on their preferred time of contact
        - based on how often they have been contacted in the recent past
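The filtering step can be sketched as a pass over the ranked list that drops anyone outside their preferred contact hours or contacted too often recently. The profile fields, the 24-hour window, and the limit of three contacts are hypothetical parameters chosen for the example, not values from the paper.

```python
from datetime import datetime, timedelta


def filter_candidates(ranked, profiles, now,
                      max_recent_contacts=3, window=timedelta(hours=24)):
    """Keep ranked (user, score) pairs whose owners may be contacted right now."""
    kept = []
    for user, score in ranked:
        profile = profiles[user]
        start_hour, end_hour = profile["contact_hours"]
        if not (start_hour <= now.hour < end_hour):
            continue  # outside the user's preferred time of contact
        recent = [t for t in profile["contacted_at"] if now - t <= window]
        if len(recent) >= max_recent_contacts:
            continue  # contacted too often in the recent past
        kept.append((user, score))
    return kept


now = datetime(2011, 11, 21, 14, 0)
profiles = {
    "u1": {"contact_hours": (9, 22), "contacted_at": []},
    "u2": {"contact_hours": (18, 23), "contacted_at": []},
    "u3": {
        "contact_hours": (0, 24),
        "contacted_at": [now - timedelta(hours=h) for h in (1, 2, 3)],
    },
}
ranked = [("u2", 0.95), ("u1", 0.9), ("u3", 0.5)]

contactable = filter_candidates(ranked, profiles, now)
```

Filtering after ranking preserves the score order, so the first remaining entry is the best answerer who can actually be contacted.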
    • User Interface
      • The various user interfaces of Aardvark are built on top of real-time communication channels such as IM, email, SMS, iPhone, Twitter and Web-based messaging
    • Outline
      ✓ Overview
      ✓ Anatomy
      ➡ Examples
      • Analysis
      • Evaluation
      • Discussion
    • Examples
      EXAMPLE 1
        (Question from Mark C./M/LosAltos,CA) I am looking for a restaurant in San Francisco that is open for lunch. Must be very high-end and fancy (this is for a small, formal, post-wedding gathering of about 8 people).
        (+4 minutes -- Answer from Nick T./28/M/SanFrancisco,CA -- a friend of your friend Fritz Schwartz) fringale (fringalesf.com) in soma is a good bet; small, fancy, french (the french actually hang out there too). Lunch: Tuesday - Friday: 11:30am - 2:30pm
        (Reply from Mark to Nick) Thanks Nick, you are the best PM ever!
        (Reply from Nick to Mark) youre very welcome. hope the days theyre open for lunch work...
      EXAMPLE 2
        (Question from James R./M/TwinPeaksWest,SF) What is the best new restaurant in San Francisco for a Monday business dinner? Fish & Farm? Gitane? Quince (a little older)?
        (+7 minutes -- Answer from Paul D./M/SanFrancisco,CA -- A friend of your friend Sebastian V.) For business dinner I enjoyed Kokkari Estiatorio at 200 Jackson. If you prefer a place in SOMA i recommend Ozumo (a great sushi restaurant).
        (Reply from James to Paul) thx I like them both a lot but I am ready to try something new
        (+1 hour -- Answer from Fred M./29/M/Marina,SF) Quince is a little fancy... La Mar is pretty fantastic for cevice - like the Slanted Door of peruvian food...
      EXAMPLE 3
        (Question from Brian T./22/M/Castro,SF) What is a good place to take a spunky, off-the-cuff, social, and pretty girl for a nontraditional, fun, memorable dinner date in San Francisco?
        (+4 minutes -- Answer from Dan G./M/SanFrancisco,CA) Start with drinks at NocNoc (cheap, beer/wine only) and then dinner at RNM (expensive, across the street).
        (Reply from Brian to Dan) Thanks!
        (+6 minutes -- Answer from Anthony D./M/Sunnyvale,CA -- you are both in the Google group) Take her to the ROTL production of Tommy, in the Mission. Best show ive seen all year!
        (Reply from Brian to Anthony) Tommy as in the Whos rock opera? COOL!
        (+10 minutes -- Answer from Bob F./M/Mission,SF -- you are connected through Mathias friend Samantha S.) Cool question. Spork is usually my top choice for a first date, because in addition to having great food and good really friendly service, it has an atmosphere thats perfectly in between casual and romantic. Its a quirky place, interesting funny menu, but not exactly non-traditional in the sense that youre not eating while suspended from the ceiling or anything
    • Outline
      ✓ Overview
      ✓ Anatomy
      ✓ Examples
      ➡ Analysis
      • Evaluation
      • Discussion
    • Analysis
      • As of October 2009, Aardvark had 90,361 registered users
      • The average query volume was 3,167.2 questions per day in this period
      [Figure: chart of users]
    • Analysis
      • Mobile users were particularly active
        - It is easier to reply to questions in the form of IM or SMS on a phone
        - People are more comfortable using natural language in an IM setting than in a web search setting
      • Questions are highly contextualized
        - Average query length is 18.6 words
      • Questions often have a subjective element
      [Figure 8: Categories of questions sent to Aardvark. Categories shown: websites & internet apps; business research; music, movies, TV, books; sports & recreation; home & cooking; finance & investing; technology & programming; miscellaneous; Aardvark; local services; travel; product reviews & help; restaurants & bars]
    • Analysis
      • Questions get answered quickly
        [Figure 9: Distribution of questions and answering times. Vertical axis: questions answered (×10^4); time buckets: 0−3 min, 3−6 min, 6−12 min, 12−30 min, 30 min−1 hr, 1−4 hr, 4+ hr]
      • Answers are of high quality
        - Answers are comprehensive and concise
        - Median answer length was 22.2 words
        - 70.4% of inline feedback rated answers as ‘good’, 14.1% rated as ‘OK’ and 15.5% of answers were rated as ‘bad’
    • Analysis
      • There is a broad range of answerers
      • Social proximity matters
      • People are indexable
    • Outline
      ✓ Overview
      ✓ Anatomy
      ✓ Examples
      ✓ Analysis
      ➡ Evaluation
      • Discussion
    • Evaluation
      • Compared to Google!
      • “Do you want to help Aardvark run an experiment?” was inserted into a random sample of active questions
      • Users were asked to reformulate their question as a query and search on Google
      • Users timed how long it took to find a satisfactory result and also rated the quality of the answers
      • 71.5% of queries were answered successfully on Aardvark, with a mean rating of 3.93
      • 70.5% of queries were answered successfully on Google, with a mean rating of 3.07
    • Outline
      ✓ Overview
      ✓ Anatomy
      ✓ Examples
      ✓ Analysis
      ✓ Evaluation
      ➡ Discussion
    • Discussion
      • Participation Fatigue: (Pg 9) “86.7% users have been contacted by Aardvark with a request to answer a question, and of those, 70% have looked at the question and 38% could answer a question. 20% of the users accounted for 85% of answers” What happens when this thin slice of users gets overwhelmed and starts dropping out?
      • Availability: There can be cases when the topic expert(s) in your social graph are not online. Do you think having an “offline” mode would be helpful?
      • Evaluation: Could we get a better understanding of how well Aardvark worked if it had been compared to another social search engine that works on the same paradigm? How can that be achieved?