Rank by Time or by Relevance?
Revisiting Email Search
November 17th, 2015
David Carmel, Guy Halawi, Liane Lewin-Eytan, Yoelle Maarek, Ariel Raviv
Haifa Labs
Motivation
▪  “Email search still remains difficult, time-consuming and
frustrating” (Elsweiler et al. 2011)
▪  By default, all existing Web mail services display search results
in reverse chronological order
▪  makes the discovery of older messages very hard
▪  Imposes strict constraints on message matching
Email Search Today (Time ordered)
Searching for an (old) application form for “Visa to India”
Search in Yahoo Mail
▪  Boolean Search model
•  Each query is a Boolean expression (AND, OR, NOT)
•  Generally, all query terms must appear in at least one of the
message fields (AND operation)
▪  Ranking
•  Default: by Recency (Reverse Chronological ordering)
•  (pseudo)-Relevance – implementation is based on matching
query terms
▪  almost never used by users
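The restricted (AND) matching model above can be sketched as follows; the whitespace tokenization and flat field handling are illustrative simplifications, not Yahoo Mail's actual implementation:

```python
# Minimal sketch of the restricted (AND) matching model: a message
# matches only if every query term appears in at least one field.
# Whitespace tokenization is a simplification for illustration.
def matches_and(query_terms, message_fields):
    """message_fields: dict mapping field name (subject, from, body, ...) to text."""
    tokens = set(" ".join(message_fields.values()).lower().split())
    return all(term.lower() in tokens for term in query_terms)

msg = {"subject": "Visa to India", "body": "application form attached"}
print(matches_and(["visa", "india"], msg))   # True
print(matches_and(["visa", "brazil"], msg))  # False
```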
Challenge
▪  Challenge the traditional prevalent chronological ranking in Web email
search
›  investigate whether an email-specific relevance ranking could bring any value to our
users
▪  Introduce mail-specific relevance ranking consisting of two phases:
›  Relaxed matching phase to improve recall
›  Comprehensive ranking phase using a rich set of mail-specific features to improve
precision
Email Queries: What do people search for?
▪ Very short queries: 1.5 terms on avg.
› Re-find intent
– looking for a specific previous message
› Contact queries ~40%
– Picture, email address, phone number,
physical address, links, attachments,
appointments (time/date), conversation
▪ Tasks involved:
› Couponing (Pizza coupon?)
› Tracking items (bill paid, package shipped)
› Looking up account / registration info
› Social media (searching for comments/posts)
The Search Process
▪  Standard two-phase retrieval process:
›  First phase: retrieve a pool of messages qualified as potentially relevant
to the query
•  Two matching models:
–  Restricted (AND mode)
–  Relaxed: any message containing at least one of the query terms in any
of its fields is considered a match
›  Second phase:
•  Ranks these messages using a rich set of features
–  Scores messages with a linear model learned using a learning-to-rank
approach
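The relaxed first phase can be sketched as below; the pool cap of 100 follows the "~100 matched messages" per data point quoted on the LTR slide, and everything else is an illustrative assumption:

```python
# Sketch of the relaxed (recall) phase: a message matches if ANY query
# term appears in ANY field; the second phase then re-ranks this pool.
def matches_relaxed(query_terms, message_fields):
    tokens = set(" ".join(message_fields.values()).lower().split())
    return any(term.lower() in tokens for term in query_terms)

def candidate_pool(query_terms, messages, limit=100):
    # limit mirrors the ~100 matched messages per LTR data point
    return [m for m in messages if matches_relaxed(query_terms, m)][:limit]

inbox = [{"subject": "pizza coupon"},
         {"subject": "flight to india"},
         {"subject": "visa application"}]
print(len(candidate_pool(["visa", "india"], inbox)))  # 2
```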
REX - Relevance EXtended Ranking Model
Based on an LTR framework using
several sets of features:
▪  Message
▪  Recipient
▪  Sender
▪  Message-Query Similarity
Message Features
▪  Freshness exponential decay over the message age
▪  User Actions replied, forwarded, flagged, drafted, read,..
▪  Attachment has attachment, attachment type / size
▪  Folder folder type (inbox, draft, sent, user defined folder)
▪  Exchange Type reply/forward, in-thread
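The Freshness feature's exponential decay over message age can be sketched as below; the decay rate is an arbitrary illustrative value, as the deck does not give one:

```python
import math

# Sketch of the Freshness feature: exponential decay over message age.
# The decay rate lam is an arbitrary illustrative value (not from the deck).
def freshness(age_days, lam=0.01):
    return math.exp(-lam * age_days)

print(round(freshness(0), 3))    # 1.0
print(round(freshness(365), 3))  # 0.026
```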
Recipient Features
▪  To recipient mentioned in To
▪  Cc recipient mentioned in Cc
▪  In Group recipient was not mentioned explicitly
Sender Features
Vertical:
▪  User-sender connection correspondence volume / type
▪  Self correspondence sender is user
Horizontal:
▪  Sender inbound / outbound traffic volume and ratio
▪  Sender urls usage volume and ratio in messages
▪  Sender recipients number avg. per message
▪  Sender recipients actions ratio over messages
Message-Query Similarity Features
▪  BM25f textual similarity between a query and the entire message
•  Considering query term distribution over message fields (Subject, From, To, Body,
Attachment)
▪  TF-IDF measures the (tf-idf) similarity of each message field
independently of the others
▪  Coord fraction of query terms that occur in the message
Proximity
Taking into account proximity between query terms in content
▪  Neighborhood boosting consecutive matches
▪  Proximity boosting tokens found closely in a fixed
window (5) with no ordering
▪  Prefix allowing prefix match but with score decay using
length difference
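Two of the similarity features above (Coord and windowed Proximity) can be sketched as follows; the whitespace tokenizer is a simplification, the window of 5 matches the deck, and field weighting and BM25f are omitted:

```python
# Illustrative sketch of two similarity features: Coord (fraction of
# query terms present) and windowed Proximity (all terms within a fixed
# 5-token window, any order). Whitespace tokenization is a simplification.
def coord(query_terms, message_text):
    tokens = set(message_text.lower().split())
    return sum(1 for t in query_terms if t.lower() in tokens) / len(query_terms)

def proximity(query_terms, message_text, window=5):
    tokens = message_text.lower().split()
    terms = {t.lower() for t in query_terms}
    return int(any(terms <= set(tokens[i:i + window])
                   for i in range(len(tokens))))

print(coord(["visa", "india"], "visa application form for India"))  # 1.0
print(proximity(["visa", "india"], "visa to india soon"))           # 1
```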
Learning to Rank (LTR)
▪  Data point: < query | ~100 matched messages | clicked message >
▪  Datasets:
›  Corporate 100K random queries from the corporate query log
›  Web-mail 10K random queries
›  Editorial 500 queries judged by editors
▪  LTR Algorithm AROW (Crammer et al. 2013)
[Figure: pairwise LTR illustration — candidate messages d1..d4 are each scored by a linear model ∑ wi·fi(d); when the clicked message d4 scores lower than a non-clicked message (∑ wi·fi(d4) < ∑ wi·fi(d1)), the weight vector w is updated so that d4 moves toward the top.]
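The pairwise update in the LTR illustration can be sketched as a simple perceptron-style step; note this is not the AROW algorithm the deck cites, which additionally maintains a confidence estimate per weight:

```python
# Perceptron-style pairwise update in the spirit of the LTR illustration:
# if the clicked message scores below another candidate, nudge the weight
# vector w toward the clicked message's features. This is NOT the AROW
# algorithm used in the paper, which also keeps per-weight confidence.
def score(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))

def pairwise_update(w, f_clicked, f_other, lr=0.1):
    if score(w, f_clicked) <= score(w, f_other):  # clicked ranked too low
        w = [wi + lr * (fc - fo) for wi, fc, fo in zip(w, f_clicked, f_other)]
    return w

w = pairwise_update([0.0, 0.0], f_clicked=[1.0, 0.2], f_other=[0.1, 0.9])
print(score(w, [1.0, 0.2]) > score(w, [0.1, 0.9]))  # True
```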
Experimental Results
Performance Measures
▪  Mean reciprocal rank (MRR)
the average, over all queries, of the
reciprocal rank of the clicked message
▪  Success@K
The fraction of queries for which the clicked message is
found in the top-K results
▪  NDCG@K
when we have several relevance feedback levels
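Assuming a single clicked (relevant) message per query with ranks counted from 1, the first two measures can be computed as:

```python
# Sketch of the evaluation measures, assuming exactly one clicked
# (relevant) message per query and ranks counted from 1.
def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

def success_at_k(ranks, k):
    return sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 2, 10]           # rank of the clicked message, per query
print(round(mrr(ranks), 4))     # 0.4833
print(success_at_k(ranks, 3))   # 0.75
```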
Time Vs. REX (Corporate Dataset)
Algorithm MRR (+lift %)
Time 0.3722
REX (fresh. + sim.) 0.4261 (+14.48%)
REX (fresh. + sim. + actions) 0.4550 (+22.24%)
REX (fresh. + sim. + actions + sender) 0.4548 (+22.19%)
Time Vs. REX (Web Mail Dataset)
Algorithm MRR (+lift %)
Time 0.3717
REX (fresh. + sim.) 0.3785 (+1.81%)
REX (fresh. + sim. + actions) 0.4238 (+14%)
REX (fresh. + sim. + actions + sender) 0.4258 (+14.55%)
Time vs REX (as a function of the Result set size)
Relative improvement of REX over Time increases as more
messages in the user inbox match the query
Relative Feature Importance
▪  In general, REX ranker significantly outperforms Chronological ranker
›  both in the Corporate and in the Web datasets
▪  Relative Feature Importance:
Freshness >> User actions >> Similarity >> Sender features
›  Freshness:
•  Years >> Months >> Weeks >> Days
›  User actions
•  Read >> Forwarded >> Flagged >> Replied >> Draft >> Ham >> Spam
›  Similarity:
•  coord >> tf-idf (From > Subject > Body >Attachment > To) >> BM25f
▪  Surprisingly low significance of the sender features: why?
Time Vs. REX (Editorial Dataset)
Algorithm MRR (+lift %) NDCG@10 (+lift %)
Time 0.3629 0.4936
REX 0.5105 (+40.65%) 0.6647 (+34.66%)
| Query | Intent | Algo A: Most Relevant | Algo A: Related | Algo B: Most Relevant | Algo B: Related |
|---|---|---|---|---|---|
| Lila dress | Discussion about dress for party | 2 | 3,5,7 | 4 | 1,2,6,7 |
| Spense KE | Schedule for Spense KE meeting | 5 | 2,4,8,9 | 1 | 3,4,5 |
Editors Feedback
“... Sometimes, I had the feeling that Algo. B was
really reading my mind to put in the first place
exactly the email message I was thinking of ...”
“...Today, after I ran it again, it was not that much
impressive, but still I have the feeling it was the
type of search that gave me the best results..."
Email Search Tomorrow (REX ordered)
Searching for an (old) application form for “Visa to India”
Conclusions
●  We challenged the traditional chronological sort for email search
o  While freshness is still super important, it should be integrated into the
relevance model together with many other important features
o  REX performs significantly better than time-based ranking
o  The model can easily be extended with more signals as they become
available
●  Are mail users ready to depart from chronological sort in favor of
modern relevance ranking?
●  Time will tell
●  At the very least, REX gives our users the option
●  More details can be found in our CIKM 2015 paper:
o  Rank by Time or by Relevance? Revisiting Email Search
Future Work
▪  Enriching the set of ranking features
›  Solving the mystery:
•  why the Sender features do not contribute to the ranking
›  Adding query-based features based on query intent analysis
▪  Personalization
›  Adding the User into the ranking model
▪  User Study
›  Better understanding user needs
•  how users search over their mailboxes
Yahoo's new Mobile Mail application
Thanks for listening
