2. StubHub
• Mission: bring the joy of live events to fans globally
• Acquired by eBay in 2007
• World’s largest ticket marketplace
• About 1 ticket is sold on StubHub every 1.3 seconds
• Every day, StubHub sends 80,000+ fans to events
• Present in 48 countries
• 200+ partnerships worldwide
• All 30 MLB teams
• NFL, NBA, NHL, MLS, NCAA, and others
4. Example Queries
• “Giants”
• “The the”
• “The white elephants”
• “Concerts this weekend”
• “Find events in San Francisco under $50”
5. Example Queries - Challenges
• “giants”
• entity disambiguation – New York Giants vs San Francisco Giants
• “the the”
• relevancy – more on this later
• “the white elephants”
• alias detection – a nickname for the Oakland Athletics
• “concerts this weekend”
• entity detection – concerts [category] this weekend [date/time]
• “find events in san francisco under $50”
• entity detection – find events [category] in san francisco [city] under
$50 [price]
6. Bag of Words approach
Example query: “Taylor Swift concerts”
• Tokenize: “Taylor”, “Swift”, “concerts”
• Remove stop words: no stop words present, so tokens are unchanged: “Taylor”, “Swift”, “concerts”
Problems:
• “Giants game” vs “the game”
• “game” is a stop word in one case and an artist (the rapper The Game) in the other
• “the the band” (The The is a band)
• excluding “the” removes all information
• including “the” returns every result containing “the” and “band”
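A minimal sketch of the bag-of-words pipeline above, showing how stop-word removal destroys the query for the band The The (the stop-word list here is illustrative; real lists are much longer):

```python
# Illustrative stop-word list, not StubHub's actual list
STOP_WORDS = {"the", "a", "an", "of", "for", "this"}

def bag_of_words(query):
    """Tokenize on whitespace, then drop stop words."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]

print(bag_of_words("Taylor Swift concerts"))  # ['taylor', 'swift', 'concerts']
print(bag_of_words("the the band"))           # ['band'] -- all signal about The The is lost
```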
7. Query Understanding & Entity Detection
Making sense of the query and sending results to Solr
e.g., “find giants tickets this weekend at at&t park”
• find giants [Performer] tickets this weekend [date] at at&t park [venue]
• “giants” -> PerformerId:197
• “this weekend” -> [2018-09-08T00:00:00.000Z TO 2018-09-10T00:00:00.000Z]
• “at&t park” -> VenueId: 82
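The resolved entities can then be assembled into a Solr filter query. A sketch under assumptions: `PerformerId` and `VenueId` are the field names shown on the slide, while `EventDate` and the input dict layout are hypothetical:

```python
def to_solr_fq(entities):
    """Build a Solr fq string from resolved entities (field names partly assumed)."""
    clauses = []
    if "performer_id" in entities:
        clauses.append("PerformerId:%d" % entities["performer_id"])
    if "date_range" in entities:
        start, end = entities["date_range"]
        # EventDate is an assumed field name, not from the slides
        clauses.append("EventDate:[%s TO %s]" % (start, end))
    if "venue_id" in entities:
        clauses.append("VenueId:%d" % entities["venue_id"])
    return " AND ".join(clauses)

fq = to_solr_fq({
    "performer_id": 197,
    "date_range": ("2018-09-08T00:00:00.000Z", "2018-09-10T00:00:00.000Z"),
    "venue_id": 82,
})
print(fq)
# PerformerId:197 AND EventDate:[2018-09-08T00:00:00.000Z TO 2018-09-10T00:00:00.000Z] AND VenueId:82
```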
8. Ambiguity
• Conflicts between entities:
• “bruno mars weekend”
• “red sky july”
• “steve march”
• “steve [performer] march [date]” or “steve march [performer]”
• Solution: more user queries
• Bootstrapping and encouraging user behavior with a conservative approach
13. Query Classification
• Differentiate between “precise” and “conversational” queries
• Precise -> “giants this weekend”
• Conversational -> “find me a giants game in new york happening this weekend”
• WEKA Naïve Bayes classifier
• Accuracy of 96% on generated queries, with spot-checking on a few randomly selected queries
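The talk used WEKA's (Java) Naïve Bayes classifier; the same idea can be sketched in a few lines of Python. The training examples and smoothing details below are illustrative, not StubHub's:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesQueryClassifier:
    """Multinomial Naive Bayes over query tokens, with Laplace smoothing."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()

    def train(self, query, label):
        self.class_counts[label] += 1
        self.word_counts[label].update(query.lower().split())

    def predict(self, query):
        tokens = query.lower().split()
        total = sum(self.class_counts.values())
        vocab = len({w for c in self.word_counts.values() for w in c})
        best, best_score = None, float("-inf")
        for label, prior in self.class_counts.items():
            counts = self.word_counts[label]
            n = sum(counts.values())
            score = math.log(prior / total)
            for t in tokens:
                # Laplace smoothing so unseen tokens don't zero out the class
                score += math.log((counts[t] + 1) / (n + vocab))
            if score > best_score:
                best, best_score = label, score
        return best

clf = NaiveBayesQueryClassifier()
clf.train("giants this weekend", "precise")
clf.train("maroon 5 under $25", "precise")
clf.train("find me a giants game happening this weekend", "conversational")
clf.train("show me concerts in new york please", "conversational")
print(clf.predict("find me cheap tickets please"))  # conversational
```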
14. Rule-based Entity Detection
• Based on predefined rules or patterns and lookup
• Not particularly accurate (~70%)
• Conservative approach - does not return many false positives
• e.g.,
• PERFORMER, CONJUNCTION, PERFORMER, PRICE
“sf giants vs Oakland a’s under 30”
• PERFORMER, DATE, PRICE
“maroon 5 next month under $25”
• PERFORMER, PRICE
“foo fighters under $200”
• UNKNOWN, DATE, PRICE
“tickets for this weekend under $20”
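A sketch of the rule-based detector described above: replace known spans via gazette/regex lookup, then accept the query only if the resulting label sequence matches a predefined pattern. Gazettes and rules here are trimmed to the slide's examples; the real ones are far larger:

```python
import re

# Tiny illustrative gazettes; real ones come from the full catalog
PERFORMERS = {"sf giants", "oakland a's", "maroon 5", "foo fighters"}
DATES = {"next month", "this weekend"}
CONJUNCTIONS = {"vs", "and"}
RULES = {  # allowed label sequences, from the slide
    ("PERFORMER", "CONJUNCTION", "PERFORMER", "PRICE"),
    ("PERFORMER", "DATE", "PRICE"),
    ("PERFORMER", "PRICE"),
    ("UNKNOWN", "DATE", "PRICE"),
}

def label_sequence(query):
    """Replace known spans with labels, tag the rest, collapse UNKNOWN runs."""
    q = query.lower()
    q = re.sub(r"under \$?\d+", " PRICE ", q)
    for p in PERFORMERS:
        q = q.replace(p, " PERFORMER ")
    for d in DATES:
        q = q.replace(d, " DATE ")
    labels = []
    for tok in q.split():
        if tok in {"PRICE", "PERFORMER", "DATE"}:
            labels.append(tok)
        elif tok in CONJUNCTIONS:
            labels.append("CONJUNCTION")
        else:
            labels.append("UNKNOWN")
    collapsed = []
    for lab in labels:  # "tickets for" should count as one UNKNOWN
        if not (collapsed and lab == "UNKNOWN" and collapsed[-1] == "UNKNOWN"):
            collapsed.append(lab)
    return tuple(collapsed)

def matches(query):
    return label_sequence(query) in RULES

print(matches("sf giants vs oakland a's under 30"))   # True
print(matches("tickets for this weekend under $20"))  # True
```

Because any query whose label sequence is not in `RULES` is simply rejected, the approach is conservative: modest recall (~70%), but few false positives.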
15. Stanford NLP &
Conditional Random Fields (CRF)
Find me [UNKNOWN] giants [PERFORMER] at [UNKNOWN] AT&T Park [VENUE] this weekend [DATE/TIME]
16. Training
• Gazettes -> List of entities
• Features
• shape features -> n-grams
• use ordinals
• use class features
• order of CRF
• use word
• use date range
• gazette features
• 27 features in total
• 95% accuracy on generated queries*
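A sketch of what a per-token CRF feature extractor might look like. The deck does not list the 27 features, so the feature names, the tiny gazette, and the shape heuristic below are illustrative:

```python
# Illustrative single-token gazette; real gazettes are generated from the
# catalog, and multi-word entries (e.g. "at&t park") need span matching,
# which is omitted here.
GAZETTE = {"giants": "PERFORMER"}
ORDINALS = {"first", "second", "third"}

def token_features(tokens, i):
    """Feature dict for token i, computed at every position for the CRF."""
    tok = tokens[i].lower()
    return {
        "word": tok,                                   # "use word": giants is almost always a performer
        "is_ordinal": tok in ORDINALS,                 # "use ordinals"
        "gazette": GAZETTE.get(tok, "O"),              # gazette features
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",  # order-2 context
        "shape": "digits" if tok.isdigit() else "letters",       # crude shape feature
    }

print(token_features("find giants tickets".split(), 1))
```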
17. Performer Disambiguation
• giants -> San Francisco Giants, New York Giants, San Jose Giants etc.
• Disambiguate using user click-count data on suggestions, weighted by user location
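One way to sketch the click-count disambiguation; the counts and the log layout below are invented for illustration:

```python
# Hypothetical click counts on suggestions, keyed by (query, user location)
CLICKS = {
    ("giants", "san francisco"): {"San Francisco Giants": 940,
                                  "San Jose Giants": 45,
                                  "New York Giants": 15},
    ("giants", "new york"): {"New York Giants": 880,
                             "San Francisco Giants": 90},
}

def disambiguate(query, location):
    """Pick the performer that users in this location most often clicked."""
    counts = CLICKS.get((query.lower(), location.lower()))
    return max(counts, key=counts.get) if counts else None

print(disambiguate("giants", "New York"))  # New York Giants
```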
18. Alias Detection
• Alias generation on index side
• e.g., “the white elephants” for Oakland Athletics
• e.g., “the boys from the bay” for San Francisco Giants
• Conservative approach to alias generation
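On the index side, alias generation can be as simple as expanding a curated nickname table into extra searchable names for each performer. A sketch using the slide's two examples:

```python
# Curated alias table (index-side); kept conservative -- only
# well-attested nicknames are added
ALIASES = {
    "the white elephants": "Oakland Athletics",
    "the boys from the bay": "San Francisco Giants",
}

def expand_aliases(performer_names):
    """Return extra (alias -> canonical) index entries for known performers."""
    known = set(performer_names)
    return {alias: canon for alias, canon in ALIASES.items() if canon in known}

print(expand_aliases(["Oakland Athletics", "San Francisco Giants"]))
```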
Notes on the training features (slide 16):
• Shape features: bigram / trigram / etc.
• Ordinals: first, second, third
• Class features: label of the previous word (i.e., entity type)
• Order of CRF: how many words to look at (order=2 means use two words)
• Use word: e.g., “giants” is almost always the performer, so give a bias towards performer
• Use date range: e.g., this weekend, in October, etc.
• Gazette features: lists of the entities that we support