+ Engineering Challenges in Vertical Search Engines
  Aleksandar Bradic, Senior Director, Engineering and R&D, Vast.com
+ Introduction
  Vertical Search: search focused on vertical data
  Vertical data: data inherently described by its structure:
  - Items/properties for sale (Automotive, Real Estate, ...)
  - Geographical data (Neighborhoods, Locations, ...)
  - Services (Hotels, Transportation, ...)
  - Businesses (Restaurants, Nightlife, ...)
  - Events (Concerts, Plays, ...)
  - Auction items (Collectibles, Art, ...)
  - Metadata (News, Social Data, Reviews, ...)
  - …
+ Introduction
  Vertical Search != Full Text Search
  Full text search queries:
  - “Cheap tickets for Broadway shows this week”
  - “Trendy Restaurants in San Francisco near SoMa”
  - “3-day trips from NYC to anywhere under $1000”
  Vertical search queries:
  - “price-sorted results below two standard deviations from the tickets category, with Broadway as location and a date range of 2010-04-11 to 2010-04-18”
  - “distance-sorted results relative to the center of SF/SoMa, matching the appropriate threshold of a composite score of user review scores and historical change in query/review volume”
  - “total-cost-sorted results for all 3-day intervals within the next 6 months, combining hotel and airfare price below a max value of $1000, for all valid locations”
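The first vertical query above can be read as a filter plus a sort over structured records. The following is a minimal sketch of that reading, not a description of any production system: the `Listing` type, the `cheap_listings` function, and the choice to compute the mean and standard deviation over the matching subset are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, pstdev


@dataclass(frozen=True)
class Listing:
    # Hypothetical record shape; a real vertical would carry many more fields.
    category: str
    location: str
    day: date
    price: float


def cheap_listings(listings, category, location, start, end, k=2.0):
    """Return matching listings priced more than k standard deviations
    below the mean price of the matching subset, sorted by price."""
    matches = [
        item for item in listings
        if item.category == category
        and item.location == location
        and start <= item.day <= end
    ]
    if len(matches) < 2:
        return []  # not enough data to estimate a spread
    prices = [item.price for item in matches]
    threshold = mean(prices) - k * pstdev(prices)
    cheap = [item for item in matches if item.price < threshold]
    return sorted(cheap, key=lambda item: item.price)
```

Called as `cheap_listings(listings, "tickets", "Broadway", date(2010, 4, 11), date(2010, 4, 18))`, it returns only the listings that are unusually cheap for that category, location, and week.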
+ Introduction
  Vertical Search = search on structured data
  Vertical Search at Web scale:
  - Web-scale datasets
  - Web-scale query volumes
  - Interactive operation
  - Low latency requirements
  - Utility maximization across all involved parties
  => loads of fun! :)
+ @Vast.com
  - Daily processing of up to 1 TB of unstructured and semi-structured Web data
  - Managing an operational dataset of ~150M records across multiple verticals
  - Handling peak search loads of > 1000 queries/sec
  We’re hiring! :)
+ Challenges in Vertical Search Engines
  - Web Data Retrieval
  - Unstructured Data
  - Data Processing Infrastructures
  - Vertical Search
  - Data Analytics
  - Computational Advertising
+ Web Data Retrieval
  “Deep Web” crawling:
  - Locating Deep Web content sources
  - Selecting relevant sources
  - Estimating database size
  - Understanding content / form detection
  - Automatic dispatch of HTML forms
  - Predicting content in free-text forms
  - Crawling non-HTML content
  - Estimating query result sparsity
  - URL generation problem
  - Query covering problem
+ Unstructured Data
  Unstructured data: information that does not have a pre-defined data model
  Handling unstructured data:
  - Data Cleaning
  - Tagging with Metadata
  - Vertical Classification
  - Schema Matching
  - Information Extraction
  Example post: “Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!”
  Fields to extract: make, model, year, trim, price ???
+ Unstructured Data
  Information extraction from unstructured, ungrammatical data
  Reference sets: relational data sets that consist of collections of known entities with associated common attributes
  - Reference set selection
  - Reference set generation
  - Record linkage: finding the “best matching” member of the reference set for a given post
  Challenge: automatic generation of reference sets
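Record linkage against a reference set can be illustrated with a deliberately naive matcher. This is a sketch under stated assumptions, not the extraction method used at Vast.com: the three-entry `REFERENCE_SET`, the token-overlap score, and the regular expressions for year and price are all invented for the example, and the matcher only handles single-word attribute values.

```python
import re

# Hypothetical, hand-written reference set of known entities.
REFERENCE_SET = [
    {"make": "Ford", "model": "Focus", "trim": "Convertible"},
    {"make": "Ford", "model": "Fiesta", "trim": "Hatchback"},
    {"make": "Honda", "model": "Civic", "trim": "Sedan"},
]


def tokenize(text):
    """Lowercase the post and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def link_record(post, reference_set=REFERENCE_SET):
    """Return the reference entity sharing the most attribute values
    with the post, or None if nothing matches at all."""
    tokens = tokenize(post)

    def score(entity):
        values = [value.lower() for value in entity.values()]
        return sum(value in tokens for value in values) / len(values)

    best = max(reference_set, key=score)
    return best if score(best) > 0 else None


def extract(post):
    """Combine the linked entity with year and price found by regex."""
    record = dict(link_record(post) or {})
    year = re.search(r"\b(?:19|20)\d{2}\b", post)
    price = re.search(r"\$\s?(\d[\d,]*)", post)
    if year:
        record["year"] = int(year.group(0))
    if price:
        record["price"] = int(price.group(1).replace(",", ""))
    return record
```

For the post “Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!”, `extract` fills make, model, and trim from the linked reference entity and takes year and price from the text itself. The hard part named on the slide, generating the reference set automatically, is exactly what this sketch sidesteps by hard-coding it.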
+ Data Processing Infrastructures
  Infrastructures for continuous processing of unbounded streams of unstructured data:
  - Information extraction as part of processing (non-trivial computation for each processed entry)
  - Inherently distributed infrastructures, in order to support performance and scalability
  - Time-to-site constraints
  - Ability to process out-of-band data
  - Support for complex operations on aggregated data (de-duplication, static ranking, data enrichment, data cleaning/filtering, ...)
  - Support for data archival and off-line analysis
+ Data Processing Infrastructures
  Distributed computing platforms:
  - Batch-oriented (MapReduce, Hadoop, BigTable, HBase, ...)
  - Stream-oriented (Flume, S4, Stream SQL, ...)
  - Distributed data stores (Dynamo/Cassandra/Riak, ...)
  The curse of the CAP theorem: it is impossible for a distributed system to simultaneously provide all three of the following guarantees:
  - Consistency
  - Availability
  - Partition tolerance
+ Vertical Search
  - Large-scale structured data search
  - Providing both analytic and canonical sets of Information Retrieval functionality
  - Entries are represented in the Vector Space Model
  - Each result is represented as a data point: a tuple consisting of the appropriate number of fields (make, model, year, trim, ...)
+ Vertical Search
  Search in the Vector Space Model:
  - Resulting subset generation
  - Sorting as linearization using a selected metric
  - Dynamic subset criteria calculation
  - Search result clustering
  - “Similar” result search
  ... with up to ~100 ms response time
  ... at 10M+ records in index
  ... handling 100+ queries/sec/host
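Three of the operations listed above (subset generation, sorting as linearization, and “similar” result search) can be sketched over records treated as points in a numeric space. This is an in-memory illustration only, assuming hypothetical function names and a range-normalized Euclidean distance; it says nothing about how an index meeting the latency figures on the slide would be built.

```python
import math


def subset(records, **ranges):
    """Subset generation: keep records whose fields fall inside
    inclusive (low, high) ranges, e.g. price=(5000, 10000)."""
    return [
        rec for rec in records
        if all(low <= rec[field] <= high for field, (low, high) in ranges.items())
    ]


def linearize(records, metric):
    """Sorting as linearization: order points along one scalar metric."""
    return sorted(records, key=metric)


def similar(records, target, fields, k=3):
    """'Similar' result search: the k nearest records to target, using
    Euclidean distance with each field scaled by its observed range."""
    spans = {
        f: (max(r[f] for r in records) - min(r[f] for r in records)) or 1
        for f in fields
    }

    def distance(rec):
        return math.sqrt(
            sum(((rec[f] - target[f]) / spans[f]) ** 2 for f in fields)
        )

    others = [rec for rec in records if rec is not target]
    return sorted(others, key=distance)[:k]
```

Scaling each field by its range keeps a field with large raw values, such as price, from dominating one with small values, such as year; other normalizations would be equally defensible.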
+ Vertical Search
  Faceted Search
  fac-et (fas’it): 1. One of the flat polished surfaces cut on a gemstone or occurring naturally on a crystal. 2. One of numerous aspects, as of a subject.
  - Vocabulary problem for faceted data: “the keywords that are assigned by indexers are often at odds with those tried by searchers.”
  - Facet design / selection
  - Selection of information-distinguishing facet values
  - User-specific faceted search
  - Dynamic correlated facet generation
  - Distributing facet computation
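At its simplest, faceted search means counting the values of each facet over the current result set and letting the user narrow the set by picking one. The sketch below shows only that core loop; `facet_counts` and `drill_down` are hypothetical names, and the harder problems on the slide (facet selection, correlated facets, distributed computation) are out of its scope.

```python
from collections import Counter


def facet_counts(results, facet_fields):
    """For each facet field, count how many results carry each value."""
    return {
        field: Counter(rec[field] for rec in results if field in rec)
        for field in facet_fields
    }


def drill_down(results, **selected):
    """Narrow the result set to records matching every selected facet value."""
    return [
        rec for rec in results
        if all(rec.get(field) == value for field, value in selected.items())
    ]
```

After each `drill_down`, the counts are recomputed on the narrowed set, so the facet values offered to the user always reflect what is still reachable.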
+ Data Analytics
  - Clickstream data analysis
  - Learning from implicit user feedback
  - Anonymous user clustering
  - Learning to rank
  - Inventory/market trends
  - Rare event detection
  - Price prediction
  - Spam content detection
+ Data Analytics
  Challenges:
  - “Good Deal” detection
  - Recommendation systems for vertical data with no explicit user feedback
  - Accuracy of Automatic Valuation Models
  - Data-driven feature design
  - Click prediction
  - User behavior modeling
+ Computational Advertising
  The central problem of computational advertising is to find the "best match" between a given user in a given context and a suitable advertisement.
  [Figure: page layout with regions labeled “ads”, “ads”, and “search results”]
+ Computational Advertising
  Vertical Search presents an additional challenge in the sense that any of the actual search results can be “sponsored”.
  [Figure: search results annotated with “ad ?” labels]
+ Computational Advertising
  Central challenge: find the “best match” between a given user in a given context and a suitable advertisement
  “Best match”: maximizing the value for:
  - Users
  - Advertisers
  - Publishers
  Each of the parties has a different set of utilities:
  - Users want relevance
  - Advertisers want ROI and volume
  - Publishers want revenue per impression/search
+ Computational Advertising
  Analytical apparatus:
  - Regression analysis (linear, logistic, probit model, high-dimensional methods)
  - Game theory (Nash equilibria, dominant strategy)
  - Auction theory (Vickrey, GSP, VCG, ...)
  - Graph theory (random walks on graphs, graph matching, etc.)
  - Information retrieval techniques (similarity metrics, etc.)
  - …
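Of the auction formats listed above, the generalized second-price (GSP) auction is the easiest to show in a few lines. The sketch below is a simplified textbook version, assuming ads are ranked by bid alone; practical systems usually also weight bids by a quality or click-probability estimate, which is omitted here. The function name and the `reserve` parameter are illustrative.

```python
def gsp_auction(bids, num_slots, reserve=0.0):
    """Generalized second-price auction, ranked by bid only.

    bids: mapping of advertiser name -> bid per click.
    Returns a list of (advertiser, price_per_click), one per filled slot,
    where each winner pays the bid of the advertiser ranked just below,
    or the reserve price if there is none.
    """
    ranked = sorted(
        ((bid, name) for name, bid in bids.items() if bid >= reserve),
        key=lambda pair: (-pair[0], pair[1]),  # highest bid first, name breaks ties
    )
    outcome = []
    for position, (bid, name) in enumerate(ranked[:num_slots]):
        if position + 1 < len(ranked):
            price = ranked[position + 1][0]
        else:
            price = reserve
        outcome.append((name, price))
    return outcome
```

With a single slot this reduces to a Vickrey (second-price) auction. With several slots, unlike VCG, bidding one's true value is generally not a dominant strategy, which is where the game-theoretic tools on the slide come in.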
+ Conclusion
  Vertical Search & Analytics at Web scale == fun!!!
  - Source of a large number of relevant research & engineering problems!
  - Opportunity to tackle a wide spectrum of techniques across all areas of Computer Science and Engineering!
  Jump on the bandwagon! :)