Engineering challenges in vertical search engines

+

Engineering Challenges
in Vertical Search Engines
Aleksandar Bradic, Senior Director,
Engineering and R&D, Vast.com

+
Introduction

  Vertical Search
  Search focused on vertical data
  Vertical Data – data inherently described by its structure:
  Items/Properties for sale (Automotive, Real Estate..)
  Geographical Data (Neighborhoods, Locations..)
  Services (Hotels, Transportation..)
  Businesses (Restaurants, Nightlife..)
  Events (Concerts, Plays..)
  Auction items (Collectibles, Art..)
  Metadata (News, Social Data, Reviews..)
  …

+
Introduction

  Vertical Search != Full Text Search
  Full Text Search queries:
  “Cheap tickets for Broadway shows this week”
  “Trendy Restaurants in San Francisco near SoMa”
  “3-day trips from NYC to anywhere under $1000”
  Vertical Search queries:
  “price-sorted results below two standard deviations from tickets
category with Broadway as location and date range of 2010-04-11 to
2010-04-18”
  “distance-sorted results relative to center of SF/SoMa matching the
appropriate threshold of composite score of user review scores and
historical change in query/review volume”
  “total cost-sorted results for all 3-day intervals within next 6 months
combining hotel and airfare price below max value of $1000 for all
valid locations”
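
The first vertical query above can be sketched as a structured filter plus a sort. This is only an illustration: the `Listing` fields, the function name, and the reading of "below two standard deviations" as "at or below the category mean minus two standard deviations" are assumptions, not something the slides specify.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean, pstdev


@dataclass
class Listing:
    # Hypothetical schema for a ticket listing.
    category: str
    location: str
    event_date: date
    price: float


def cheap_listings(listings, category, location, start, end, k=2.0):
    """Price-sorted listings priced at least k std devs below the category mean."""
    in_category = [item for item in listings if item.category == category]
    if not in_category:
        return []
    prices = [item.price for item in in_category]
    # Assumed interpretation: cutoff = category mean - k * population std dev.
    cutoff = mean(prices) - k * pstdev(prices)
    matches = [
        item for item in in_category
        if item.location == location
        and start <= item.event_date <= end
        and item.price <= cutoff
    ]
    return sorted(matches, key=lambda item: item.price)
```

Every condition in the query maps to a field comparison, which is what separates it from a full text query that has to be interpreted first.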

+
Introduction

  Vertical Search = search on structured data

  Vertical Search at Web-Scale:
  Web-Scale datasets
  Web-Scale query volumes
  Interactive operation
  Low latency requirements
  Utility maximization across all involved parties

  => loads of fun ! : )

+
@Vast.com

  Vast.com : Vertical Search & Analytics Platform

  Powering vertical search on Bing, Yahoo, AOL, KBB, Southwest
Airlines, etc..

+
@Vast.com

  Daily processing of up to 1 TB of unstructured and semi-structured
Web data

  Managing an operational dataset of ~150M records across multiple
verticals

  Handling peak search loads of > 1000 queries/sec

  We’re hiring ! : )

+
Challenges in Vertical Search Engines
  Web Data Retrieval

  Unstructured Data

  Data Processing Infrastructures

  Vertical Search

  Data Analytics

  Computational Advertising

+
Web Data Retrieval

  Crawler Architecture
  Queue Management
  Crawl Ordering Policies
  Duplicate URL Detection
  Content Hash Management
  Politeness Management
  Coverage Measurement
  Freshness Optimization
  Incremental Crawling
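
Three of the items above (duplicate URL detection, content hash management, politeness management) can be sketched in a few lines. This is a minimal single-process sketch, not a production crawler design; the `Frontier` class, its method names, and the fixed per-host delay are assumptions made for illustration.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    default_port = {"http": 80, "https": 443}.get(scheme)
    netloc = host if parts.port in (None, default_port) else f"{host}:{parts.port}"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, netloc, parts.path or "/", parts.query, ""))


class Frontier:
    """Hypothetical crawl queue with URL/content dedup and a per-host delay."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay   # seconds between fetches to the same host
        self.seen_urls = set()
        self.seen_hashes = set()
        self.queue = []
        self.last_fetch = {}         # host -> time of last fetch

    def add(self, url):
        key = normalize_url(url)
        if key in self.seen_urls:
            return False             # duplicate URL
        self.seen_urls.add(key)
        self.queue.append(key)
        return True

    def next_url(self, now):
        """Return the first queued URL whose host may be fetched at `now`."""
        for i, url in enumerate(self.queue):
            host = urlsplit(url).hostname
            if now - self.last_fetch.get(host, float("-inf")) >= self.min_delay:
                self.last_fetch[host] = now
                return self.queue.pop(i)
        return None                  # every queued host is still cooling down

    def is_new_content(self, body):
        """Exact-duplicate detection on the fetched bytes."""
        digest = hashlib.sha256(body).hexdigest()
        if digest in self.seen_hashes:
            return False
        self.seen_hashes.add(digest)
        return True
```

At Web scale the in-memory sets would likely be replaced by a shared store or a probabilistic structure, and an exact hash will not catch near-duplicate pages.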

+
Web Data Retrieval

  “Deep Web” crawling
  Locating Deep Web Content Sources
  Selecting Relevant Sources
  Estimating Database Size
  Understanding Content / Form Detection
  Automatic Dispatch of HTML Forms
  Predicting content in free text forms
  Crawling non-HTML Content
  Estimating Query Result Sparsity
  URL Generation problem
  Query Covering Problem

+
Web Data Retrieval

  Focused (Topical) Crawling
  Content Classification
  Link Content Prediction
  Topic Relevance Estimation

  Modeling Temporal Characteristics
  Site-Level Evolution
  Page-Level Evolution

  Adversarial Crawling
  Web Spam Detection
  Cloaked Content Detection

+
Unstructured Data

  Unstructured Data – information that does not have a
pre-defined data model

  Handling Unstructured Data:
  Data Cleaning
  Tagging with Metadata
  Vertical Classification
  Schema Matching
  Information Extraction

Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!
make model year trim price ???

+
Unstructured Data

  Information extraction from unstructured, ungrammatical
data
  Reference Sets - relational data sets that consist of a collection of
known entities with associated common attributes
  Reference Set Selection
  Reference Set Generation
  Record Linkage : Finding the “best matching” member of the reference
set for a given post
  Challenge : Automatic Generation of Reference Sets
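
Record linkage against a reference set can be illustrated with a small sketch. The reference set contents, the token-overlap score, and the regular expressions for year and price are all assumptions chosen to keep the example short; real posts would need fuzzier matching.

```python
import re

# Hypothetical reference set of known (make, model, trim) entities.
REFERENCE_SET = [
    {"make": "Ford", "model": "Focus", "trim": "Convertible"},
    {"make": "Ford", "model": "Focus", "trim": "Sedan"},
    {"make": "Ford", "model": "Fiesta", "trim": "Hatchback"},
    {"make": "Honda", "model": "Civic", "trim": "Sedan"},
]


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_match(post, reference_set):
    """Return the reference entry sharing the most tokens with the post."""
    tokens = tokenize(post)

    def score(entry):
        entry_tokens = tokenize(" ".join(entry.values()))
        return len(tokens & entry_tokens) / len(entry_tokens)

    best = max(reference_set, key=score)
    return best if score(best) > 0 else None


def extract(post, reference_set=REFERENCE_SET):
    """Combine the linked reference entry with pattern-based year and price."""
    record = dict(best_match(post, reference_set) or {})
    year = re.search(r"\b(?:19|20)\d{2}\b", post)
    price = re.search(r"\$\s?(\d[\d,]*)", post)
    record["year"] = int(year.group(0)) if year else None
    record["price"] = int(price.group(1).replace(",", "")) if price else None
    return record


print(extract("Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!"))
```

The reference set supplies the attributes the post never labels (make, model, trim), which is why generating such sets automatically is listed as the open challenge.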

+
Data Processing Infrastructures

  Infrastructures for continuous processing of unbounded streams
of unstructured data
  Information Extraction as part of processing (non-trivial
computation for each processed entry)

  Inherently distributed infrastructures - in order to support
performance and scalability

  Time-to-site constraints. Ability to process out-of-band data.

  Support for complex operations on aggregated data
(de-duplication, static ranking, data enrichment, data cleaning/filtering …)

  Support for data archival and off-line analysis

+

  Distributed Computing Platforms:

  Batch-oriented (MapReduce, Hadoop, BigTable, HBase…)

  Stream-oriented (Flume, S4, Stream SQL…)

  Distributed Data Stores (Dynamo/Cassandra/Riak…)

  The curse of CAP Theorem:
  It is impossible for a distributed system to simultaneously provide
all three of the following guarantees:
  Consistency
  Availability
  Partition tolerance

+
Vertical Search

  Large-Scale structured data search

  Providing both analytic functionality and the canonical set of
Information Retrieval functionality

  Entries are represented in Vector Space Model

  Each result is represented as a data point – a tuple consisting of
the appropriate number of fields:

(make, model, year, trim …)

+
Vertical Search

  Search in Vector Space Model
  Resulting subset generation
  Sorting as linearization using selected metric
  Dynamic subset criteria calculation
  Search Result Clustering
  “Similar” result search
  …

… with up to ~100 ms response time
… at 10M+ records in index
… handling 100+ queries/sec/host
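
Subset generation, sorting on a selected metric, and “similar” result search can be sketched over records treated as points. The sample records, field names, and the choice of range-scaled Euclidean distance are assumptions for illustration; a linear scan like this would not meet the latency targets above at 10M+ records.

```python
from math import sqrt

# Hypothetical records; each is a data point over the same fields.
RECORDS = [
    {"id": 1, "make": "Ford", "model": "Focus", "year": 2008, "price": 7000, "mileage": 60000},
    {"id": 2, "make": "Ford", "model": "Focus", "year": 2009, "price": 8500, "mileage": 45000},
    {"id": 3, "make": "Ford", "model": "Fiesta", "year": 2011, "price": 9000, "mileage": 30000},
    {"id": 4, "make": "Honda", "model": "Civic", "year": 2008, "price": 7500, "mileage": 70000},
    {"id": 5, "make": "Honda", "model": "Civic", "year": 2003, "price": 3000, "mileage": 150000},
]


def search(records, filters, sort_key, descending=False):
    """Subset generation (all predicates must hold), then sort on one metric."""
    subset = [
        r for r in records
        if all(pred(r[field]) for field, pred in filters.items())
    ]
    return sorted(subset, key=lambda r: r[sort_key], reverse=descending)


def similar(records, target, fields, top_k=3):
    """Rank other records by Euclidean distance over range-scaled fields."""
    spans = {}
    for f in fields:
        values = [r[f] for r in records]
        spans[f] = (max(values) - min(values)) or 1  # avoid division by zero

    def distance(r):
        return sqrt(sum(((r[f] - target[f]) / spans[f]) ** 2 for f in fields))

    others = [r for r in records if r["id"] != target["id"]]
    return sorted(others, key=distance)[:top_k]


fords = search(RECORDS, {"make": lambda m: m == "Ford", "price": lambda p: p <= 8500}, "price")
print([r["id"] for r in fords])
print([r["id"] for r in similar(RECORDS, RECORDS[0], ["year", "price", "mileage"], top_k=2)])
```

Scaling each field by its range keeps mileage from dominating the distance simply because its numbers are larger.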

+
Vertical Search

  Faceted Search
  fac-et (fas’it) :
  1. One of the flat polished surfaces cut on a gemstone or occurring
naturally on a crystal.
  2. One of numerous aspects, as of a subject.

  Vocabulary problem for faceted data
  Facet Design / selection
  “the keywords that are assigned by indexers are often at
odds with those tried by searchers.”
  Selection of information-distinguishing facet values
  User-specific faceted search
  Dynamic correlated facet generation
  Distributing facet computation
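
Facet computation itself is simple to state; the hard parts listed above are selection, vocabulary, and distribution. The sketch below counts facet values for a result set. The function name and the rule that a facet ignores its own active filter (so alternative values stay visible) are assumptions, one common design among several.

```python
from collections import Counter


def facet_counts(records, facet_fields, filters=None):
    """Count values per facet over records matching the other facets' filters."""
    filters = filters or {}
    counts = {}
    for field in facet_fields:
        # Skip the filter on this facet itself so its alternatives remain countable.
        others = {f: v for f, v in filters.items() if f != field}
        matching = (
            r for r in records
            if all(r.get(f) == v for f, v in others.items())
        )
        counts[field] = Counter(r[field] for r in matching if field in r)
    return counts


records = [
    {"make": "Ford", "model": "Focus", "year": 2008},
    {"make": "Ford", "model": "Fiesta", "year": 2011},
    {"make": "Ford", "model": "Focus", "year": 2009},
    {"make": "Honda", "model": "Civic", "year": 2008},
]
print(facet_counts(records, ["make", "model"], {"make": "Ford"}))
```

Because each facet's count is a sum over records, partial counts from separate index shards can be added together, which is one way the computation could be distributed.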

+
Data Analytics

  Clickstream Data Analysis

  Learning from implicit user feedback

  Anonymous user clustering

  Learning to rank

  Inventory/Market Trends

  Rare Event detection

  Price Prediction

  Spam Content detection

+
Data Analytics

  Challenges:
  “Good Deal” detection
  Recommendation Systems for Vertical Data with no explicit user
feedback
  Accuracy of Automatic Valuation Models
  Data-driven feature design
  Click Prediction
  User Behavior Modeling
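
One way to connect “Good Deal” detection to an automatic valuation model is to flag listings priced well below what the model predicts. The sketch below uses a one-feature least-squares line (price against mileage) as a stand-in valuation model; the feature choice, the 15% discount threshold, and the sample data are assumptions, and a real model would use many more features and be fitted on separate data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def good_deals(listings, discount=0.15):
    """Ids of listings priced at least `discount` below the predicted price."""
    slope, intercept = fit_line(
        [item["mileage"] for item in listings],
        [item["price"] for item in listings],
    )
    deals = []
    for item in listings:
        predicted = slope * item["mileage"] + intercept
        if predicted > 0 and item["price"] <= (1 - discount) * predicted:
            deals.append(item["id"])
    return deals


# Hypothetical listings; mileage is in thousands of miles.
listings = [
    {"id": 1, "mileage": 20, "price": 12000},
    {"id": 2, "mileage": 40, "price": 10000},
    {"id": 3, "mileage": 60, "price": 8000},
    {"id": 4, "mileage": 80, "price": 6000},
    {"id": 5, "mileage": 50, "price": 6000},
]
print(good_deals(listings))
```

The quality of the flags depends directly on the accuracy of the valuation model, which is why that accuracy appears as its own challenge in the list.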

+
Computational Advertising

  The central problem of computational advertising is to find
the "best match" between a given user in a given context and a
suitable advertisement.

(Figure: results page layout with regions labeled “ads”, “ads” and “search results !”)

+

  Vertical Search presents an additional challenge in the sense
that any of the actual search results can be “sponsored”

(Figure: results page layout with individual results labeled “ad ?”)

+

  Central challenge:
  Find the “best match” between a given user in a given context
and a suitable advertisement
  “best match” – maximizing the value for :
  Users
  Advertisers
  Publishers
  Each of the parties has a different set of utilities:
  Users want relevance
  Advertisers want ROI and volume
  Publishers want revenue per impression/search

+

  CTR (Click-Through Rate) Estimation:
  Reactive (statistically significant historical CTR)
  Predictive (CTR estimated from features of ads)
  Hybrid (historical + predictive)

  Personalization of CTR Computation ?
  Dynamic CTR Estimation (online algorithms)

P(click) = ?
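
The hybrid approach can be sketched as smoothing: the predicted CTR acts as a prior worth a fixed number of pseudo-impressions, and observed clicks gradually take over as history accumulates. The function name and the `prior_strength` value are assumptions, not a method the slides prescribe.

```python
def hybrid_ctr(clicks, impressions, predicted_ctr, prior_strength=100.0):
    """Blend historical CTR with a model prediction.

    The prediction counts as `prior_strength` pseudo-impressions, so a new ad
    (no history) gets the predicted CTR and a well-observed ad converges to
    clicks / impressions.
    """
    return (clicks + prior_strength * predicted_ctr) / (impressions + prior_strength)


print(hybrid_ctr(0, 0, predicted_ctr=0.02))       # new ad: purely predictive
print(hybrid_ctr(50, 900, predicted_ctr=0.02))    # some history: blended
print(hybrid_ctr(5000, 100000, predicted_ctr=0.02))  # long history: mostly reactive
```

Updating `clicks` and `impressions` as events arrive also gives a simple online estimate, since no refitting is needed between updates.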

+

  Analytical Apparatus:
  Regression Analysis (Linear, Logistic, Probit model, High
Dimensional methods)
  Game Theory (Nash Equilibria, dominant strategy)
  Auction Theory (Vickrey, GSP, VCG…)
  Graph Theory (random walks on graphs, graph matching, etc.)
  Information Retrieval Techniques (similarity metrics, etc.)
  …

+
Conclusion

  Vertical Search & Analytics at Web Scale == fun !!!

  Source of a large number of relevant research & engineering
problems !

  Opportunity to apply a wide spectrum of techniques across all
areas of Computer Science and Engineering !

Jump on the bandwagon ! : )
