SlideShare a Scribd company logo
+




    Engineering Challenges
    in Vertical Search Engines
    Aleksandar Bradic, Senior Director,
    Engineering and R&D, Vast.com
+
    Introduction

        Vertical Search
             Search focused on vertical data
             Vertical Data – data inherently described by it’s structure:
                Items/Properties for sale (Automotive, Real Estate..)

                  Geographical Data (Neighborhoods, Locations..)
                  Services (Hotels, Transportation..)
                  Businesses (Restaurants, Nightlife..)
                  Events (Concerts, Plays..)
                  Auction items (Collectibles, Art..)
                  Metadata (News, Social Data, Reviews..)
                  …
+
    Introduction

        Vertical Search != Full Text Search
             Full Text Search queries:
                “Cheap tickets for Broadway shows this week”
                “Trendy Restaurants in San Francisco near SoMa”
                “3-day trips from NYC to anywhere under $1000”
             Vertical Search queries:
                “price-sorted results bellow two standard deviations from tickets
                 category with Broadway as location and date range of 2010-04-11 to
                 2010-04-18”
                “distance-sorted results relative to center of SF/SoMa matching the
                 appropriate threshold of composite score of user review scores and
                 historical change in query/review volume”
                “total cost-sorted results for all 3-day intervals within next 6 months
                 combining hotel and airfare price bellow max value of $1000 for all
                 valid locations”
+
    Introduction

        Vertical Search = search on structured data

        Vertical Search at Web-Scale:
             Web-Scale datasets
             Web-Scale query volumes
             Interactive operation
             Low latency requirements
             Utility maximization across all involved parties

        => loads of fun ! : )
+
    @Vast.com

        Vast.com : Vertical Search & Analytics Platform

        Powering vertical search on Bing, Yahoo, AOL, KBB, Southwest
         Airlines, etc..
+
    @Vast.com

        Daily processing up to 1Tb of unstructured and semi-
         structured Web data

        Managing ~150M records operational dataset across multiple
         verticals

        Handling > 1000 query/sec peak search query loads



        We’re hiring ! : )
+
    Challenges in Vertical Search
    Engines
        Web Data Retrieval

        Unstructured Data

        Data Processing Infrastructures

        Vertical Search

        Data Analytics

        Computational Advertising
+
    Web Data Retrieval

        Crawler Architecture
             Queue Management
             Crawl Ordering Policies
             Duplicate URL Detection
             Content Hash Management
             Politeness Management
             Coverage Measurement
             Freshness Optimization
             Incremental Crawling
+
    Web Data Retrieval

        ”Deep Web” crawling
             Locating Deep Web Content Sources
             Selecting Relevant Sources
             Estimating Database Size
             Understanding Content / Form Detection
             Automatic Dispatch of HTML Forms
             Predicting content in free text forms
             Crawling non-HTML Content
             Estimating Query Result Sparsity
             URL Generation problem
             Query Covering Problem
+
    Web Data Retrieval

        Focused (Topical) Crawling
             Content Classification
             Link Content Prediction
             Topic Relevance Estimation

        Modeling Temporal Characteristics
             Site-Level Evolution
             Page-Level Evolution

        Adversarial Crawling
             Web Spam Detection
             Cloaked Content Detection
+
    Unstructured Data

        Unstructured Data – information that does not have a pre-
         defined data model

        Handling Unstructured Data:
             Data Cleaning
             Tagging with Metadata
             Vertical Classification
             Schema Matching
             Information Extraction


    Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!

    Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!!
make            model   year    trim          price                  ???
+
    Unstructured Data

        Information extraction from unstructured, ungrammatical
         data
             Reference Sets - relational data sets that consist of collection of
              known entities with associated common attributes
             Reference Set Selection
             Reference Set Generation
             Record Linkage : Finding “best matching” member of reference
              set corresponding post
             Challenge : Automatic Generation of Reference Sets
+
    Data Processing Infrastructures

        Infrastructures for continuous processing of unbounded streams
         of unstructured data
        Information Extraction as part of processing (non-trivial
         computation per each processed entry)

        Inherently distributed infrastructures - in order to support
         performance and scalability

        Time-to-site constraints. Ability to process out-of band data.

        Support for complex operations on aggregated data (de-
         duplication, static ranking, data enrichment, data cleaning/
         filtering …)

        Support for data archival and off-line analysis
+
    Data Processing Infrastructures
+
    Data Processing Infrastructures

        Distributed Computing Platforms:

             Batch-oriented (MapReduce, Hadoop, BigTable, HBase…)

             Stream-oriented (Flume, S4, Stream SQL…)

             Distributed Data Stores (Dynamo/Cassandra/Riak…)

        The curse of CAP Theorem:
             It is impossible for a distributed system to simultaneously provide
              all three of the following guarantees:
                Consistency
                Availability
                Partition tolerance
+
    Vertical Search

        Large-Scale structured data search

        Providing both analytic and canonical set of Information
         Retrieval functionalities

        Entries are represented in Vector Space Model

        Each result is represented as data point – tuple consisting of
         appropriate number of fields :

         (make, model, year, trim …)
+
    Vertical Search

        Search in Vector Space Model
             Resulting subset generation
             Sorting as linearization using selected metric
             Dynamic subset criteria calculation
             Search Result Clustering
             “Similar” result search
             …



… with up to ~100 ms milliseconds response time
… at 10M+ records in index
… handling 100+ queries/sec/host
+
    Vertical Search

        Faceted Search
             fac-et (fas’it) :
                1. One of the flat polished surfaces cut on a gemstone or occurring
                 naturally on a crystal.
                2. One of numerous aspects, as of a subject.


             Vocabulary problem for faceted data
             Facet Design / selection
                "the keywords that are assigned by indexers are often at
                  odds with those tried by searchers.”
                Selection of information-distinguishing facet values
             User-specific faceted search
             Dynamic correlated facet generation
             Distributing facet computation
+
    Data Analytics

        Clickstream Data Analysis

        Learning from implicit user feedback

        Anonymous user clustering

        Learning to rank

        Inventory/Market Trends

        Rare Event detection

        Price Prediction

        Spam Content detection
+
    Data Analytics

        Challenges:
             “Good Deal” detection
             Recommendation Systems for Vertical Data with no explicit user
              feedback
             Accuracy of Automatic Valuation Models
             Data-driven feature design
             Click Prediction
             User Behavior Modeling
+
    Computational Advertising

        The central problem of computational advertising is to find
         the "best match" between a given user in a given context and a
         suitable advertisement.




    ads


                                                                          ads




                                         search results !
+
    Computational Advertising

        Vertical Search presents an additional challenge in the sense
         that any of the actual search results can be “sponsored”




                                                                   ad ?




                                                                   ad ?
+
    Computational Advertising

        Central challenge:
             Find the “best match” between a given user in a given context
              and a suitable advertisement
             “best match” – maximizing the value for :
                  Users
                  Advertisers
                  Publishers
             Each of the parties has different set of utilities:
                Users want relevance

                  Advertisers want ROI and volume
                  Publishers want revenue per impression/search
+
    Computational Advertising

        CTR (ClickThrough Rate Estimation):
             Reactive (statistically significant historical CTR)
             Predictive (CTR estimated from features of ads)
             Hybrid (historical + predictive)


             Personalization of CTR Computation ?
             Dynamic CTR Estimation (online algorithms)




                                  P(click) = ?
+
    Computational Advertising

        Analytical Aparatus:
             Regression Analysis (Linear, Logistic, probit model, High
              Dimensional methods)
             Game Theory (Nash Equilibria, dominant strategy)
             Auction Theory (Vickrey, GSP, VCG…)
             Graph Theory (random walks on graphs, graph matching, etc.)
             Information Retrieval Techniques (similarity metrics, etc.)
             …
+
    Conclusion

        Vertical Search & Analytics at Web Scale == fun !!!

        Source of large number of relevant research & engineering
         problems !

        Opportunity to tackle wide spectra of techniques across all
         areas of Computer Science and Engineering !




                                       Jump on the bandwagon ! : )

More Related Content

Similar to Engineering challenges in vertical search engines

SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITYSEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
Amit Sheth
 
Data Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! ResearchData Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! Research
Yury Lifshits
 
Building Predictive Analytics on Big Data Platforms
Building Predictive Analytics on Big Data PlatformsBuilding Predictive Analytics on Big Data Platforms
Building Predictive Analytics on Big Data Platforms
Olha Hrytsay
 
Semantic Web Technologies
Semantic Web TechnologiesSemantic Web Technologies
Semantic Web Technologies
KANIMOZHIUMA
 
Óscar Méndez - Big data: de la investigación científica a la gestión empresarial
Óscar Méndez - Big data: de la investigación científica a la gestión empresarialÓscar Méndez - Big data: de la investigación científica a la gestión empresarial
Óscar Méndez - Big data: de la investigación científica a la gestión empresarial
Fundación Ramón Areces
 
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachCoping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
Andre Freitas
 
Applications of Semantic Technology in the Real World Today
Applications of Semantic Technology in the Real World TodayApplications of Semantic Technology in the Real World Today
Applications of Semantic Technology in the Real World Today
Amit Sheth
 
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
Amazon Web Services
 
Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case study
deep.bi
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy Cabral
 
Building a Real-Time Geospatial-Aware Recommendation Engine
 Building a Real-Time Geospatial-Aware Recommendation Engine Building a Real-Time Geospatial-Aware Recommendation Engine
Building a Real-Time Geospatial-Aware Recommendation Engine
Amazon Web Services
 
Liquid Query: Multi-domain Exploratory Search on the Web
Liquid Query: Multi-domain Exploratory Search on the WebLiquid Query: Multi-domain Exploratory Search on the Web
Liquid Query: Multi-domain Exploratory Search on the Web
Alessandro Bozzon
 
Introduction to Artificial Intelligence on AWS
Introduction to Artificial Intelligence on AWSIntroduction to Artificial Intelligence on AWS
Introduction to Artificial Intelligence on AWS
Amazon Web Services
 
webmining overview
webmining overviewwebmining overview
webmining overview
abon
 
Data Science, Personalisation & Product management
Data Science, Personalisation & Product managementData Science, Personalisation & Product management
Data Science, Personalisation & Product management
Bhaskar Krishnan
 
Data-Mining-ppt (1).pdf
Data-Mining-ppt (1).pdfData-Mining-ppt (1).pdf
Data-Mining-ppt (1).pdf
Parvathyparu25
 
Big Data Explained - Case study: Website Analytics
Big Data Explained - Case study: Website AnalyticsBig Data Explained - Case study: Website Analytics
Big Data Explained - Case study: Website Analytics
deep.bi
 
Semantic Interoperability & Information Brokering in Global Information Systems
Semantic Interoperability & Information Brokering in Global Information SystemsSemantic Interoperability & Information Brokering in Global Information Systems
Semantic Interoperability & Information Brokering in Global Information Systems
Amit Sheth
 
SLA Nov2009 Public
SLA Nov2009 PublicSLA Nov2009 Public
SLA Nov2009 Public
aspoerri
 
Ranking in Google Since The Advent of The Knowledge Graph
Ranking in Google Since The Advent of The Knowledge GraphRanking in Google Since The Advent of The Knowledge Graph
Ranking in Google Since The Advent of The Knowledge Graph
Bill Slawski
 

Similar to Engineering challenges in vertical search engines (20)

SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITYSEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
SEMANTIC CONTENT MANAGEMENT FOR ENTERPRISES AND NATIONAL SECURITY
 
Data Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! ResearchData Cloud - Yury Lifshits - Yahoo! Research
Data Cloud - Yury Lifshits - Yahoo! Research
 
Building Predictive Analytics on Big Data Platforms
Building Predictive Analytics on Big Data PlatformsBuilding Predictive Analytics on Big Data Platforms
Building Predictive Analytics on Big Data Platforms
 
Semantic Web Technologies
Semantic Web TechnologiesSemantic Web Technologies
Semantic Web Technologies
 
Óscar Méndez - Big data: de la investigación científica a la gestión empresarial
Óscar Méndez - Big data: de la investigación científica a la gestión empresarialÓscar Méndez - Big data: de la investigación científica a la gestión empresarial
Óscar Méndez - Big data: de la investigación científica a la gestión empresarial
 
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachCoping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
 
Applications of Semantic Technology in the Real World Today
Applications of Semantic Technology in the Real World TodayApplications of Semantic Technology in the Real World Today
Applications of Semantic Technology in the Real World Today
 
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
AWS re:Invent 2016: Leveraging Amazon Machine Learning, Amazon Redshift, and ...
 
Real-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case studyReal-time big data analytics based on product recommendations case study
Real-time big data analytics based on product recommendations case study
 
Jeremy cabral search marketing summit - scraping data-driven content (1)
Jeremy cabral   search marketing summit - scraping data-driven content (1)Jeremy cabral   search marketing summit - scraping data-driven content (1)
Jeremy cabral search marketing summit - scraping data-driven content (1)
 
Building a Real-Time Geospatial-Aware Recommendation Engine
 Building a Real-Time Geospatial-Aware Recommendation Engine Building a Real-Time Geospatial-Aware Recommendation Engine
Building a Real-Time Geospatial-Aware Recommendation Engine
 
Liquid Query: Multi-domain Exploratory Search on the Web
Liquid Query: Multi-domain Exploratory Search on the WebLiquid Query: Multi-domain Exploratory Search on the Web
Liquid Query: Multi-domain Exploratory Search on the Web
 
Introduction to Artificial Intelligence on AWS
Introduction to Artificial Intelligence on AWSIntroduction to Artificial Intelligence on AWS
Introduction to Artificial Intelligence on AWS
 
webmining overview
webmining overviewwebmining overview
webmining overview
 
Data Science, Personalisation & Product management
Data Science, Personalisation & Product managementData Science, Personalisation & Product management
Data Science, Personalisation & Product management
 
Data-Mining-ppt (1).pdf
Data-Mining-ppt (1).pdfData-Mining-ppt (1).pdf
Data-Mining-ppt (1).pdf
 
Big Data Explained - Case study: Website Analytics
Big Data Explained - Case study: Website AnalyticsBig Data Explained - Case study: Website Analytics
Big Data Explained - Case study: Website Analytics
 
Semantic Interoperability & Information Brokering in Global Information Systems
Semantic Interoperability & Information Brokering in Global Information SystemsSemantic Interoperability & Information Brokering in Global Information Systems
Semantic Interoperability & Information Brokering in Global Information Systems
 
SLA Nov2009 Public
SLA Nov2009 PublicSLA Nov2009 Public
SLA Nov2009 Public
 
Ranking in Google Since The Advent of The Knowledge Graph
Ranking in Google Since The Advent of The Knowledge GraphRanking in Google Since The Advent of The Knowledge Graph
Ranking in Google Since The Advent of The Knowledge Graph
 

More from ITDogadjaji.com

Game Design 101
Game Design 101Game Design 101
Game Design 101
ITDogadjaji.com
 
Uvod u Gejmifikaciju
Uvod u GejmifikacijuUvod u Gejmifikaciju
Uvod u Gejmifikaciju
ITDogadjaji.com
 
Supporting clusters in Serbia
Supporting clusters in SerbiaSupporting clusters in Serbia
Supporting clusters in Serbia
ITDogadjaji.com
 
Outsourcing Center Serbia
Outsourcing Center SerbiaOutsourcing Center Serbia
Outsourcing Center Serbia
ITDogadjaji.com
 
ICT Clusters
ICT ClustersICT Clusters
ICT Clusters
ITDogadjaji.com
 
Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...
Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...
Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...
ITDogadjaji.com
 
How to Web 2011 Event Presentation
How to Web 2011 Event PresentationHow to Web 2011 Event Presentation
How to Web 2011 Event PresentationITDogadjaji.com
 
Panel intro: The European Startup: Opportunities
Panel intro: The European Startup: Opportunities Panel intro: The European Startup: Opportunities
Panel intro: The European Startup: Opportunities
ITDogadjaji.com
 
Mobipatrol
MobipatrolMobipatrol
Mobipatrol
ITDogadjaji.com
 
Mediatoolkit
MediatoolkitMediatoolkit
Mediatoolkit
ITDogadjaji.com
 
Taksiko
TaksikoTaksiko
SiteCake
SiteCakeSiteCake
SiteCake
ITDogadjaji.com
 
ShoutEm - It's alright to pivot
ShoutEm - It's alright to pivotShoutEm - It's alright to pivot
ShoutEm - It's alright to pivot
ITDogadjaji.com
 
How to (Win on the) Web
How to (Win on the) WebHow to (Win on the) Web
How to (Win on the) Web
ITDogadjaji.com
 
How to deal with the media without screwing up
How to deal with the media without screwing upHow to deal with the media without screwing up
How to deal with the media without screwing up
ITDogadjaji.com
 
VC 101: getting to first base
VC 101: getting to first baseVC 101: getting to first base
VC 101: getting to first base
ITDogadjaji.com
 
birthdaysRock.com
birthdaysRock.combirthdaysRock.com
birthdaysRock.com
ITDogadjaji.com
 
From Ljubljana into the world
From Ljubljana into the worldFrom Ljubljana into the world
From Ljubljana into the world
ITDogadjaji.com
 
How to Web 2010 - Event presentation
How to Web 2010 - Event presentationHow to Web 2010 - Event presentation
How to Web 2010 - Event presentation
ITDogadjaji.com
 
Ekspertlink
EkspertlinkEkspertlink
Ekspertlink
ITDogadjaji.com
 

More from ITDogadjaji.com (20)

Game Design 101
Game Design 101Game Design 101
Game Design 101
 
Uvod u Gejmifikaciju
Uvod u GejmifikacijuUvod u Gejmifikaciju
Uvod u Gejmifikaciju
 
Supporting clusters in Serbia
Supporting clusters in SerbiaSupporting clusters in Serbia
Supporting clusters in Serbia
 
Outsourcing Center Serbia
Outsourcing Center SerbiaOutsourcing Center Serbia
Outsourcing Center Serbia
 
ICT Clusters
ICT ClustersICT Clusters
ICT Clusters
 
Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...
Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...
Trends in Software Development: from Outsourcing to Crowdsourcing and Collabo...
 
How to Web 2011 Event Presentation
How to Web 2011 Event PresentationHow to Web 2011 Event Presentation
How to Web 2011 Event Presentation
 
Panel intro: The European Startup: Opportunities
Panel intro: The European Startup: Opportunities Panel intro: The European Startup: Opportunities
Panel intro: The European Startup: Opportunities
 
Mobipatrol
MobipatrolMobipatrol
Mobipatrol
 
Mediatoolkit
MediatoolkitMediatoolkit
Mediatoolkit
 
Taksiko
TaksikoTaksiko
Taksiko
 
SiteCake
SiteCakeSiteCake
SiteCake
 
ShoutEm - It's alright to pivot
ShoutEm - It's alright to pivotShoutEm - It's alright to pivot
ShoutEm - It's alright to pivot
 
How to (Win on the) Web
How to (Win on the) WebHow to (Win on the) Web
How to (Win on the) Web
 
How to deal with the media without screwing up
How to deal with the media without screwing upHow to deal with the media without screwing up
How to deal with the media without screwing up
 
VC 101: getting to first base
VC 101: getting to first baseVC 101: getting to first base
VC 101: getting to first base
 
birthdaysRock.com
birthdaysRock.combirthdaysRock.com
birthdaysRock.com
 
From Ljubljana into the world
From Ljubljana into the worldFrom Ljubljana into the world
From Ljubljana into the world
 
How to Web 2010 - Event presentation
How to Web 2010 - Event presentationHow to Web 2010 - Event presentation
How to Web 2010 - Event presentation
 
Ekspertlink
EkspertlinkEkspertlink
Ekspertlink
 

Recently uploaded

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 

Recently uploaded (20)

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 

Engineering challenges in vertical search engines

  • 1. + Engineering Challenges in Vertical Search Engines Aleksandar Bradic, Senior Director, Engineering and R&D, Vast.com
  • 2. + Introduction   Vertical Search   Search focused on vertical data   Vertical Data – data inherently described by it’s structure:   Items/Properties for sale (Automotive, Real Estate..)   Geographical Data (Neighborhoods, Locations..)   Services (Hotels, Transportation..)   Businesses (Restaurants, Nightlife..)   Events (Concerts, Plays..)   Auction items (Collectibles, Art..)   Metadata (News, Social Data, Reviews..)   …
  • 3. + Introduction   Vertical Search != Full Text Search   Full Text Search queries:   “Cheap tickets for Broadway shows this week”   “Trendy Restaurants in San Francisco near SoMa”   “3-day trips from NYC to anywhere under $1000”   Vertical Search queries:   “price-sorted results bellow two standard deviations from tickets category with Broadway as location and date range of 2010-04-11 to 2010-04-18”   “distance-sorted results relative to center of SF/SoMa matching the appropriate threshold of composite score of user review scores and historical change in query/review volume”   “total cost-sorted results for all 3-day intervals within next 6 months combining hotel and airfare price bellow max value of $1000 for all valid locations”
  • 4. + Introduction   Vertical Search = search on structured data   Vertical Search at Web-Scale:   Web-Scale datasets   Web-Scale query volumes   Interactive operation   Low latency requirements   Utility maximization across all involved parties   => loads of fun ! : )
  • 5. + @Vast.com   Vast.com : Vertical Search & Analytics Platform   Powering vertical search on Bing, Yahoo, AOL, KBB, Southwest Airlines, etc..
  • 6. + @Vast.com   Daily processing up to 1Tb of unstructured and semi- structured Web data   Managing ~150M records operational dataset across multiple verticals   Handling > 1000 query/sec peak search query loads   We’re hiring ! : )
  • 7. + Challenges in Vertical Search Engines   Web Data Retrieval   Unstructured Data   Data Processing Infrastructures   Vertical Search   Data Analytics   Computational Advertising
  • 8. + Web Data Retrieval   Crawler Architecture   Queue Management   Crawl Ordering Policies   Duplicate URL Detection   Content Hash Management   Politeness Management   Coverage Measurement   Freshness Optimization   Incremental Crawling
  • 9. + Web Data Retrieval   ”Deep Web” crawling   Locating Deep Web Content Sources   Selecting Relevant Sources   Estimating Database Size   Understanding Content / Form Detection   Automatic Dispatch of HTML Forms   Predicting content in free text forms   Crawling non-HTML Content   Estimating Query Result Sparsity   URL Generation problem   Query Covering Problem
  • 10. + Web Data Retrieval   Focused (Topical) Crawling   Content Classification   Link Content Prediction   Topic Relevance Estimation   Modeling Temporal Characteristics   Site-Level Evolution   Page-Level Evolution   Adversarial Crawling   Web Spam Detection   Cloaked Content Detection
  • 11. + Unstructured Data   Unstructured Data – information that does not have a pre- defined data model   Handling Unstructured Data:   Data Cleaning   Tagging with Metadata   Vertical Classification   Schema Matching   Information Extraction Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!! Ford Focus 2008 Convertible just $7000.. Absolute Beauty !!!! make model year trim price ???
  • 12. + Unstructured Data   Information extraction from unstructured, ungrammatical data   Reference Sets - relational data sets that consist of collection of known entities with associated common attributes   Reference Set Selection   Reference Set Generation   Record Linkage : Finding “best matching” member of reference set corresponding post   Challenge : Automatic Generation of Reference Sets
  • 13. + Data Processing Infrastructures   Infrastructures for continuous processing of unbounded streams of unstructured data   Information Extraction as part of processing (non-trivial computation per each processed entry)   Inherently distributed infrastructures - in order to support performance and scalability   Time-to-site constraints. Ability to process out-of band data.   Support for complex operations on aggregated data (de- duplication, static ranking, data enrichment, data cleaning/ filtering …)   Support for data archival and off-line analysis
  • 14. + Data Processing Infrastructures
  • 15. + Data Processing Infrastructures   Distributed Computing Platforms:   Batch-oriented (MapReduce, Hadoop, BigTable, HBase…)   Stream-oriented (Flume, S4, Stream SQL…)   Distributed Data Stores (Dynamo/Cassandra/Riak…)   The curse of CAP Theorem:   It is impossible for a distributed system to simultaneously provide all three of the following guarantees:   Consistency   Availability   Partition tolerance
  • 16. + Vertical Search   Large-Scale structured data search   Providing both analytic and canonical set of Information Retrieval functionalities   Entries are represented in Vector Space Model   Each result is represented as data point – tuple consisting of appropriate number of fields : (make, model, year, trim …)
  • 17. + Vertical Search   Search in Vector Space Model   Resulting subset generation   Sorting as linearization using selected metric   Dynamic subset criteria calculation   Search Result Clustering   “Similar” result search   … … with up to ~100 ms milliseconds response time … at 10M+ records in index … handling 100+ queries/sec/host
  • 18. + Vertical Search   Faceted Search   fac-et (fas’it) :   1. One of the flat polished surfaces cut on a gemstone or occurring naturally on a crystal.   2. One of numerous aspects, as of a subject.   Vocabulary problem for faceted data   Facet Design / selection   "the keywords that are assigned by indexers are often at odds with those tried by searchers.”   Selection of information-distinguishing facet values   User-specific faceted search   Dynamic correlated facet generation   Distributing facet computation
  • 19. + Data Analytics   Clickstream Data Analysis   Learning from implicit user feedback   Anonymous user clustering   Learning to rank   Inventory/Market Trends   Rare Event detection   Price Prediction   Spam Content detection
  • 20. + Data Analytics   Challenges:   “Good Deal” detection   Recommendation Systems for Vertical Data with no explicit user feedback   Accuracy of Automatic Valuation Models   Data-driven feature design   Click Prediction   User Behavior Modeling
  • 21. + Computational Advertising   The central problem of computational advertising is to find the "best match" between a given user in a given context and a suitable advertisement. ads ads search results !
  • 22. + Computational Advertising   Vertical Search presents an additional challenge in the sense that any of the actual search results can be “sponsored” ad ? ad ?
  • 23. + Computational Advertising   Central challenge:   Find the “best match” between a given user in a given context and a suitable advertisement   “best match” – maximizing the value for :   Users   Advertisers   Publishers   Each of the parties has different set of utilities:   Users want relevance   Advertisers want ROI and volume   Publishers want revenue per impression/search
  • 24. + Computational Advertising   CTR (ClickThrough Rate Estimation):   Reactive (statistically significant historical CTR)   Predictive (CTR estimated from features of ads)   Hybrid (historical + predictive)   Personalization of CTR Computation ?   Dynamic CTR Estimation (online algorithms) P(click) = ?
  • 25. + Computational Advertising   Analytical Aparatus:   Regression Analysis (Linear, Logistic, probit model, High Dimensional methods)   Game Theory (Nash Equilibria, dominant strategy)   Auction Theory (Vickrey, GSP, VCG…)   Graph Theory (random walks on graphs, graph matching, etc.)   Information Retrieval Techniques (similarity metrics, etc.)   …
  • 26. + Conclusion   Vertical Search & Analytics at Web Scale == fun !!!   Source of large number of relevant research & engineering problems !   Opportunity to tackle wide spectra of techniques across all areas of Computer Science and Engineering ! Jump on the bandwagon ! : )