Hadoop use case: A scalable
vertical search engine	
Iván de Prado Alonso, Datasalt Co-founder	
Twitter: @ivanprado
Content	

§  The problem	
§  The obvious solution	
§  When the obvious solution fails…	
§  … Hadoop comes to the rescue	
§  Advantages & disadvantages	
§  Improvements
What is a vertical search engine?
[Diagram: Provider 1 and Provider 2 each send a feed to the Vertical Search Engine, which receives the searches.]
Some of them
The “obvious” architecture
The first thing that comes to your mind

[Diagram: Feed → Download & Process → (Does it exist? Has it changed? Insert/update) → Database; Download & Process → (Insert/update) → Lucene/Solr Index → Search Page]
How it works
                               	
§  Feed download	
§  For every record in the feed
   •  Check for existence in the DB	
   •  If it exists and has changed, update	
      ª The DB	
      ª The Index	
   •  If it doesn’t exist, insert into	
      ª The DB	
      ª The Index
How it works (II)
                              	
§  The Database is used for	
   •  Checking whether a record already exists (avoiding
      duplicates)
   •  Managing the data conveniently with SQL
§  Lucene/Solr is used for	
   •    Quick searches	
   •    Searching by structured fields	
   •    Free-text searches	
   •    Faceting
But if things go well…


[Figure: dozens of feeds arriving at the same time]
Huge jam!
“Swiss army knife of the 21st century”
Media Guardian Innovation Awards
http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
Hadoop	
“The Apache Hadoop software library is a framework that allows for
the distributed processing of large data sets across clusters of
computers using a simple programming model”
From the Hadoop homepage
File System	

§  Distributed File System (HDFS)	
  •  Cluster of nodes exposing their storage
     capacity	
  •  Big blocks: 64 MB
  •  Fault tolerant (replication)	
  •  Big files storage
MapReduce	
§  Two functions (Map and Reduce)
   •  Map(k, v) : [z,w]*	
   •  Reduce(z, w*) : [u, v]*	
§  Example: word count	
   •  Map([document, null]) -> [word, 1]*	
   •  Reduce(word, 1*) -> [word, total]	
§  MapReduce & SQL	
   •  SELECT word, count(*) GROUP BY word	
§  Distributed execution on a cluster	
§  Horizontal scalability
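
The word count example can be simulated on a single machine. This is a minimal sketch in plain Python, not the Hadoop API: `run_mapreduce` is a hypothetical helper that imitates the shuffle by grouping the mapper output by key before calling the reducer.

```python
from collections import defaultdict


def map_fn(document, _value):
    """Map([document, null]) -> [word, 1]*"""
    for word in document.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce(word, 1*) -> [word, total]"""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run mapper, group by key (the shuffle), then run reducer per key."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


word_counts = run_mapreduce(
    [("to be or not to be", None), ("be quick", None)], map_fn, reduce_fn
)
print(word_counts)
```

On a cluster the map calls and the reduce calls run in parallel on different nodes; only the grouping step needs data to move between them.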
OK, that’s cool, but… how
does it solve my problem?
Because…	

§  Hadoop is not a Database	
§  Hadoop “apparently” only
    processes data	
§  Hadoop does not allow “lookups”	

     Hadoop is a paradigm shift that is difficult to assimilate
Architecture
Philosophy	
§  Always reprocess everything. EVERYTHING!
§  Why?
     •  More bug tolerant
     •  More flexible
     •  More efficient. E.g.:
       ª    With a 7200 RPM HD
               –  Random IOPS: 100
               –  Sequential read/write: 40 MB/s
               –  Hypothesis: 5 KB record size
       ª    … it is faster to rewrite all the data than to perform random updates when
             more than 1.25% of the records have changed.
               –  1 GB, 200,000 records
                     »  Sequential writing: 25 s
                     »  Random writing: 33 min!
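
The figures above can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 10^9 bytes, 5 KB = 5,000 bytes) and one random I/O per updated record; the variable names are illustrative.

```python
RANDOM_IOPS = 100                # random operations per second (7200 RPM disk)
SEQ_BYTES_PER_S = 40 * 10**6     # sequential throughput: 40 MB/s
RECORD_BYTES = 5 * 10**3         # hypothesis: 5 KB per record
DATASET_BYTES = 10**9            # 1 GB of data

records = DATASET_BYTES // RECORD_BYTES                 # number of records
sequential_seconds = DATASET_BYTES / SEQ_BYTES_PER_S    # rewrite everything
random_seconds = records / RANDOM_IOPS                  # update every record in place

# Number of random updates that take as long as one full sequential rewrite,
# expressed as a fraction of all records.
break_even_fraction = (sequential_seconds * RANDOM_IOPS) / records

print(records, sequential_seconds, random_seconds / 60, break_even_fraction)
```

In the time a full rewrite takes, the disk can serve only `sequential_seconds * RANDOM_IOPS` random updates, so beyond that share of changed records rewriting everything wins.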
Fetcher
                                   	
    Feeds are downloaded and stored in the HDFS.

§  MapReduce
   •  Input: [feed_url, null]*
   •  Mapper: identity
   •  Reducer(feed_url, null*)
       ª  Download the feed_url and store it
          in a HDFS folder

[Diagram: several Reducer Tasks writing the downloaded feeds into HDFS]
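
A fetcher reducer can be sketched as below. This is plain Python rather than Hadoop code, and every name in it is hypothetical: `download` and `store` are injected callables standing in for an HTTP client and an HDFS writer, and the `/feeds/` path layout is an assumption.

```python
import hashlib


def fetch_reducer(feed_url, _values, download, store):
    """Reducer(feed_url, null*): download one feed and store it.

    `download(url) -> bytes` and `store(path, content)` are supplied by the
    caller; here they stand in for an HTTP client and an HDFS writer.
    """
    content = download(feed_url)
    # Derive a stable file name from the URL (illustrative layout).
    path = "/feeds/" + hashlib.sha1(feed_url.encode("utf-8")).hexdigest()
    store(path, content)
    return path


# In-memory stand-ins so the sketch runs without a network or a cluster.
fake_web = {"http://provider1.example/feed.xml": b"<feed>1</feed>"}
fake_hdfs = {}
stored_path = fetch_reducer(
    "http://provider1.example/feed.xml",
    [None],
    fake_web.__getitem__,
    fake_hdfs.__setitem__,
)
print(stored_path, fake_hdfs[stored_path])
```

Because the reduce phase groups by key, a URL that appears several times in the input is downloaded only once, and the number of reducer tasks sets how many downloads run in parallel.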
Processor
                          	
    Feeds are parsed, converted into documents and deduplicated
§  MapReduce
  •  Input: [feed_path, null]*
  •  Map(feed_path, null) : [id, document]*
     ª The feed is parsed and converted into documents	
  •  Reducer(id, [document]*): [id, document]	
     ª Receives a list of documents and keeps the most
        recent one (deduplication)	
     ª  A unique and global identifier is required
        (idProvider + idInternal)	
  •  Output: [id, document]*
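
The deduplication step can be sketched like this. It is a single-machine illustration, not Hadoop code; the `updated` timestamp field, the `:` separator in the identifier and the function names are assumptions.

```python
from collections import defaultdict


def make_id(provider_id, internal_id):
    """Global identifier: idProvider + idInternal."""
    return f"{provider_id}:{internal_id}"


def dedup_reducer(doc_id, documents):
    """Reducer(id, [document]*): keep only the most recent document."""
    return doc_id, max(documents, key=lambda d: d["updated"])


def process(parsed_documents):
    """Group parsed documents by global id, then deduplicate each group."""
    groups = defaultdict(list)
    for doc in parsed_documents:
        groups[make_id(doc["provider"], doc["internal_id"])].append(doc)
    return dict(dedup_reducer(k, v) for k, v in groups.items())


docs = [
    {"provider": "p1", "internal_id": "42", "updated": 1, "price": 100},
    {"provider": "p1", "internal_id": "42", "updated": 2, "price": 90},
    {"provider": "p2", "internal_id": "42", "updated": 1, "price": 70},
]
unique = process(docs)
print(unique)
```

The provider prefix matters: two providers may reuse the same internal identifier, and without it their documents would be merged by mistake.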
Processor (II)
                              	

§  Possible problem:	
   •  Very large feeds	
      ª Does not scale, as one task will deal with the
        full feed. 	
§  Solution	
   •  Write a custom InputFormat that divides
      the feed into smaller pieces.
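
The idea behind such a splitting input format can be illustrated in plain Python. This is not the Hadoop InputFormat API (which is Java); it assumes a feed with one record per line, and real feeds in other formats would need their own record-boundary rule. Each split reads only the records that begin inside its byte range, so every record is processed exactly once.

```python
def compute_splits(total_size, split_size):
    """Cut [0, total_size) into consecutive byte ranges."""
    return [
        (start, min(start + split_size, total_size))
        for start in range(0, total_size, split_size)
    ]


def read_split(data, start, end):
    """Yield the newline-delimited records that begin inside [start, end).

    A record that starts before `start` belongs to the previous split, and a
    record that starts before `end` is read in full even if it runs past it.
    """
    pos = start
    if start > 0:
        # Move to the first record boundary at or after `start`.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return
        pos = newline + 1
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        yield data[pos:newline]
        pos = newline + 1


feed = b"a\nbb\nccc\n"
all_records = [
    record
    for start, end in compute_splits(len(feed), 4)
    for record in read_split(feed, start, end)
]
print(all_records)
```

With this rule one large feed can be handled by many map tasks instead of one.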
Serialization	

§  Writables	
   •  Native Hadoop Serialization	
   •  Low level API	
   •  Basic types: IntWritable, Text, etc.	
§  Others	
   •  Thrift, Avro, Protostuff	
   •  Backwards compatibility
Indexer
                             	
[Diagram: each Reducer Task builds one index shard (Shard 1, Shard 2, Shard 3). Each shard is hot-swapped into the matching shard of the production Solr, which the web servers query.]
Indexer (II)
                                    	
§  SOLR-1301	
   •    https://issues.apache.org/jira/browse/SOLR-1301	
   •    SolrOutputFormat	
   •    1 index per reducer	
   •    A custom Partitioner can be used to control where to
        place each document	
§  Another option	
   •  Writing your own indexing code
         ª  By creating a custom output format
         ª  By indexing at the reducer level. In each reduce call:
             –  Open an index
             –  Write all incoming records
             –  Close the index
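
How a partitioner decides which reducer, and therefore which index shard, receives a document can be sketched as follows. This is a plain Python illustration, not Hadoop's Partitioner class or the SolrOutputFormat; `shard_for`, `build_shards` and the document fields are hypothetical names.

```python
import zlib


def shard_for(doc, num_shards, by_field=None):
    """Choose the shard for a document.

    With by_field=None the document id is hashed, which spreads documents
    evenly over all shards. With by_field set (for example "country"), all
    documents sharing that value land in the same shard.
    """
    key = doc[by_field] if by_field else doc["id"]
    # crc32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % num_shards


def build_shards(documents, num_shards, by_field=None):
    """Simulate 'one index per reducer': one document list per shard."""
    shards = [[] for _ in range(num_shards)]
    for doc in documents:
        shards[shard_for(doc, num_shards, by_field)].append(doc)
    return shards


documents = [
    {"id": "p1:1", "country": "es"},
    {"id": "p1:2", "country": "es"},
    {"id": "p2:1", "country": "uk"},
    {"id": "p2:2", "country": "fr"},
]
by_id = build_shards(documents, 3)
by_country = build_shards(documents, 3, by_field="country")
print([len(s) for s in by_id], [len(s) for s in by_country])
```

Hashing a field value can still place two different values in the same shard; an explicit value-to-shard table would avoid that.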
Search & Partitioning	
§  Different partitioning schemes
   •  Horizontal	
      ª Each search involves all shards	
   •  Vertical: by ad type, country, etc.	
      ª Searches can be restricted to the involved shard	

§  Solr for index serving. Possibilities:	
      ª Non federated Solr	
          –  Only for vertical partitioning	
      ª Distributed Solr	
      ª Solr Cloud
Reconciliation	

[Diagram: From Fetcher → Reconciliation → reconciled documents → Next steps; Reconciliation also reads the file from the last execution]


§  How to record changes?
    •    Changes in price, features, etc.	
    •    MapReduce:	
           ª    Input: [id, document]*	
                     –  From last execution	
                     –  From current processing	
           ª    Map: identity	
           ª    Reduce(id, [document]*) : [id, document]	
                     –    Documents grouped by ID. New and old documents come together.	
                     –    New and old documents are compared.	
                     –    The relevant information is stored in the new document (e.g., the old price)
                     –    Only the new document is emitted.
§  This is the closest thing in Hadoop to a DB
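
The reconciliation reduce step can be sketched as below. It is a single-machine illustration with hypothetical names: the `price` and `old_price` fields are examples, and dropping documents that no longer appear in the current run is an assumption drawn from "only the new document is emitted".

```python
from collections import defaultdict


def reconcile(previous, current):
    """Join the last execution with the current run by document id.

    `previous` and `current` are iterables of (id, document) pairs.
    """
    groups = defaultdict(dict)
    for doc_id, doc in previous:
        groups[doc_id]["old"] = doc
    for doc_id, doc in current:
        groups[doc_id]["new"] = doc

    output = {}
    for doc_id, pair in groups.items():
        new = pair.get("new")
        if new is None:
            continue  # assumption: no new version, so nothing is emitted
        old = pair.get("old")
        merged = dict(new)
        if old is not None and old["price"] != new["price"]:
            merged["old_price"] = old["price"]  # keep the relevant history
        output[doc_id] = merged
    return output


last_run = [("p1:1", {"price": 100}), ("p1:2", {"price": 50})]
this_run = [("p1:1", {"price": 90}), ("p1:3", {"price": 10})]
reconciled = reconcile(last_run, this_run)
print(reconciled)
```

The output of one run becomes the "last execution" input of the next one, which is what lets a batch pipeline carry state without lookups.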
Advantages of the architecture	
§  Horizontal Scalability	
   •  If properly programmed	
§  High tolerance to failures and bugs	
   •  Everything is always reprocessed
§  Flexible
   •  It is easy to make big changes
§  High decoupling
   •  Indexes are the only point of interaction between
      the back-end and the front-end
   •  Web servers can keep running even if the
      back-end is broken.
Disadvantages
                        	

§  Batch processing	
  •  No real-time or “near” real-time	
  •  Update cycles of hours	
§  Completely different programming
    paradigm	
  •  Steep learning curve
Improvements
                           	
§  System for images	
§  Fuzzy duplicates detection	
§  Plasam:	
   •  Mixing this architecture with a bypass system
      that provides near real-time updates to the FE
      indexes
      ª  Implementing a bypass to the Solr instances
      ª  System for ensuring data consistency
          –  Without jumping back in time
   •  That combines the advantages of the proposed
      architecture with near real-time updates
   •  Datasalt has a prototype ready
Thanks!	
Ivan de Prado, 	
ivan@datasalt.com	
@ivanprado

Scalable vertical search engine with Hadoop

  • 1. Hadoop use case: A scalable vertical search engine Iván de Prado Alonso, Datasalt Co-founder Twitter: @ivanprado
  • 2. Content §  The problem §  The obvious solution §  When the obvious solution fails… §  … Hadoop comes to the rescue §  Advantages & disadvantages §  Improvements
  • 3. What is a vertical search engine? [Diagram: Provider 1 and Provider 2 each send a feed to the vertical search engine, which serves searches.]
  • 5. The “obvious” architecture: the first thing that comes to your mind. [Diagram: a feed goes into Download & Process, which asks the Database "Does it exist? Has it changed?", performs inserts/updates on the Database and on the Lucene/Solr index, and the index serves the search page.]
  • 6. How it works §  Feed download §  For every record in the feed •  Check for existence in the DB •  If it exists and has changed, update ª The DB ª The Index •  If it doesn’t exist, insert into ª The DB ª The Index
  • 7. How it works (II) §  The Database is used for •  Checking for record existence (avoiding duplicates) •  Managing the data with the convenience of SQL §  Lucene/Solr is used for •  Quick searches •  Searching by structured fields •  Free-text searches •  Faceting
  • 8. But if things go well… [Diagram: a large number of feeds.]
  • 10. “Swiss army knife of the 21st century” Media Guardian Innovation Awards http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
  • 11. Hadoop “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model” From Hadoop homepage
  • 12. File System §  Distributed File System (HDFS) •  Cluster of nodes exposing their storage capacity •  Big blocks: 64 MB •  Fault tolerant (replication) •  Storage of big files
  • 13. MapReduce §  Two functions (Map and Reduce) •  Map(k, v) : [z, w]* •  Reduce(z, w*) : [u, v]* §  Example: word count •  Map([document, null]) -> [word, 1]* •  Reduce(word, 1*) -> [word, total] §  MapReduce & SQL •  SELECT word, count(*) GROUP BY word §  Distributed execution on a cluster §  Horizontal scalability
  • 14. Ok, that’s cool, but… How does it solve my problem?
  • 15. Because… §  Hadoop is not a Database §  Hadoop “apparently” only processes data §  Hadoop does not allow “lookups” Hadoop is a paradigm shift that is difficult to assimilate
  • 17. Philosophy §  Always reprocess everything. EVERYTHING! §  Why? •  More bug tolerant •  More flexible •  More efficient. E.g.: ª  With a 7200 RPM HD –  Random IOPS: 100 –  Sequential read/write: 40 MB/s –  Hypothesis: 5 KB record size ª  … it is faster to rewrite all the data than to perform random updates when more than 1.25% of the records have changed. –  1 GB, 200,000 records »  Sequential writing: 25 s »  Random writing: 33 min!
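The word count example on the MapReduce slide can be sketched in plain Python. This is a single-process illustration of the Map and Reduce signatures, not Hadoop code: `run_mapreduce` is a hypothetical helper that stands in for the shuffle phase by grouping mapper output by key.

```python
from collections import defaultdict


def map_fn(document, _value):
    """Map(document, null) -> [word, 1]*"""
    for word in document.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce(word, 1*) -> [word, total]"""
    yield word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-process stand-in for the shuffle: group by key."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


documents = [("to be or not to be", None), ("be quick", None)]
word_counts = dict(run_mapreduce(documents, map_fn, reduce_fn))
```

On a real cluster the grouping step is what the framework distributes: each reducer receives all values for a subset of the keys, which is why the job scales horizontally.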
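The figures on the Philosophy slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 1000 MB, 1 MB = 1000 KB) and one random I/O per updated record; both are assumptions made to reproduce the slide's numbers, not statements from the talk.

```python
RANDOM_IOPS = 100        # random operations per second (from the slide)
SEQ_MB_PER_S = 40        # sequential throughput in MB/s (from the slide)
RECORD_KB = 5            # hypothesised record size (from the slide)
NUM_RECORDS = 200_000    # records in the data set (from the slide)

# Assumption: decimal units, so 200,000 records of 5 KB are 1000 MB (1 GB).
dataset_mb = NUM_RECORDS * RECORD_KB / 1000

# Rewriting everything sequentially.
full_rewrite_s = dataset_mb / SEQ_MB_PER_S

# Assumption: updating one record costs exactly one random I/O.
random_update_all_s = NUM_RECORDS / RANDOM_IOPS

# Number of random updates that fit in the time of one full rewrite.
break_even_records = full_rewrite_s * RANDOM_IOPS
break_even_fraction = break_even_records / NUM_RECORDS

print(f"full rewrite: {full_rewrite_s:.0f} s")
print(f"random update of all records: {random_update_all_s / 60:.0f} min")
print(f"break-even: {break_even_fraction:.2%} of records")
```

Under these assumptions a full rewrite takes 25 s, updating every record at random takes about 33 minutes, and the two strategies break even at 2,500 changed records, which is 1.25% of the data set.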
  • 18. Fetcher Feeds are downloaded and stored in HDFS. §  MapReduce •  Input: [feed_url, null]* •  Mapper: identity •  Reducer(feed_url, null*) ª  Download the feed_url and store it in an HDFS folder [Diagram: several reducer tasks writing to HDFS.]
  • 19. Processor Feeds are parsed, converted into documents and deduplicated §  MapReduce •  Input: [feed_path, null]* •  Map(feed_path, null) : [id, document]* ª The feed is parsed and converted into documents •  Reducer(id, [document]*): [id, document] ª Receives a list of documents and keeps the most recent one (deduplication) ª  A unique and global identifier is required (idProvider + idInternal) •  Output: [id, document]*
  • 20. Processor (II) §  Possible problem: •  Very large feeds ª Does not scale, as one task will deal with the full feed. §  Solution •  Write a custom InputFormat that divides the feed into smaller pieces.
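The deduplication step of the Processor can be illustrated with a small Python sketch. It is not Hadoop code: the grouping loop stands in for the shuffle, and the field names `id_provider`, `id_internal` and `fetched_at` are hypothetical, chosen only to show a global identifier and a way of deciding which document is the most recent.

```python
from collections import defaultdict


def global_id(doc):
    """Global identifier: idProvider + idInternal (field names are assumed)."""
    return f"{doc['id_provider']}:{doc['id_internal']}"


def map_feed(feed_path, parsed_docs):
    """Map(feed_path, null) -> [id, document]*; parsing is assumed done."""
    for doc in parsed_docs:
        yield global_id(doc), doc


def reduce_dedup(doc_id, documents):
    """Reduce(id, [document]*) -> [id, document]: keep the most recent one."""
    return doc_id, max(documents, key=lambda d: d["fetched_at"])


feed_a = [{"id_provider": "p1", "id_internal": "10", "fetched_at": 1, "price": 100}]
feed_b = [
    {"id_provider": "p1", "id_internal": "10", "fetched_at": 2, "price": 90},
    {"id_provider": "p2", "id_internal": "10", "fetched_at": 1, "price": 75},
]

# Stand-in for the shuffle: group mapper output by global id.
groups = defaultdict(list)
for path, docs in [("feeds/a", feed_a), ("feeds/b", feed_b)]:
    for doc_id, doc in map_feed(path, docs):
        groups[doc_id].append(doc)

deduplicated = dict(reduce_dedup(doc_id, docs) for doc_id, docs in groups.items())
```

Combining the provider id with the provider's internal id keeps two providers that reuse the same internal id from colliding, while the same item seen twice collapses to its latest version.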
  • 21. Serialization §  Writables •  Native Hadoop Serialization •  Low level API •  Basic types: IntWritable, Text, etc. §  Others •  Thrift, Avro, Protostuff •  Backwards compatibility
  • 22. Indexer [Diagram: each reducer task builds one index shard (shards 1 to 3); each shard is hot-swapped into the production Solr behind the web servers.]
  • 23. Indexer (II) §  SOLR-1301 •  https://issues.apache.org/jira/browse/SOLR-1301 •  SolrOutputFormat •  1 index per reducer •  A custom Partitioner can be used to control where to place each document §  Another option •  Writing your own indexing code ª  By creating a custom output format ª  By indexing at the reducer level. In each reduce call: –  Open an index –  Write all incoming records –  Close the index
  • 24. Search & Partitioning §  Different partitioning schemes •  Horizontal ª Each search involves all shards •  Vertical: by ad type, country, etc. ª Searches can be restricted to the involved shard §  Solr for index serving. Possibilities: ª Non-federated Solr –  Only for vertical partitioning ª Distributed Solr ª SolrCloud
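The difference between the two partitioning schemes can be sketched as two shard-assignment functions. This is an illustration only: the country-to-shard table and the function names are hypothetical, and a real Hadoop job would implement the same idea in a custom Partitioner.

```python
import zlib

# Hypothetical mapping for vertical partitioning by country.
COUNTRY_TO_SHARD = {"es": 0, "fr": 1, "it": 2}


def horizontal_partition(doc_id, num_shards):
    """Spread documents evenly; any document can land on any shard."""
    # crc32 is used because it is stable across processes.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def vertical_partition(document):
    """Place each document on the shard that owns its country."""
    return COUNTRY_TO_SHARD[document["country"]]


def shards_to_query(scheme, num_shards, country=None):
    """Which shards a search has to touch under each scheme."""
    if scheme == "vertical" and country is not None:
        return [COUNTRY_TO_SHARD[country]]
    return list(range(num_shards))


all_shards = shards_to_query("horizontal", 3)
one_shard = shards_to_query("vertical", 3, country="fr")
```

With horizontal partitioning every search fans out to all shards, so the serving layer needs distributed search; with vertical partitioning a search filtered by country only touches one shard, which is why a non-federated Solr is enough in that case.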
  • 25. Reconciliation [Diagram: the output from the Fetcher and the file from the last execution feed the Reconciliation step, which passes reconciled documents on to the next steps.] §  How to register changes? •  Changes in price, features, etc. •  MapReduce: ª  Input: [id, document]* –  From last execution –  From current processing ª  Map: identity ª  Reduce(id, [document]*) : [id, document] –  Documents grouped by ID. New and old documents come together. –  New and old documents are compared. –  The relevant information is stored in the new document (e.g., the old price) –  Only the new document is emitted. §  This is the closest thing in Hadoop to a DB
  • 26. Advantages of the architecture §  Horizontal Scalability •  If properly programmed §  High tolerance to failures and bugs •  Everything is always reprocessed §  Flexible •  It is easy to make big changes §  High decoupling •  Indexes are the only interaction between the back-end and the front-end •  Web servers can keep running even if the back-end is broken.
  • 27. Disadvantages §  Batch processing •  No real-time or “near” real-time •  Update cycles of hours §  Completely different programming paradigm •  Steep learning curve
  • 28. Improvements §  System for images §  Fuzzy duplicate detection §  Plasam: •  Mixing this architecture with a bypass system that provides near real-time updates to the front-end indexes ª  Implementing a bypass to the Solr instances ª  System for ensuring data consistency –  Without jumps back in time •  That combines the advantages of the proposed architecture but with near real time •  Datasalt has a prototype ready
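The reconciliation reducer can be sketched as follows. The sketch assumes each document arrives tagged with its origin (`"previous"` for the last execution, `"current"` for the current processing) and carries a `price` field; the tags, the field name and the `previous_price` key are hypothetical choices for illustration.

```python
def reduce_reconcile(doc_id, tagged_documents):
    """Reduce(id, [document]*) -> [id, document]: at most one output per id."""
    old = new = None
    for source, doc in tagged_documents:
        if source == "previous":
            old = doc
        else:
            new = doc
    if new is None:
        return  # not present in the current run: nothing is emitted
    reconciled = dict(new)
    if old is not None and old["price"] != new["price"]:
        # Keep the relevant piece of history inside the new document.
        reconciled["previous_price"] = old["price"]
    yield doc_id, reconciled


# Documents already grouped by id, as the shuffle would deliver them.
grouped = {
    "p1:10": [("previous", {"price": 100}), ("current", {"price": 90})],
    "p1:11": [("current", {"price": 50})],
    "p1:12": [("previous", {"price": 70})],
}

reconciled = dict(
    pair
    for doc_id, docs in grouped.items()
    for pair in reduce_reconcile(doc_id, docs)
)
```

Because old and new versions of a document share the same id, the shuffle brings them to the same reduce call, which plays the role that a lookup would play in a database.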
  • 29. Thanks! Ivan de Prado, ivan@datasalt.com @ivanprado