Scalable vertical search engine with Hadoop




  1. Hadoop use case: a scalable vertical search engine. Iván de Prado Alonso, Datasalt co-founder. Twitter: @ivanprado
  2. Content
     § The problem
     § The obvious solution
     § When the obvious solution fails…
     § … Hadoop comes to the rescue
     § Advantages & disadvantages
     § Improvements
  3. What is a vertical search engine?
     [Diagram: Provider 1 and Provider 2 send feeds to the vertical search engine, which serves searches.]
  4. Some of them
  5. The “obvious” architecture: the first thing that comes to your mind
     [Diagram: the feed is downloaded and processed; for each register, “Does it exist? Has it changed?” is checked against the database; inserts/updates go to both the database and the Lucene/Solr index; the search page queries the index.]
  6. How it works
     § Feed download
     § For every register in the feed:
       • Check for existence in the DB
       • If it exists and has changed, update:
         - The DB
         - The index
       • If it doesn’t exist, insert into:
         - The DB
         - The index
  7. How it works (II)
     § The database is used for:
       • Checking for register existence (avoiding duplicates)
       • Managing the data with the convenience of SQL
     § Lucene/Solr is used for:
       • Quick searches
       • Searching by structured fields
       • Free-text searches
       • Faceting
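To make the later scaling discussion concrete, here is a toy model of that flow; the maps standing in for the database and the index, and the `Register` type, are hypothetical stand-ins, not the talk's code. The point is the shape of the work: one existence lookup plus one conditional upsert per register, for every register of every feed, on every run.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the "obvious" architecture (requires Java 16+ for records):
 *  per-register lookups and upserts against both the DB and the index. */
public class ObviousArchitecture {
  record Register(String id, String content) {}

  static final Map<String, Register> db = new HashMap<>();    // stands in for the database
  static final Map<String, Register> index = new HashMap<>(); // stands in for Lucene/Solr

  static void processFeed(List<Register> feed) {
    for (Register r : feed) {
      Register existing = db.get(r.id());                // existence check in the DB
      if (existing == null || !existing.content().equals(r.content())) {
        db.put(r.id(), r);                               // insert/update the DB
        index.put(r.id(), r);                            // insert/update the index
      }
    }
  }
}
```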
  8. But if things go well…
     [Diagram: the handful of feeds turns into dozens and dozens of feeds.]
  9. Huge jam!
  10. “Swiss army knife of the 21st century” (Media Guardian Innovation Awards)
  11. Hadoop
      “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model” (from the Hadoop homepage)
  12. File System
      § Distributed File System (HDFS)
        • Cluster of nodes exposing their storage capacity
        • Big blocks: 64 MB
        • Fault tolerant (replication)
        • Storage of big files
  13. MapReduce
      § Two functions (Map and Reduce):
        • Map(k, v) : [z, w]*
        • Reduce(z, w*) : [u, v]*
      § Example: word count
        • Map([document, null]) -> [word, 1]*
        • Reduce(word, 1*) -> [word, total]
      § MapReduce & SQL
        • SELECT word, count(*) … GROUP BY word
      § Distributed execution on a cluster
      § Horizontal scalability
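As a concrete reference, this is the classic word count written with Hadoop's Java MapReduce API, matching the Map/Reduce signatures on the slide:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map(k, v) : [word, 1]*
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(line.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE); // emit [word, 1]
      }
    }
  }

  // Reduce(word, 1*) : [word, total]
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> ones, Context ctx)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable one : ones) total += one.get();
      ctx.write(word, new IntWritable(total)); // emit [word, total]
    }
  }
}
```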
  14. OK, that’s cool, but… how does it solve my problem?
  15. Because…
      § Hadoop is not a database
      § Hadoop “apparently” only processes data
      § Hadoop does not allow “lookups”
      Hadoop is a paradigm shift that is difficult to assimilate.
  16. Architecture
  17. Philosophy
      § Always reprocess everything. EVERYTHING!
      § Why?
        • More tolerant to bugs
        • More flexible
        • More efficient. For example:
          - With a 7,200 RPM HD: ~100 random IOPS, ~40 MB/s sequential read/write
          - Hypothesis: 5 KB register size
          - … it is faster to rewrite all the data than to perform random updates when more than 1.25% of the registers have changed.
          - For 1 GB and 200,000 registers: sequential writing 25 s, random writing 33 min!
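Those figures check out: rewriting 1 GB sequentially at 40 MB/s takes about 25 s, while updating all 200,000 registers at 100 random IOPS takes 2,000 s, roughly 33 minutes. In those same 25 s the disk can perform only 25 × 100 = 2,500 random writes, and 2,500 / 200,000 = 1.25%, which is where the break-even percentage comes from.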
  18. Fetcher
      Feeds are downloaded and stored in HDFS.
      § MapReduce
        • Input: [feed_url, null]*
        • Mapper: identity
        • Reducer(feed_url, null*): downloads feed_url and stores it in an HDFS folder
      [Diagram: several reducer tasks writing the downloaded feeds into HDFS.]
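A minimal sketch of that reducer, assuming Hadoop's new Java API; the HDFS output layout and the use of the URL's hash as a file name are illustrative choices, not the talk's actual code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FetcherReducer
    extends Reducer<Text, NullWritable, NullWritable, NullWritable> {

  @Override
  protected void reduce(Text feedUrl, Iterable<NullWritable> ignored, Context ctx)
      throws IOException {
    FileSystem fs = FileSystem.get(ctx.getConfiguration());
    // Hypothetical layout: one file per feed under /feeds/current/
    Path out = new Path("/feeds/current/" + feedUrl.toString().hashCode());
    try (InputStream in = new URL(feedUrl.toString()).openStream();
         OutputStream os = fs.create(out, true)) {
      IOUtils.copyBytes(in, os, 4096, false); // stream the feed into HDFS
    }
  }
}
```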
  19. Processor
      Feeds are parsed, converted into documents and deduplicated.
      § MapReduce
        • Input: [feed_path, null]*
        • Map(feed_path, null) : [id, document]*
          - The feed is parsed and converted into documents.
        • Reduce(id, [document]*) : [id, document]
          - Receives a list of documents and keeps the most recent one (deduplication).
          - A unique and global identifier is required (idProvider + idInternal).
        • Output: [id, document]*
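A sketch of the two functions, assuming a hypothetical `DocumentWritable` type (outlined under the serialization slide below) with `getId()`, `getTimestamp()` and `copy()` helpers; the feed parser itself is stubbed out:

```java
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Processor {

  public static class ParseMapper
      extends Mapper<Text, NullWritable, Text, DocumentWritable> {
    @Override
    protected void map(Text feedPath, NullWritable v, Context ctx)
        throws IOException, InterruptedException {
      for (DocumentWritable doc : parseFeed(feedPath.toString())) {
        ctx.write(new Text(doc.getId()), doc); // key = idProvider + idInternal
      }
    }

    // Placeholder: the real job would open the feed in HDFS and parse
    // every register into a document.
    private List<DocumentWritable> parseFeed(String path) {
      throw new UnsupportedOperationException("feed parsing not shown");
    }
  }

  public static class DedupReducer
      extends Reducer<Text, DocumentWritable, Text, DocumentWritable> {
    @Override
    protected void reduce(Text id, Iterable<DocumentWritable> docs, Context ctx)
        throws IOException, InterruptedException {
      DocumentWritable newest = null;
      for (DocumentWritable doc : docs) {
        // Hadoop reuses the value object between iterations, hence copy()
        if (newest == null || doc.getTimestamp() > newest.getTimestamp()) {
          newest = doc.copy();
        }
      }
      ctx.write(id, newest); // keep only the most recent version of each id
    }
  }
}
```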
  20. Processor (II)
      § Possible problem:
        • Very large feeds
          - This does not scale, as one task has to deal with the full feed.
      § Solution:
        • Write a custom InputFormat that divides the feed into smaller pieces (see the sketch below).
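One way to implement such an InputFormat, modeled on the well-known XML-splitting pattern (e.g. Mahout's XmlInputFormat), not on the talk's actual code: splits are allowed at arbitrary byte offsets, and the record reader re-aligns each split to record boundaries. The `<entry>…</entry>` tag names are assumptions; a real feed would use its own record tags.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

/** Lets one big XML feed be processed by several map tasks by cutting it
 *  on record boundaries instead of handing the whole file to one task. */
public class FeedInputFormat extends TextInputFormat {

  @Override
  protected boolean isSplitable(JobContext ctx, Path file) {
    return true; // allow block-sized splits even for a single huge feed
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext ctx) {
    return new FeedRecordReader();
  }

  /** Emits one <entry>…</entry> block per record. A record that starts
   *  before the split's end is finished even if it runs past it, which is
   *  the usual split convention. */
  public static class FeedRecordReader extends RecordReader<LongWritable, Text> {
    private static final byte[] OPEN = "<entry>".getBytes(StandardCharsets.UTF_8);
    private static final byte[] CLOSE = "</entry>".getBytes(StandardCharsets.UTF_8);

    private FSDataInputStream in;
    private long start, end;
    private final DataOutputBuffer buffer = new DataOutputBuffer();
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
      FileSplit fileSplit = (FileSplit) split;
      start = fileSplit.getStart();
      end = start + fileSplit.getLength();
      in = fileSplit.getPath().getFileSystem(ctx.getConfiguration()).open(fileSplit.getPath());
      in.seek(start);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (in.getPos() < end && readUntilMatch(OPEN, false)) {
        buffer.write(OPEN);
        if (readUntilMatch(CLOSE, true)) {
          key.set(in.getPos());
          value.set(buffer.getData(), 0, buffer.getLength());
          buffer.reset();
          return true;
        }
      }
      return false;
    }

    /** Advances until `match` is seen; buffers the bytes when `keep` is set. */
    private boolean readUntilMatch(byte[] match, boolean keep) throws IOException {
      int matched = 0;
      while (true) {
        int b = in.read();
        if (b == -1) return false;
        if (keep) buffer.write(b);
        matched = (b == match[matched]) ? matched + 1 : (b == match[0] ? 1 : 0);
        if (matched == match.length) return true;
        // don't start a new record past the end of this split
        if (!keep && matched == 0 && in.getPos() >= end) return false;
      }
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() throws IOException {
      return end == start ? 1f : Math.min(1f, (in.getPos() - start) / (float) (end - start));
    }
    @Override public void close() throws IOException { in.close(); }
  }
}
```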
  21. Serialization
      § Writables
        • Native Hadoop serialization
        • Low-level API
        • Basic types: IntWritable, Text, etc.
      § Others
        • Thrift, Avro, Protostuff
        • Backwards compatibility
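For illustration, here is the hypothetical `DocumentWritable` used throughout these sketches, serialized with the native Writable API; the field set (timestamp, country, price) is an assumption made for the later examples, not the talk's schema:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

/** Hypothetical ad document, serialized with Hadoop's low-level Writable API. */
public class DocumentWritable implements Writable {
  private String id;       // global id: idProvider + idInternal
  private long timestamp;  // used for deduplication and reconciliation
  private String country;  // used for vertical partitioning
  private double price;
  private double oldPrice; // filled in by the reconciliation step

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(id);
    out.writeLong(timestamp);
    out.writeUTF(country);
    out.writeDouble(price);
    out.writeDouble(oldPrice);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    id = in.readUTF();
    timestamp = in.readLong();
    country = in.readUTF();
    price = in.readDouble();
    oldPrice = in.readDouble();
  }

  /** Deep copy; needed because Hadoop reuses Writable instances. */
  public DocumentWritable copy() {
    DocumentWritable c = new DocumentWritable();
    c.id = id; c.timestamp = timestamp; c.country = country;
    c.price = price; c.oldPrice = oldPrice;
    return c;
  }

  public String getId() { return id; }
  public long getTimestamp() { return timestamp; }
  public String getCountry() { return country; }
  public double getPrice() { return price; }
  public void setOldPrice(double p) { oldPrice = p; }
}
```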
  22. Indexer
      [Diagram: each reducer task generates one index shard; each shard is hot-swapped into a production Solr web server (shards 1, 2 and 3).]
  23. Indexer (II)
      § SOLR-1301
        • SolrOutputFormat
        • 1 index per reducer
        • A custom Partitioner can be used to control where to place each document
      § Another option:
        • Writing your own indexing code
          - By creating a custom output format
          - By indexing at the reducer level. In each reduce call: open an index, write all incoming registers, close the index.
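A sketch of the second option, indexing at the reducer level with the Lucene 5+ API and the hypothetical `DocumentWritable`; the local shard path is a placeholder, and a real job would move the finished shard to HDFS for the hot swap:

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class IndexingReducer
    extends Reducer<IntWritable, DocumentWritable, NullWritable, NullWritable> {

  @Override
  protected void reduce(IntWritable shard, Iterable<DocumentWritable> docs, Context ctx)
      throws IOException {
    // One reduce call per shard: open the index, write every incoming
    // register, close the index (as described on the slide).
    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("/tmp/shard-" + shard.get())),
        new IndexWriterConfig(new StandardAnalyzer()))) {
      for (DocumentWritable d : docs) {
        Document doc = new Document();
        doc.add(new TextField("id", d.getId(), Field.Store.YES));
        doc.add(new TextField("country", d.getCountry(), Field.Store.YES));
        writer.addDocument(doc);
      }
    } // closing the writer finishes the shard
  }
}
```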
  24. Search & partitioning
      § Different partitioning schemas:
        • Horizontal
          - Each search involves all shards.
        • Vertical: by ad type, country, etc.
          - Searches can be restricted to the involved shard.
      § Solr for index serving. Possibilities:
        • Non-federated Solr (only for vertical partitioning)
        • Distributed Solr
        • SolrCloud
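For vertical partitioning, a custom Partitioner is enough to route every document of one country to the same reducer, and therefore to the same shard. A minimal sketch, reusing the hypothetical `DocumentWritable`:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/** Vertical partitioning sketch: all documents of one country end up in
 *  one shard, so a country-restricted search touches a single shard. */
public class CountryPartitioner extends Partitioner<Text, DocumentWritable> {
  @Override
  public int getPartition(Text id, DocumentWritable doc, int numPartitions) {
    // mask the sign bit so the modulo is always non-negative
    return (doc.getCountry().hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
```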
  25. Reconciliation
      [Diagram: documents from the fetcher plus the last-execution file enter the reconciliation step; the reconciled documents continue to the next steps.]
      § How to register changes?
        • Changes in price, features, etc.
        • MapReduce:
          - Input: [id, document]* (from the last execution and from the current processing)
          - Map: identity
          - Reduce(id, [document]*) : [id, document]
            • Documents are grouped by ID, so new and old documents come together.
            • New and old documents are compared.
            • The relevant information is stored in the new document (e.g., the old price).
            • Only the new document is emitted.
      § This is the closest thing in Hadoop to a DB.
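A sketch of that reduce function, again assuming the hypothetical `DocumentWritable`; the timestamp distinguishes the last-execution document from the current one, and `setOldPrice()` stands in for whatever “relevant information” needs to be carried over:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReconciliationReducer
    extends Reducer<Text, DocumentWritable, Text, DocumentWritable> {

  @Override
  protected void reduce(Text id, Iterable<DocumentWritable> docs, Context ctx)
      throws IOException, InterruptedException {
    DocumentWritable oldDoc = null, newDoc = null;
    for (DocumentWritable d : docs) {
      DocumentWritable copy = d.copy(); // Hadoop reuses the value object
      if (newDoc == null || copy.getTimestamp() > newDoc.getTimestamp()) newDoc = copy;
      if (oldDoc == null || copy.getTimestamp() < oldDoc.getTimestamp()) oldDoc = copy;
    }
    if (oldDoc != newDoc) {
      newDoc.setOldPrice(oldDoc.getPrice()); // carry the relevant old information forward
    }
    ctx.write(id, newDoc); // only the new document is emitted
  }
}
```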
  26. Advantages of the architecture
      § Horizontal scalability
        • If properly programmed
      § High tolerance to failures and bugs
        • Everything is always reprocessed.
      § Flexible
        • It is easy to make big changes.
      § High decoupling
        • Indexes are the only interaction between the back-end and the front-end.
        • Web servers can keep running even if the back-end is broken.
  27. Disadvantages
      § Batch processing
        • No real-time or “near real-time”
        • Update cycles of hours
      § Completely different programming paradigm
        • Steep learning curve
  28. Improvements
      § System for images
      § Fuzzy duplicate detection
      § Plasam:
        • Mixes this architecture with a by-pass system that provides near-real-time updates to the front-end indexes
          - Implementing a by-pass to the Solrs
          - A system for ensuring data consistency (without jumps back in time)
        • Combines the advantages of the proposed architecture with near real time
        • Datasalt has a prototype ready
  29. Thanks! Iván de Prado, @ivanprado