Scalable vertical search engine with Hadoop

Comment: "Maybe you can check out HSearch, the real-time, distributed open source search engine built on top of Hadoop and HBase. By the way, I am a developer of HSearch, so take it as a biased opinion."

Presentation Transcript

    • Hadoop use case: A scalable vertical search engine. Iván de Prado Alonso, Datasalt co-founder. Twitter: @ivanprado
    • Content
      §  The problem
      §  The obvious solution
      §  When the obvious solution fails…
      §  … Hadoop comes to the rescue
      §  Advantages & disadvantages
      §  Improvements
    • What is a vertical search engine?
      [Diagram: providers send feeds to the vertical search engine, which answers users' searches]
    • Some of them
    • The "obvious" architecture
      The first thing that comes to your mind:
      [Diagram: each feed is downloaded & processed; the database answers "Does it exist? Has it changed?"; registers are inserted/updated in the database and in the Lucene/Solr index; the search page queries the index]
    • How it works
      §  Feed download
      §  For every register in the feed
        •  Check whether it exists in the DB
        •  If it exists and has changed, update:
          ª  The DB
          ª  The index
        •  If it doesn't exist, insert into:
          ª  The DB
          ª  The index
    • How it works (II)
      §  The database is used for
        •  Checking register existence (avoiding duplicates)
        •  Managing the data with the convenience of SQL
      §  Lucene/Solr is used for
        •  Quick searches
        •  Searching by structured fields
        •  Free-text searches
        •  Faceting
    • But if things go well…
      [Diagram: the handful of feeds turns into a flood of feeds]
    • Huge jam!
    • “Swiss army knife of the 21st century” Media Guardian Innovation Awards http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
    • Hadoop
      "The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model" (from the Hadoop homepage)
    • File System
      §  Distributed File System (HDFS)
        •  Cluster of nodes exposing their storage capacity
        •  Big blocks: 64 MB
        •  Fault tolerant (replication)
        •  Storage of big files
    • MapReduce
      §  Two functions (Map and Reduce)
        •  Map(k, v) : [z, w]*
        •  Reduce(z, w*) : [u, v]*
      §  Example: word count (sketched below)
        •  Map([document, null]) -> [word, 1]*
        •  Reduce(word, 1*) -> [word, total]
      §  MapReduce & SQL
        •  SELECT word, COUNT(*) … GROUP BY word
      §  Distributed execution on a cluster
      §  Horizontal scalability
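      A minimal Java sketch of the word-count example above, using the org.apache.hadoop.mapreduce API (class names are illustrative):

        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        // Map(k, v) : [z, w]* -- emits (word, 1) for every word in a line.
        public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable offset, Text line, Context ctx)
              throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
              word.set(tokens.nextToken());
              ctx.write(word, ONE);
            }
          }
        }

        // Reduce(z, w*) : [u, v]* -- sums the 1s emitted for each word.
        class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
              throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) {
              total += c.get();
            }
            ctx.write(word, new IntWritable(total));
          }
        }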
    • OK, that's cool, but… how does it solve my problem?
    • Because…
      §  Hadoop is not a database
      §  Hadoop "apparently" only processes data
      §  Hadoop does not allow "lookups"
      Hadoop is a paradigm shift that is difficult to assimilate
    • Architecture
    • Philosophy
      §  Always reprocess everything. EVERYTHING!
      §  Why?
        •  More tolerant to bugs
        •  More flexible
        •  More efficient. E.g.:
          ª  With a 7200 RPM HD
            –  Random IOPS: ~100
            –  Sequential read/write: ~40 MB/s
            –  Hypothesis: 5 KB register size
          ª  … it is faster to rewrite all data than to perform random updates when more than 1.25% of the registers have changed
            –  1 GB, 200,000 registers
              »  Sequential writing: 25 s
              »  Random writing: 33 min!
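      Where those figures come from: rewriting 1 GB sequentially at 40 MB/s takes about 25 s. In those same 25 s, a disk limited to 100 random IOPS completes only 25 × 100 = 2,500 in-place updates, i.e. 2,500 / 200,000 = 1.25% of the registers, which is the break-even point. Randomly updating all 200,000 registers would take 200,000 / 100 = 2,000 s, roughly 33 minutes.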
    • Fetcher
      Feeds are downloaded and stored in HDFS (sketch below).
      §  MapReduce
        •  Input: [feed_url, null]*
        •  Mapper: identity
        •  Reducer(feed_url, null*)
          ª  Downloads feed_url and stores it in an HDFS folder
      [Diagram: reducer tasks write the downloaded feeds to HDFS]
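      A minimal sketch of such a fetcher reducer, assuming a hypothetical /feeds/raw output folder:

        import java.io.IOException;
        import java.io.InputStream;
        import java.io.OutputStream;
        import java.net.URL;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IOUtils;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;

        // Reducer(feed_url, null*): downloads the feed and stores it in HDFS.
        // Grouping by URL spreads the downloads across the reducer tasks.
        public class FetcherReducer
            extends Reducer<Text, NullWritable, NullWritable, NullWritable> {
          @Override
          protected void reduce(Text feedUrl, Iterable<NullWritable> ignored, Context ctx)
              throws IOException, InterruptedException {
            FileSystem fs = FileSystem.get(ctx.getConfiguration());
            // Hypothetical layout: one file per feed under /feeds/raw.
            Path out = new Path("/feeds/raw",
                Integer.toHexString(feedUrl.toString().hashCode()));
            try (InputStream in = new URL(feedUrl.toString()).openStream();
                 OutputStream outStream = fs.create(out, true)) {
              IOUtils.copyBytes(in, outStream, 4096, false);
            }
          }
        }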
    • Processor
      Feeds are parsed, converted into documents, and deduplicated (sketch below)
      §  MapReduce
        •  Input: [feed_path, null]*
        •  Map(feed_path, null) : [id, document]*
          ª  The feed is parsed and converted into documents
        •  Reducer(id, [document]*) : [id, document]
          ª  Receives a list of documents and keeps the most recent one (deduplication)
          ª  A unique and global identifier is required (idProvider + idInternal)
        •  Output: [id, document]*
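      A sketch of the processor, where DocumentWritable and FeedParser are assumed helper types (a serializable document record and a feed parser), not part of Hadoop:

        import java.io.IOException;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        // Map(feed_path, null): parse the feed into documents keyed by global id.
        public class ProcessorMapper
            extends Mapper<Text, NullWritable, Text, DocumentWritable> {
          @Override
          protected void map(Text feedPath, NullWritable v, Context ctx)
              throws IOException, InterruptedException {
            for (DocumentWritable doc : FeedParser.parse(feedPath.toString())) {
              // Global id = idProvider + idInternal, as on the slide.
              ctx.write(new Text(doc.getProviderId() + ":" + doc.getInternalId()), doc);
            }
          }
        }

        // Reduce(id, [document]*): keep only the most recent version.
        class ProcessorReducer
            extends Reducer<Text, DocumentWritable, Text, DocumentWritable> {
          @Override
          protected void reduce(Text id, Iterable<DocumentWritable> docs, Context ctx)
              throws IOException, InterruptedException {
            DocumentWritable newest = null;
            for (DocumentWritable d : docs) {
              if (newest == null || d.getTimestamp() > newest.getTimestamp()) {
                newest = DocumentWritable.copy(d); // Hadoop reuses value objects
              }
            }
            ctx.write(id, newest); // deduplication: one document per id
          }
        }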
    • Processor (II)
      §  Possible problem:
        •  Very large feeds
          ª  Do not scale, as one task will deal with the full feed
      §  Solution
        •  Write a custom InputFormat that divides the feed into smaller pieces, as sketched below
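      One way to build it, modeled on the well-known XmlInputFormat pattern (the <item> tag names are an assumption): FileInputFormat already produces one split per HDFS block, and a custom RecordReader extracts the whole records that start inside its split:

        import java.io.IOException;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.InputSplit;
        import org.apache.hadoop.mapreduce.RecordReader;
        import org.apache.hadoop.mapreduce.TaskAttemptContext;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.FileSplit;

        // One split per HDFS block (FileInputFormat's default), so several
        // map tasks can share one big feed file.
        public class FeedInputFormat extends FileInputFormat<LongWritable, Text> {
          @Override
          public RecordReader<LongWritable, Text> createRecordReader(
              InputSplit split, TaskAttemptContext ctx) {
            return new FeedRecordReader();
          }
        }

        // Emits every <item>...</item> record whose start tag lies in the split.
        class FeedRecordReader extends RecordReader<LongWritable, Text> {
          private static final byte[] START = "<item>".getBytes();
          private static final byte[] END = "</item>".getBytes();
          private FSDataInputStream in;
          private long start, end;
          private final LongWritable key = new LongWritable();
          private final Text value = new Text();

          @Override
          public void initialize(InputSplit split, TaskAttemptContext ctx)
              throws IOException {
            FileSplit fileSplit = (FileSplit) split;
            start = fileSplit.getStart();
            end = start + fileSplit.getLength();
            Path path = fileSplit.getPath();
            in = path.getFileSystem(ctx.getConfiguration()).open(path);
            in.seek(start);
          }

          @Override
          public boolean nextKeyValue() throws IOException {
            // A record belongs to this split if its start tag begins before
            // `end`; the record itself may run past the split boundary.
            if (in.getPos() < end && readUntilMatch(START, false)) {
              key.set(in.getPos() - START.length);
              value.clear();
              value.append(START, 0, START.length);
              return readUntilMatch(END, true);
            }
            return false;
          }

          // Reads byte by byte until `match` is seen; buffers into `value` if asked.
          private boolean readUntilMatch(byte[] match, boolean buffer) throws IOException {
            int i = 0;
            while (true) {
              int b = in.read();
              if (b == -1) return false;                       // end of file
              if (buffer) value.append(new byte[]{(byte) b}, 0, 1);
              i = (b == match[i]) ? i + 1 : (b == match[0] ? 1 : 0);
              if (i == match.length) return true;              // full tag matched
              if (!buffer && i == 0 && in.getPos() >= end) return false;
            }
          }

          @Override public LongWritable getCurrentKey() { return key; }
          @Override public Text getCurrentValue() { return value; }
          @Override public float getProgress() throws IOException {
            return end == start ? 1f
                : Math.min(1f, (in.getPos() - start) / (float) (end - start));
          }
          @Override public void close() throws IOException { if (in != null) in.close(); }
        }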
    • Serialization
      §  Writables
        •  Native Hadoop serialization
        •  Low-level API (example below)
        •  Basic types: IntWritable, Text, etc.
      §  Others
        •  Thrift, Avro, Protostuff
        •  Backwards compatibility
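      For illustration, a minimal custom Writable (the type and its fields are hypothetical). Every field is read and written by hand, in order; adding a field changes the wire format and breaks old data, which is why Thrift, Avro or Protostuff are attractive when backwards compatibility matters:

        import java.io.DataInput;
        import java.io.DataOutput;
        import java.io.IOException;
        import org.apache.hadoop.io.Writable;

        // Low-level API: serialization order must match on both sides.
        public class AdWritable implements Writable {
          private String id = "";
          private long priceCents;

          @Override
          public void write(DataOutput out) throws IOException {
            out.writeUTF(id);
            out.writeLong(priceCents);
          }

          @Override
          public void readFields(DataInput in) throws IOException {
            id = in.readUTF();
            priceCents = in.readLong();
          }
        }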
    • Indexer
      [Diagram: each reducer task builds an index shard; every shard is hot-swapped into the production Solr index from which the web servers serve searches]
    • Indexer (II)
      §  SOLR-1301
        •  https://issues.apache.org/jira/browse/SOLR-1301
        •  SolrOutputFormat
        •  1 index per reducer
        •  A custom Partitioner can be used to control where to place each document
      §  Another option
        •  Writing your own indexing code (sketch below)
          ª  By creating a custom output format
          ª  By indexing at the reducer level. In each reduce call:
            –  Open an index
            –  Write all incoming registers
            –  Close the index
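      A sketch of the reducer-level option, assuming a recent Lucene API (field names and the local index path are illustrative):

        import java.io.IOException;
        import java.nio.file.Paths;
        import org.apache.hadoop.io.NullWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.document.TextField;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.store.FSDirectory;

        // One reduce call per shard key: open an index, write all incoming
        // registers, close the index (closing commits the segment).
        public class IndexerReducer
            extends Reducer<Text, Text, NullWritable, NullWritable> {
          @Override
          protected void reduce(Text shard, Iterable<Text> registers, Context ctx)
              throws IOException, InterruptedException {
            IndexWriterConfig conf = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/index-" + shard)), conf)) {
              for (Text register : registers) {
                Document doc = new Document();
                doc.add(new TextField("content", register.toString(), Field.Store.YES));
                writer.addDocument(doc);
              }
            }
          }
        }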
    • Search & Partitioning
      §  Different partitioning schemes
        •  Horizontal
          ª  Each search involves all shards
        •  Vertical: by ad type, country, etc. (see the partitioner sketch below)
          ª  Searches can be restricted to the involved shard
      §  Solr for index serving. Possibilities:
        ª  Non-federated Solr
          –  Only for vertical partitioning
        ª  Distributed Solr
        ª  SolrCloud
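      A sketch of a vertical partitioner that routes documents to shards by country; the "country|docId" key format is an assumption for illustration:

        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Partitioner;

        // All documents of one country land in one shard, so searches
        // restricted to that country only hit a single index.
        public class CountryPartitioner extends Partitioner<Text, Text> {
          @Override
          public int getPartition(Text key, Text value, int numPartitions) {
            String country = key.toString().split("\\|", 2)[0];
            return (country.hashCode() & Integer.MAX_VALUE) % numPartitions;
          }
        }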
    • Reconciliation
      [Diagram: documents from the Fetcher and from the last-execution file enter Reconciliation; the reconciled documents continue to the next steps]
      §  How are changes recorded?
        •  Changes in price, features, etc.
        •  MapReduce (sketch below):
          ª  Input: [id, document]*
            –  From the last execution
            –  From the current processing
          ª  Map: identity
          ª  Reduce(id, [document]*) : [id, document]
            –  Documents are grouped by id, so new and old documents come together
            –  New and old documents are compared
            –  The relevant information is stored in the new document (e.g., the old price)
            –  Only the new document is emitted
      §  This is the closest thing in Hadoop to a DB
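      A sketch of the reconciliation reduce, reusing the assumed DocumentWritable type from the processor sketch (the origin flag and price fields are also assumptions):

        import java.io.IOException;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;

        // Old (last execution) and new (current processing) versions of a
        // document meet in the same reduce call, grouped by id.
        public class ReconciliationReducer
            extends Reducer<Text, DocumentWritable, Text, DocumentWritable> {
          @Override
          protected void reduce(Text id, Iterable<DocumentWritable> docs, Context ctx)
              throws IOException, InterruptedException {
            DocumentWritable oldDoc = null;
            DocumentWritable newDoc = null;
            for (DocumentWritable d : docs) {
              if (d.isFromLastExecution()) {        // assumed origin flag
                oldDoc = DocumentWritable.copy(d);
              } else {
                newDoc = DocumentWritable.copy(d);
              }
            }
            if (newDoc == null) {
              return; // the document disappeared from the feeds
            }
            if (oldDoc != null) {
              // Carry the relevant history into the new document, e.g. the old price.
              newDoc.setOldPrice(oldDoc.getPrice()); // assumed fields
            }
            ctx.write(id, newDoc); // only the new document is emitted
          }
        }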
    • Advantages of the architecture
      §  Horizontal scalability
        •  If properly programmed
      §  High tolerance to failures and bugs
        •  Everything is always reprocessed
      §  Flexible
        •  It is easy to make big changes
      §  High decoupling
        •  Indexes are the only interaction between the back-end and the front-end
        •  Web servers can keep running even if the back-end is broken
    • Disadvantages
      §  Batch processing
        •  No real-time or "near real-time" updates
        •  Update cycles of hours
      §  Completely different programming paradigm
        •  Steep learning curve
    • Improvements
      §  System for images
      §  Fuzzy duplicate detection
      §  Plasam:
        •  Mixes this architecture with a by-pass system that provides near-real-time updates to the front-end indexes
          ª  Implementing a by-pass to the Solrs
          ª  A system for ensuring data consistency
            –  Without jumps back in time
        •  Combines the advantages of the proposed architecture with near real time
        •  Datasalt has a prototype ready
    • Thanks! Iván de Prado, ivan@datasalt.com, @ivanprado