Hadoop use case: A scalable vertical search engine
Iván de Prado Alonso, Datasalt Co-founder
Twitter: @ivanprado
Scalable vertical search engine with hadoop

Published in: Technology, Business
1 Comment
  • Maybe you can check out HSearch - the real time, distributed open source search engine built on top of Hadoop and Hbase. Btw I am a developer of HSearch, so take it as a biased opinion.

Transcript of "Scalable vertical search engine with hadoop"

  1. Hadoop use case: A scalable vertical search engine
     Iván de Prado Alonso, Datasalt Co-founder
     Twitter: @ivanprado
  2. Content
     §  The problem
     §  The obvious solution
     §  When the obvious solution fails…
     §  … Hadoop comes to the rescue
     §  Advantages & disadvantages
     §  Improvements
  3. What is a vertical search engine?
     [Diagram: Provider 1 and Provider 2 send feeds to the Vertical Search Engine, which serves searches.]
  4. Some of them
  5. The “obvious” architecture
     The first thing that comes to your mind
     [Diagram: Feed → Download & Process → Database (Does it exist? Has it changed? Insert/update); Download & Process → Lucene/Solr index (Insert/update) → Search page.]
  6. How it works
     §  Feed download
     §  For every record in the feed
        •  Check for existence in the DB
        •  If it exists and has changed, update
           ª  The DB
           ª  The Index
        •  If it doesn’t exist, insert into
           ª  The DB
           ª  The Index
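The per-record loop of the "obvious" solution can be sketched as follows. This is a minimal in-memory illustration, assuming plain dicts stand in for the SQL database and the Lucene/Solr index; `process_feed` and the record fields are hypothetical names, not part of the original deck.

```python
def process_feed(records, db, index):
    """Apply one feed to the database and the index, record by record.

    `records` is an iterable of dicts with an "id" key. `db` and `index`
    are plain dicts standing in for the SQL database and the search index
    (an assumption made for illustration only).
    Returns a tuple (inserted, updated).
    """
    inserted = updated = 0
    for record in records:
        key = record["id"]
        existing = db.get(key)      # one lookup per record: the bottleneck at scale
        if existing is None:
            db[key] = record        # new record: insert into the DB and the index
            index[key] = record
            inserted += 1
        elif existing != record:
            db[key] = record        # changed record: update the DB and the index
            index[key] = record
            updated += 1
        # unchanged records are left alone
    return inserted, updated


db, index = {}, {}
first = process_feed([{"id": "a", "price": 10}, {"id": "b", "price": 20}], db, index)
second = process_feed([{"id": "a", "price": 12}, {"id": "b", "price": 20}], db, index)
print(first, second)
```

Every record costs at least one lookup against the database, which is what stops scaling once the number of feeds grows.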
  7. How it works (II)
     §  The Database is used for
        •  Checking for record existence (avoiding duplicates)
        •  Managing the data with SQL facilities
     §  Lucene/Solr is used for
        •  Quick searches
        •  Searching by structured fields
        •  Free-text searches
        •  Faceting
  8. But if things go well…
     [Diagram: the number of feeds to process keeps growing.]
  9. Huge jam!
  10. “Swiss army knife of the 21st century”
      Media Guardian Innovation Awards
      http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
  11. Hadoop
      “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model”
      From Hadoop homepage
  12. File System
      §  Distributed File System (HDFS)
         •  Cluster of nodes exposing their storage capacity
         •  Big blocks: 64 MB
         •  Fault tolerant (replication)
         •  Big files storage
  13. MapReduce
      §  Two functions (Map and Reduce)
         •  Map(k, v) : [z, w]*
         •  Reduce(z, w*) : [u, v]*
      §  Example: word count
         •  Map([document, null]) -> [word, 1]*
         •  Reduce(word, 1*) -> [word, total]
      §  MapReduce & SQL
         •  SELECT word, count(*) GROUP BY word
      §  Distributed execution on a cluster
      §  Horizontal scalability
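The word count example can be simulated in a few lines. This is an in-memory sketch of the Map and Reduce signatures from the slide, not Hadoop's actual API; `run_map_reduce` is a hypothetical helper that plays the role of the framework's grouping step.

```python
from collections import defaultdict


def map_word_count(document, _value):
    """Map(document, null) -> [word, 1]*"""
    for word in document.split():
        yield word.lower(), 1


def reduce_word_count(word, counts):
    """Reduce(word, 1*) -> [word, total]"""
    yield word, sum(counts)


def run_map_reduce(inputs, mapper, reducer):
    """Run mapper and reducer in memory, grouping the map output by key."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):  # keys reach the reducer in sorted order
        results.extend(reducer(out_key, groups[out_key]))
    return results


counts = run_map_reduce(
    [("to be or not to be", None)], map_word_count, reduce_word_count
)
print(counts)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

The grouping by key between the two functions is what corresponds to the GROUP BY in the SQL analogy.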
  14. Ok, that’s cool, but… How does it solve my problem?
  15. Because…
      §  Hadoop is not a Database
      §  Hadoop “apparently” only processes data
      §  Hadoop does not allow “lookups”
      Hadoop is a paradigm shift that is difficult to assimilate
  16. Architecture
  17. Philosophy
      §  Always reprocess everything. EVERYTHING!
      §  Why?
         •  More bug tolerant
         •  More flexible
         •  More efficient. E.g.:
            ª  With a 7200 RPM HD
               -  Random IOPS: 100
               -  Sequential read/write: 40 MB/s
               -  Hypothesis: 5 KB record size
            ª  … it is faster to rewrite all data than to perform random updates when more than 1.25% of the records have changed.
               -  1 GB, 200,000 records
                  »  Sequential writing: 25 s
                  »  Random writing: 33 min!
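The figures on this slide follow from simple arithmetic, reproduced below. The sketch assumes decimal units (1 GB = 1000 MB, 1 MB = 1000 KB) and one random I/O operation per updated record; the variable names are illustrative.

```python
RANDOM_IOPS = 100          # random operations per second (7200 RPM disk)
SEQUENTIAL_MB_PER_S = 40   # sequential read/write throughput
RECORD_KB = 5              # hypothesised record size
RECORDS = 200_000          # 200,000 records * 5 KB = 1 GB

total_mb = RECORDS * RECORD_KB / 1000

# Rewriting everything sequentially.
sequential_seconds = total_mb / SEQUENTIAL_MB_PER_S

# Updating every record with one random write each.
random_seconds = RECORDS / RANDOM_IOPS

# Number of random updates that take as long as a full sequential rewrite.
break_even_records = sequential_seconds * RANDOM_IOPS
break_even_fraction = break_even_records / RECORDS

print(f"sequential rewrite: {sequential_seconds:.0f} s")
print(f"random rewrite:     {random_seconds / 60:.0f} min")
print(f"break-even:         {break_even_fraction:.2%} of the records")
```

Above the break-even fraction, a full sequential rewrite is cheaper than updating the changed records in place.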
  18. Fetcher
      Feeds are downloaded and stored in the HDFS.
      §  MapReduce
         •  Input: [feed_url, null]*
         •  Mapper: identity
         •  Reducer(feed_url, null*)
            ª  Download the feed_url and store it in an HDFS folder
      [Diagram: several Reducer Tasks download feeds in parallel and write them to HDFS.]
  19. Processor
      Feeds are parsed, converted into documents and deduplicated
      §  MapReduce
         •  Input: [feed_path, null]*
         •  Map(feed_path, null) : [id, document]*
            ª  The feed is parsed and converted into documents
         •  Reducer(id, [document]*): [id, document]
            ª  Receives a list of documents and keeps the most recent one (deduplication)
            ª  A unique and global identifier is required (idProvider + idInternal)
         •  Output: [id, document]*
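The Processor's deduplication step can be sketched in memory as below. The sketch assumes each document carries a `timestamp` field that defines "most recent" and that the global identifier is built as `provider_id:internal_id`; both are assumptions for illustration, since the slides do not specify either.

```python
from collections import defaultdict


def parse_feed(provider_id, rows):
    """Map step: turn raw feed rows into (global_id, document) pairs."""
    for row in rows:
        # Global identifier = provider id + provider-internal id (assumed format).
        global_id = f"{provider_id}:{row['internal_id']}"
        yield global_id, {
            "id": global_id,
            "title": row["title"],
            "timestamp": row["timestamp"],
        }


def keep_most_recent(global_id, documents):
    """Reduce step: of all documents sharing an id, keep the newest one."""
    return global_id, max(documents, key=lambda doc: doc["timestamp"])


def deduplicate(feeds):
    """Group parsed documents by id and reduce each group to one document."""
    groups = defaultdict(list)
    for provider_id, rows in feeds.items():
        for global_id, document in parse_feed(provider_id, rows):
            groups[global_id].append(document)
    return dict(keep_most_recent(gid, docs) for gid, docs in groups.items())


feeds = {
    "p1": [
        {"internal_id": "7", "title": "Flat, old listing", "timestamp": 1},
        {"internal_id": "7", "title": "Flat, new listing", "timestamp": 2},
    ],
    "p2": [{"internal_id": "7", "title": "Car", "timestamp": 1}],
}
documents = deduplicate(feeds)
print(sorted(documents))
```

Prefixing the provider id keeps two providers that both use the internal id `7` from colliding.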
  20. Processor (II)
      §  Possible problem:
         •  Very large feeds
            ª  Does not scale, as one task will deal with the full feed.
      §  Solution
         •  Write a custom InputFormat that divides the feed into smaller pieces.
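The idea of dividing a feed into smaller pieces can be illustrated without Hadoop. The sketch below assumes a feed with one record per line (real feeds may be XML or CSV with different record boundaries) and cuts it into byte ranges; each range is read so that every record is processed exactly once. `compute_splits` and `read_split` are hypothetical names, not the InputFormat API.

```python
def compute_splits(total_bytes, split_size):
    """Cut [0, total_bytes) into consecutive byte ranges of at most split_size."""
    return [
        (start, min(start + split_size, total_bytes))
        for start in range(0, total_bytes, split_size)
    ]


def read_split(data, start, end):
    """Return the records (lines) that begin inside the byte range [start, end).

    A record belongs to the split in which it starts, so a reader skips a
    partial first line and may read past `end` to finish its last record.
    """
    pos = start
    if start > 0 and data[start - 1:start] != b"\n":
        newline = data.find(b"\n", start)
        if newline == -1:
            return []          # the range lies entirely inside one record
        pos = newline + 1      # skip the tail of a record owned by the previous split
    records = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        records.append(data[pos:stop].decode("utf-8"))
        pos = stop + 1
    return records


feed = b"id=1;price=10\nid=2;price=2000\nid=3;price=5\nid=4;price=75\n"
pieces = [read_split(feed, s, e) for s, e in compute_splits(len(feed), 20)]
print(pieces)
```

Each piece can then be handed to a separate map task, so a single very large feed no longer ties up one task.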
  21. Serialization
      §  Writables
         •  Native Hadoop Serialization
         •  Low level API
         •  Basic types: IntWritable, Text, etc.
      §  Others
         •  Thrift, Avro, Protostuff
         •  Backwards compatibility
  22. Indexer
      [Diagram: each Reducer Task builds one index shard (Shard 1, 2, 3); each shard is hot-swapped into the production Solr behind the web servers.]
  23. Indexer (II)
      §  SOLR-1301
         •  https://issues.apache.org/jira/browse/SOLR-1301
         •  SolrOutputFormat
         •  1 index per reducer
         •  A custom Partitioner can be used to control where to place each document
      §  Another option
         •  Writing your own indexing code
            ª  By creating a custom output format
            ª  By indexing at the reducer level. In each reduce call:
               -  Open an index
               -  Write all incoming records
               -  Close the index
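With one index per reducer, the partitioning function decides which shard a document ends up in. The sketch below shows a hash-based placement in plain Python; CRC32 is an assumed stand-in for whatever hash a real Partitioner would use, and `shard_for` and `partition` are hypothetical names.

```python
import zlib


def shard_for(document_id, num_shards):
    """Map a document id to a shard number in a stable, repeatable way."""
    return zlib.crc32(document_id.encode("utf-8")) % num_shards


def partition(documents, num_shards):
    """Distribute documents over num_shards lists, one list per reducer/index."""
    shards = [[] for _ in range(num_shards)]
    for document in documents:
        shards[shard_for(document["id"], num_shards)].append(document)
    return shards


documents = [{"id": f"p1:{n}"} for n in range(10)]
shards = partition(documents, 3)
print([len(shard) for shard in shards])
```

A stable hash matters here: the same document must land in the same shard on every run, otherwise a full reprocess would shuffle documents between indexes.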
  24. Search & Partitioning
      §  Different partitioning schemas
         •  Horizontal
            ª  Each search involves all shards
         •  Vertical: by ad type, country, etc.
            ª  Searches can be restricted to the involved shard
      §  Solr for index serving. Possibilities:
         ª  Non federated Solr
            -  Only for vertical partitioning
         ª  Distributed Solr
         ª  Solr Cloud
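The difference between the two schemas shows up in how many shards a query touches. The sketch below models vertical partitioning by a `country` field with in-memory lists; the field name and the `search` helper are assumptions for illustration and do not reflect Solr's query API.

```python
from collections import defaultdict


def build_vertical_shards(documents, field):
    """Group documents into one shard per distinct value of `field`."""
    shards = defaultdict(list)
    for document in documents:
        shards[document[field]].append(document)
    return dict(shards)


def search(shards, predicate, shard_key=None):
    """Return (hits, number_of_shards_queried).

    With shard_key set, only that shard is queried (vertical partitioning).
    Without it, every shard is queried, as with horizontal partitioning.
    """
    targets = [shard_key] if shard_key is not None else sorted(shards)
    hits = [
        document
        for key in targets
        for document in shards.get(key, [])
        if predicate(document)
    ]
    return hits, len(targets)


documents = [
    {"id": "p1:1", "country": "ES", "price": 100},
    {"id": "p1:2", "country": "FR", "price": 80},
    {"id": "p2:1", "country": "ES", "price": 300},
]
shards = build_vertical_shards(documents, "country")
cheap_in_spain, queried = search(shards, lambda d: d["price"] < 200, shard_key="ES")
cheap_anywhere, queried_all = search(shards, lambda d: d["price"] < 200)
print(queried, queried_all)
```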
  25. Reconciliation
      [Diagram: documents from the Fetcher and the file from the last execution enter Reconciliation; the reconciled documents go on to the next steps.]
      §  How to register changes?
         •  Changes in price, features, etc.
         •  MapReduce:
            ª  Input: [id, document]*
               -  From last execution
               -  From current processing
            ª  Map: identity
            ª  Reduce(id, [document]*) : [id, document]
               -  Documents grouped by ID. New and old documents come together.
               -  New and old documents are compared.
               -  The relevant information is stored in the new document (e.g., the old price)
               -  Only the new document is emitted.
      §  This is the closest thing in Hadoop to a DB
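The reconciliation reduce step can be sketched as a join by id between the previous and the current run. The sketch assumes documents are dicts with a `price` field, that the old price is stored under a hypothetical `old_price` key, and that documents missing from the current run are dropped; the slides do not say how disappeared documents are handled.

```python
from collections import defaultdict


def reconcile(previous_docs, current_docs):
    """Join old and new documents by id and emit only the new ones."""
    groups = defaultdict(dict)
    for document in previous_docs:
        groups[document["id"]]["old"] = document
    for document in current_docs:
        groups[document["id"]]["new"] = document

    output = {}
    for doc_id, pair in groups.items():
        new = pair.get("new")
        if new is None:
            continue  # assumption: documents absent from the current run are dropped
        result = dict(new)
        old = pair.get("old")
        if old is not None and old["price"] != new["price"]:
            result["old_price"] = old["price"]  # carry the relevant old information
        output[doc_id] = result
    return output


previous = [{"id": "a", "price": 100}, {"id": "b", "price": 50}, {"id": "c", "price": 9}]
current = [{"id": "a", "price": 90}, {"id": "b", "price": 50}, {"id": "d", "price": 7}]
reconciled = reconcile(previous, current)
print(reconciled["a"])  # {'id': 'a', 'price': 90, 'old_price': 100}
```

Because both inputs are grouped by id, the comparison needs no lookups: old and new versions simply arrive together at the same reduce call.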
  26. Advantages of the architecture
      §  Horizontal Scalability
         •  If properly programmed
      §  High tolerance to failures and bugs
         •  Everything is always reprocessed
      §  Flexible
         •  It is easy to make big changes
      §  High decoupling
         •  Indexes are the only interaction between the back-end and the front-end
         •  Web servers can keep running even if the back-end is broken.
  27. Disadvantages
      §  Batch processing
         •  No real-time or “near” real-time
         •  Update cycles of hours
      §  Completely different programming paradigm
         •  Steep learning curve
  28. Improvements
      §  System for images
      §  Fuzzy duplicates detection
      §  Plasam:
         •  Mixing this architecture with a by-pass system that provides near real-time updates to the front-end (FE) indexes
            ª  Implementing a by-pass to the Solr instances
            ª  System for ensuring data consistency
               -  Without jumping back in time
         •  That combines the advantages of the proposed architecture but with near real-time updates
         •  Datasalt has a prototype ready
  29. Thanks!
      Ivan de Prado, ivan@datasalt.com
      @ivanprado