The “obvious” architecture
The first thing that comes to mind
[Diagram labels: Feed; Download & Process; Does it exist? Has it changed?; Insert/update (Database); Insert/update (Lucene/Solr Index); Search Page]
How it works
§ Feed download
§ For every record in the feed
  • Check for existence in the DB
  • If it exists and has changed, update
    ª The DB
    ª The Index
  • If it doesn’t exist, insert into
    ª The DB
    ª The Index
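The loop above can be sketched as follows. This is only an illustration: the two dictionaries are hypothetical in-memory stand-ins for the database and the Lucene/Solr index, and `process_feed` is a name made up for this example.

```python
# Hypothetical in-memory stand-ins for the database and the search index.
database = {}
index = {}


def process_feed(records):
    """Apply the check-then-insert/update loop to one downloaded feed."""
    stats = {"inserted": 0, "updated": 0, "unchanged": 0}
    for record in records:
        key = record["id"]
        existing = database.get(key)  # one lookup per record
        if existing is None:
            stats["inserted"] += 1
        elif existing != record:
            stats["updated"] += 1
        else:
            stats["unchanged"] += 1
            continue  # nothing to write
        database[key] = record  # write to the DB...
        index[key] = record     # ...and to the index
    return stats
```

Note that every record costs at least one lookup, and every new or changed record costs two writes.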
How it works (II)
§ The Database is used for
  • Checking for record existence (avoiding duplicates)
  • Managing the data with the convenience of SQL
§ Lucene/Solr is used for
  • Quick searches
  • Searching by structured fields
  • Free-text searches
  • Faceting
But if things go well…
[Figure: dozens of feeds]
“Swiss army knife of the 21st century” Media Guardian Innovation Awards http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
Hadoop
“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model”
From Hadoop homepage
File System
§ Distributed File System (HDFS)
  • Cluster of nodes exposing their storage capacity
  • Big blocks: 64 MB
  • Fault tolerant (replication)
  • Storage of big files
MapReduce
§ Two functions (Map and Reduce)
  • Map(k, v) : [z, w]*
  • Reduce(z, w*) : [u, v]*
§ Example: word count
  • Map([document, null]) -> [word, 1]*
  • Reduce(word, 1*) -> [word, total]
§ MapReduce & SQL
  • SELECT word, count(*) GROUP BY word
§ Distributed execution on a cluster
§ Horizontal scalability
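The word-count example can be simulated on a single machine. The sketch below is not Hadoop code: `run_mapreduce` is a hypothetical helper that only mimics the map, group-by-key and reduce phases in memory.

```python
from collections import defaultdict


def map_fn(document, _value):
    """Map([document, null]) -> [word, 1]*"""
    for word in document.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce(word, 1*) -> [word, total]"""
    yield word, sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Single-process stand-in for the map, shuffle and reduce phases."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # "shuffle": group by key
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results


counts = run_mapreduce([("to be or not to be", None)], map_fn, reduce_fn)
print(counts)
```

On a real cluster the grouping step is what gets distributed: each reducer receives all the values for a subset of the keys.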
Ok, that’s cool, but… How does it solve my problem?
Because…
§ Hadoop is not a Database
§ Hadoop “apparently” only processes data
§ Hadoop does not allow “lookups”
Hadoop is a paradigm shift that is difficult to assimilate
Philosophy
§ Always reprocess everything. EVERYTHING!
§ Why?
  • More bug tolerant
  • More flexible
  • More efficient. E.g.:
    ª With a 7200 RPM HD
      – Random IOPS: 100
      – Sequential read/write: 40 MB/s
      – Hypothesis: 5 KB record size
    ª … it is faster to rewrite all the data than to perform random updates when more than 1.25% of the records have changed.
      – 1 GB, 200,000 records
        » Sequential writing: 25 s
        » Random writing: 33 min!
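The figures on this slide can be re-derived with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 1000 MB, 1 MB = 1000 KB) and one random I/O operation per updated record; both are simplifications made for this example.

```python
RANDOM_IOPS = 100    # random operations per second (7200 RPM disk)
SEQ_MB_PER_S = 40    # sequential throughput in MB/s
RECORD_KB = 5        # hypothesised record size
RECORDS = 200_000    # records in the data set

dataset_mb = RECORDS * RECORD_KB / 1000        # 1000 MB, i.e. about 1 GB
sequential_s = dataset_mb / SEQ_MB_PER_S       # rewrite everything: 25 s
random_all_s = RECORDS / RANDOM_IOPS           # update every record in place: 2000 s
break_even = sequential_s * RANDOM_IOPS / RECORDS  # fraction where both cost the same

print(f"sequential rewrite: {sequential_s:.0f} s")
print(f"random updates of all records: {random_all_s / 60:.1f} min")
print(f"break-even fraction: {break_even:.2%}")
```

Under these assumptions, random updates are only cheaper while fewer than 2,500 of the 200,000 records change per run.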
Fetcher
Feeds are downloaded and stored in HDFS.
§ MapReduce
  • Input: [feed_url, null]*
  • Mapper: identity
  • Reducer(feed_url, null*)
    ª Download the feed_url and store it in an HDFS folder
[Diagram: three Reducer Tasks writing to HDFS]
Processor
Feeds are parsed, converted into documents and deduplicated
§ MapReduce
  • Input: [feed_path, null]*
  • Map(feed_path, null) : [id, document]*
    ª The feed is parsed and converted into documents
  • Reducer(id, [document]*) : [id, document]
    ª Receives a list of documents and keeps the most recent one (deduplication)
    ª A unique and global identifier is required (idProvider + idInternal)
  • Output: [id, document]*
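A minimal sketch of the deduplicating reducer, under two assumptions that are not in the slides: each document is a dictionary carrying a `timestamp` field that defines "most recent", and the global identifier is built by joining idProvider and idInternal with a colon.

```python
def global_id(id_provider, id_internal):
    """Hypothetical global identifier: provider id plus the provider's own id."""
    return f"{id_provider}:{id_internal}"


def dedup_reduce(doc_id, documents):
    """Reducer(id, [document]*) : [id, document], keeping the newest document.

    Assumes every document has a comparable 'timestamp' field.
    """
    newest = max(documents, key=lambda doc: doc["timestamp"])
    return doc_id, newest


doc_id = global_id("providerA", "42")
duplicates = [
    {"title": "Flat in Madrid", "timestamp": 100},
    {"title": "Flat in Madrid (updated)", "timestamp": 200},
]
print(dedup_reduce(doc_id, duplicates))
```

Because MapReduce groups all values with the same key into one reduce call, duplicates of a document meet in the same place regardless of which feed they came from.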
Processor (II)
§ Possible problem:
  • Very large feeds
    ª Does not scale, as one task will deal with the full feed.
§ Solution
  • Write a custom InputFormat that divides the feed into smaller pieces.
Indexer
[Diagram: each Reducer Task builds one index shard (Index - Shard 1, 2, 3); each shard is hot-swapped into the matching index shard on the production Solr web servers]
Indexer (II)
§ SOLR-1301
  • https://issues.apache.org/jira/browse/SOLR-1301
  • SolrOutputFormat
  • 1 index per reducer
  • A custom Partitioner can be used to control where to place each document
§ Another option
  • Writing your own indexing code
    ª By creating a custom output format
    ª By indexing at the reducer level. In each reduce call:
      – Open an index
      – Write all incoming records
      – Close the index
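The "1 index per reducer" idea and the role of the partitioner can be illustrated without Hadoop or Solr. In this sketch `partition` and `build_shards` are hypothetical names, each shard is just a Python list standing in for an index, and CRC32 is an arbitrary choice of stable hash.

```python
import zlib


def partition(doc_id, num_shards):
    """Pick the reducer (and therefore the index shard) for a document id.

    zlib.crc32 is used because it is stable across processes, unlike
    Python's built-in hash() for strings.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def build_shards(documents, num_shards):
    """One 'index' (here: a list) per reducer, filled according to partition()."""
    shards = [[] for _ in range(num_shards)]
    for doc_id, doc in documents:
        shards[partition(doc_id, num_shards)].append((doc_id, doc))
    return shards


docs = [(f"providerA:{i}", {"n": i}) for i in range(10)]
shards = build_shards(docs, 3)
print([len(shard) for shard in shards])
```

Replacing `partition` with a function of the ad type or country would give the custom placement mentioned above.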
Search & Partitioning
§ Different partitioning schemes
  • Horizontal
    ª Each search involves all shards
  • Vertical: by ad type, country, etc.
    ª Searches can be restricted to the involved shard
§ Solr for index serving. Possibilities:
    ª Non-federated Solr
      – Only for vertical partitioning
    ª Distributed Solr
    ª Solr Cloud
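The difference between the two schemes shows up at query time. A small sketch, assuming a vertical partitioning by country and made-up shard names:

```python
def shards_to_query(shards_by_country, country=None):
    """Return the shards a search has to hit.

    With a country filter, only the shard holding that country is involved.
    Without one, every shard must be queried, as with horizontal partitioning.
    """
    if country is None:
        return sorted(shards_by_country.values())
    return [shards_by_country[country]]


# Hypothetical shard layout used only for this example.
layout = {"es": "shard-es", "uk": "shard-uk", "fr": "shard-fr"}
print(shards_to_query(layout, "es"))
print(shards_to_query(layout))
```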
Reconciliation
[Diagram labels: From Fetcher; Last execution file; Reconciliation; Reconciled documents; Next steps]
§ How to record changes?
  • Changes in price, features, etc.
  • MapReduce:
    ª Input: [id, document]*
      – From last execution
      – From current processing
    ª Map: identity
    ª Reduce(id, [document]*) : [id, document]
      – Documents are grouped by ID, so new and old documents come together.
      – New and old documents are compared.
      – The relevant information is stored in the new document (e.g., the old price)
      – Only the new document is emitted.
§ This is the closest thing in Hadoop to a DB
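A minimal sketch of the reconciliation reduce step. Several details are assumptions made for the example: documents are dictionaries, each one is tagged "old" (last execution) or "new" (current processing), the carried-over field is called `old_price`, and documents that are missing from the current run are dropped.

```python
def reconcile_reduce(doc_id, tagged_documents):
    """Reduce(id, [document]*) : [id, document] for reconciliation.

    tagged_documents is a list of (source, document) pairs, where source
    is "old" or "new". Returns None when there is no new document.
    """
    old = new = None
    for source, doc in tagged_documents:
        if source == "new":
            new = doc
        else:
            old = doc
    if new is None:
        return None  # assumption: documents absent from the current run are dropped
    result = dict(new)
    if old is not None and old.get("price") != new.get("price"):
        result["old_price"] = old.get("price")  # keep the relevant old information
    return doc_id, result  # only the new document is emitted


print(reconcile_reduce("providerA:42", [("old", {"price": 100}), ("new", {"price": 90})]))
```

The output of one run becomes the "last execution" input of the next, which is what makes this step play the role of a database.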
Advantages of the architecture
§ Horizontal scalability
  • If properly programmed
§ High tolerance to failures and bugs
  • Everything is always reprocessed
§ Flexible
  • It is easy to make big changes
§ High decoupling
  • Indexes are the only point of interaction between the back-end and the front-end
  • Web servers can keep running even if the back-end is broken.
Disadvantages
§ Batch processing
  • No real-time or “near” real-time
  • Update cycles of hours
§ Completely different programming paradigm
  • Steep learning curve
Improvements
§ System for images
§ Fuzzy duplicate detection
§ Plasam:
  • Mixing this architecture with a by-pass system that provides near real-time updates to the front-end indexes
    ª Implementing a by-pass to the Solr servers
    ª System for ensuring data consistency
      – Without jumping back in time
  • That combines the advantages of the proposed architecture with near real-time updates
  • Datasalt has a prototype ready
Thanks! Ivan de Prado, firstname.lastname@example.org @ivanprado