Hadoop - Integration Patterns and Practices__HadoopSummit2010

Hadoop Integration Patterns and Practices
Eric Sammer, Cloudera

Session Agenda (slide 2)
- The Problem
- Data Streams
- Databases
- Job Invocation and Workflow

The Problem (slide 3)
- Tons of data (otherwise you’re in the wrong room)
- Tons of existing systems
  - RDBMS
  - Caches
  - EDW
  - Messaging
  - Reporting
  - Scheduling and job control
  - Management and Monitoring

The Problem (slide 4)
- Tons of data (otherwise you’re in the wrong room)
- Tons of existing systems
  - RDBMS
  - Caches
  - EDW
  - Messaging
  - Reporting
  - Scheduling and job control
  - Management and Monitoring
- We’ll focus on streams, databases, and job control

File Ingestion (slide 5)
- One of the most common integration cases
- Log data, transactional data, sensor data, user activity, …
- Two flavors
  - Streams: boundaries not important

Stream Integration (slide 6)
- For data streams use Flume
- Agents collect data from a source
- Collectors aggregate agent data
- Multiple sources: text, “tail”, command line executables
- Multiple sinks including HDFS
- Support for minor record transformation, output formats, multiple tiers of collection, dynamic configuration, and more
- Open source

Other uses of streams (slide 7)
- Streams are more than just logs
- JMS, AMQP messages: wire tap and send to Flume
- Turn incremental updates into stream data to avoid DBMS “middleman” (or send to both)
- Many existing problems can be turned into asynchronous streams

Relational Databases (slide 10)
- Basic approach: queries (out), inserts (in)
  - Works, but slow
  - Beware the Hadoop -> RDBMS DDOS attack
  - Manage transactions on inserts
- Smarter: lower level export / import tools
  - Go to text formats
  - Think in batches (MR to convert text <-> SequenceFiles)
- Use Sqoop!
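The "manage transactions on inserts" and "think in batches" points can be illustrated with a minimal sketch. It assumes an in-memory SQLite database as a stand-in for the target RDBMS; the `events` table, the `batched` helper, and the batch size are hypothetical names chosen for the example, not anything from the talk. The idea is one transaction per batch, so a failure rolls back only the batch in flight and the database is not hit with one commit per row.

```python
import sqlite3
from itertools import islice


def batched(rows, size):
    """Yield successive lists of at most `size` rows (hypothetical helper)."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_in_batches(conn, rows, batch_size=1000):
    """Insert rows using one transaction per batch; return the batch count."""
    batches = 0
    for chunk in batched(rows, batch_size):
        # The connection context manager commits on success and rolls
        # back if the block raises, so a batch is applied fully or not at all.
        with conn:
            conn.executemany(
                "INSERT INTO events (id, payload) VALUES (?, ?)", chunk
            )
        batches += 1
    return batches


# SQLite stands in for the real target database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

rows = [(i, f"record-{i}") for i in range(2500)]
n_batches = load_in_batches(conn, rows, batch_size=1000)
row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(n_batches, row_count)
```

In a Hadoop job the same batching would happen per task, which is also where the connection count needs limiting to avoid the "DDOS" effect the slide warns about.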

Pattern: Incremental Merge (slide 11)
- Batch incoming edits
- Perform a MR job to apply updates
  - Input: original dataset + incoming batch(es)
  - Group by record ID
  - Secondary sort on timestamp descending
  - Reducer selects the newest record or merges changes over time
- Represent delete operations using a DELETE surrogate record
  - i.e. <timestamp>, <record id>, DELETE
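The merge logic can be sketched in plain Python, without MapReduce, to show what the reducer sees. This is an illustration under assumptions: records are `(timestamp, record_id, value)` tuples with numeric timestamps, an in-memory sort stands in for the shuffle and secondary sort, and the reducer takes the "select the newest record" option rather than merging changes over time.

```python
from itertools import groupby
from operator import itemgetter

DELETE = "DELETE"  # surrogate value marking a deleted record


def incremental_merge(original, batches):
    """Merge update batches into the original dataset.

    Records are (timestamp, record_id, value) tuples (assumed layout).
    """
    records = list(original)
    for batch in batches:
        records.extend(batch)

    # Emulate the shuffle: group by record ID, newest timestamp first.
    records.sort(key=lambda r: (r[1], -r[0]))

    merged = []
    for _record_id, group in groupby(records, key=itemgetter(1)):
        newest = next(group)  # first in group is the newest version
        if newest[2] != DELETE:  # a DELETE surrogate drops the record
            merged.append(newest)
    return merged


original = [(1, "a", "v1"), (1, "b", "v1"), (1, "c", "v1")]
batch = [(2, "a", "v2"), (3, "b", DELETE), (2, "d", "v1")]

result = incremental_merge(original, [batch])
print(result)
# [(2, 'a', 'v2'), (1, 'c', 'v1'), (2, 'd', 'v1')]
```

Record `a` is updated, `b` is removed by its DELETE surrogate, `c` is carried over unchanged, and `d` is a new record.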

Job Invocation (slide 12)
- Jobs are usually triggered based on
  - Time
  - Data arrival
  - Service interface
  - An external event
- Production systems must monitor for successful completion
- Jobs can fail (just like tasks)
- Build for job atomicity and clean recovery
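One common way to get job atomicity is to write output to a temporary location and publish it with a single rename only when the job succeeds. The sketch below assumes a local filesystem standing in for HDFS; `run_atomically`, `good_job`, and `bad_job` are hypothetical names for the example.

```python
import os
import shutil
import tempfile


def run_atomically(job, final_dir):
    """Run job(tmp_dir) and publish its output to final_dir only on success."""
    parent = os.path.dirname(os.path.abspath(final_dir))
    # Keep the temporary directory next to the target so the rename
    # stays within one filesystem.
    tmp_dir = tempfile.mkdtemp(prefix=".tmp-job-", dir=parent)
    try:
        job(tmp_dir)
        os.rename(tmp_dir, final_dir)  # single publish step
        return True
    except Exception:
        # Clean recovery: discard partial output so a retry starts fresh.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        return False


def good_job(out_dir):
    with open(os.path.join(out_dir, "part-00000"), "w") as f:
        f.write("ok\n")


def bad_job(out_dir):
    with open(os.path.join(out_dir, "part-00000"), "w") as f:
        f.write("partial\n")
    raise RuntimeError("simulated failure")


base = tempfile.mkdtemp()
ok = run_atomically(good_job, os.path.join(base, "out-good"))
failed = run_atomically(bad_job, os.path.join(base, "out-bad"))
print(ok, failed, os.listdir(base))
```

Downstream consumers then only ever see complete output directories, and the boolean result gives a monitoring system something concrete to check.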

Workflow (slide 13)
- For complex chains of jobs use a workflow engine
- Most systems support different types of steps (e.g. Java MR, Pig, Hive, HDFS commands, shell scripts)
- Don’t write your own
  - Hadoop specific: Oozie, Cascading, Azkaban, …
  - General ETL: Spring Batch, Kettle, …

Really, don’t write your own! (slide 14)
“What was most interesting is that of the people using a homegrown system, only one said they were at all happy with it, and none would recommend their system.”
Kevin D. Peterson, http://kdpeterson.net/blog/2009/11/hadoop-workflow-tools-survey.html

Example: Ingestion (slide 15)
- Every N minutes process new data from /incoming
- Move data into a working directory based on timestamp
- Transform records
- Split input data by day into separate outputs

Example: Ingestion (slide 16)
- Every N minutes process new data from /incoming
- Move data into a working directory based on timestamp
- Transform records
- Split input data by day into separate outputs
What’s so special here?

Example: Ingestion (slide 17)
- Every N minutes process new data from /incoming
- Move data into a working directory based on timestamp
  - Allows jobs to run concurrently
  - Input is isolated; no duplicate processing
  - On failure, move back into /incoming
- Transform records
- Split input data by day into separate outputs
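The claim-and-release step can be sketched as follows. A local filesystem stands in for HDFS here (an assumption), and `claim_input`, `release_input`, and `process_once` are hypothetical helper names. Each run moves whatever is in the incoming directory into its own timestamped working directory, so concurrent runs never see the same files, and a failed run moves its files back for the next attempt.

```python
import os
import tempfile
import time


def claim_input(incoming, work_root, now=None):
    """Move all files in `incoming` into a new timestamped working directory."""
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(now))
    work_dir = os.path.join(work_root, stamp)
    os.makedirs(work_dir)  # raises if another run already took this timestamp
    for name in sorted(os.listdir(incoming)):
        os.rename(os.path.join(incoming, name), os.path.join(work_dir, name))
    return work_dir


def release_input(work_dir, incoming):
    """On failure, move the claimed files back so a later run retries them."""
    for name in os.listdir(work_dir):
        os.rename(os.path.join(work_dir, name), os.path.join(incoming, name))
    os.rmdir(work_dir)


def process_once(incoming, work_root, job, now=None):
    """Claim input, run the job, and release the input if the job fails."""
    work_dir = claim_input(incoming, work_root, now)
    try:
        job(work_dir)
        return work_dir
    except Exception:
        release_input(work_dir, incoming)
        return None


# Demo with throwaway directories and two dummy input files.
base = tempfile.mkdtemp()
incoming = os.path.join(base, "incoming")
work_root = os.path.join(base, "work")
os.makedirs(incoming)
os.makedirs(work_root)
for name in ("a.log", "b.log"):
    with open(os.path.join(incoming, name), "w") as f:
        f.write("line\n")


def failing_job(work_dir):
    raise RuntimeError("simulated failure")


def passing_job(work_dir):
    pass  # the real transform would go here


first = process_once(incoming, work_root, failing_job, now=0)
after_failure = sorted(os.listdir(incoming))
second = process_once(incoming, work_root, passing_job, now=60)
after_success = sorted(os.listdir(incoming))
print(first, after_failure, after_success)
```

A production version would likely add a unique suffix to the working directory name so that two runs started within the same second cannot collide.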

Example: Ingestion (slide 18)
- Every N minutes process new data from /incoming
- Move data into a working directory based on timestamp
- Transform records
  - Nothing special. This is your logic.
  - OK, your logic is special… you know what I mean.
- Split input data by day into separate outputs

Example: Ingestion (slide 19)
- Every N minutes process new data from /incoming
- Move data into a working directory based on timestamp
- Transform records
- Split input data by day into separate outputs
  - If we make no assumptions about input, we recover from previous failures
  - Daily rollover no longer matters
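Splitting by each record's own timestamp, rather than by the time the batch happens to run, is what makes rollover and retries harmless. A minimal sketch, assuming records are `(epoch_seconds, payload)` pairs and days are computed in UTC; the function name and record layout are hypothetical:

```python
from collections import defaultdict
from datetime import datetime, timezone


def split_by_day(records):
    """Group payloads by the UTC day of each record's own timestamp."""
    outputs = defaultdict(list)
    for ts, payload in records:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
        outputs[day].append(payload)
    return dict(outputs)


# One batch that straddles midnight and also carries a late record,
# for example one returned to /incoming after an earlier failed run.
batch = [(86399, "a"), (86400, "b"), (10, "late")]

by_day = split_by_day(batch)
print(by_day)
# {'1970-01-01': ['a', 'late'], '1970-01-02': ['b']}
```

In a MapReduce job the same routing would typically be done by writing each record to an output named after its day, so a batch can contain any mix of days.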

Recap (slide 20)
- Data streams are everywhere; use Flume
- For bulk relational database import and export, use Sqoop
- Consider asynchronous updates to large data stores
- Incremental merges are possible
- Use a workflow system for complex jobs
- Job atomicity is critical
- ETL best practices still apply

Questions?
- email: esammer@cloudera.com
- twitter: @esammer / @cloudera
- freenode: #cloudera / #hadoop
- http://www.cloudera.com
