Hadoop - Integration Patterns and Practices

Hadoop Summit 2010 - Application Track
Eric Sammer, Cloudera

    1. Hadoop Integration Patterns and Practices
       Eric Sammer, Cloudera
    2. Session Agenda
       • The Problem
       • Data Streams
       • Databases
       • Job Invocation and Workflow
    3. The Problem
       • Tons of data (otherwise you’re in the wrong room)
       • Tons of existing systems: RDBMS, caches, EDW, messaging, reporting, scheduling and job control, management and monitoring
    4. The Problem (continued)
       • We’ll focus on streams, databases, and job control
    5. File Ingestion
       • One of the most common integration cases
       • Log data, transactional data, sensor data, user activity, …
       • Two flavors:
         - Streams: boundaries not important
         - Atomic units: boundaries indicate something
    6. Stream Integration
       • For data streams, use Flume
       • Agents collect data from a source; collectors aggregate agent data
       • Multiple sources: text, “tail”, command-line executables
       • Multiple sinks, including HDFS
       • Configurable durability (speed vs. safety)
       • Support for minor record transformation, output formats, multiple tiers of collection, dynamic configuration, and more
       • Open source
    7. Other Uses of Streams
       • Streams are more than just logs
       • JMS, AMQP messages: wire tap and send to Flume (a sketch follows this slide)
       • Turn incremental updates into stream data to avoid a DBMS “middleman” (or send to both)
       • Many existing problems can be turned into asynchronous streams
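    The wire-tap idea on slide 7 can be as small as a JMS listener that copies each message onto a local spool file that a Flume agent is already tailing. A minimal sketch, not Flume’s own API; the broker setup, queue, and spool path are assumptions for illustration only:

        import javax.jms.JMSException;
        import javax.jms.Message;
        import javax.jms.MessageListener;
        import javax.jms.TextMessage;
        import java.io.FileWriter;
        import java.io.PrintWriter;

        // Wire tap: copy every JMS text message into a local spool file that a
        // Flume agent tails. The spool path is an illustrative assumption.
        public class JmsWireTap implements MessageListener {
          private final PrintWriter spool;

          public JmsWireTap(String spoolPath) throws Exception {
            // Append so the tailed file keeps growing; autoflush each line.
            this.spool = new PrintWriter(new FileWriter(spoolPath, true), true);
          }

          @Override
          public void onMessage(Message message) {
            try {
              if (message instanceof TextMessage) {
                // One record per line; the agent treats each line as an event.
                spool.println(((TextMessage) message).getText());
              }
            } catch (JMSException e) {
              throw new RuntimeException("Failed to tap message", e);
            }
          }
        }

    Register it on an existing MessageConsumer with consumer.setMessageListener(new JmsWireTap("/var/spool/wiretap.log")); the original consumer of the queue is untouched.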
    8. One option…
    9. Better: Async Writer
    10. Relational Databases
        • Basic approach: queries (out), inserts (in)
          - Works, but slow
          - Beware the Hadoop -> RDBMS DDoS attack
          - Manage transactions on inserts (a batching sketch follows this slide)
        • Smarter: lower-level export / import tools
          - Go to text formats
          - Think in batches (MR to convert text <-> SequenceFiles)
        • Use Sqoop!
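    If you do push results back over JDBC (the slide’s “basic approach”), batch the inserts and commit explicitly rather than letting hundreds of reducers autocommit one row at a time, which is exactly the DDoS the slide warns about. A minimal sketch; the connection URL, table, and batch size are made up:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;

        // Batched, explicitly committed inserts. Tune the batch size and the
        // number of concurrent writers so the database is not overwhelmed.
        public class BatchLoader {
          private static final int BATCH_SIZE = 1000;

          public static void load(Iterable<String[]> rows) throws Exception {
            Connection conn = DriverManager.getConnection(
                "jdbc:mysql://dbhost/warehouse", "loader", "secret");
            conn.setAutoCommit(false);  // manage the transaction ourselves
            PreparedStatement stmt = conn.prepareStatement(
                "INSERT INTO daily_summary (metric, value) VALUES (?, ?)");
            int pending = 0;
            for (String[] row : rows) {
              stmt.setString(1, row[0]);
              stmt.setString(2, row[1]);
              stmt.addBatch();
              if (++pending % BATCH_SIZE == 0) {
                stmt.executeBatch();
                conn.commit();          // commit in batches, not per row
              }
            }
            stmt.executeBatch();
            conn.commit();
            stmt.close();
            conn.close();
          }
        }

    For anything larger than a summary table, the slide’s real advice stands: go to text or SequenceFiles and let Sqoop handle the transfer.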
    11. Pattern: Incremental Merge
        • Batch incoming edits
        • Perform an MR job to apply updates
          - Input: original dataset + incoming batch(es)
          - Group by record ID
          - Secondary sort on timestamp, descending
          - Reducer selects the newest record or merges changes over time (sketch below)
        • Represent delete operations using a DELETE surrogate record, i.e. <timestamp>, <record id>, DELETE
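    A reducer for the “select the newest record” case might look like the sketch below. It assumes the driver has already set up the secondary sort described on the slide: a composite key of record ID and timestamp, partitioned and grouped on record ID, with timestamps sorted descending so the newest value arrives first. The Text key/value types are placeholders.

        import java.io.IOException;
        import java.util.Iterator;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Reducer;

        // Incremental-merge reducer: values for one record ID arrive newest-first
        // (secondary sort on timestamp, descending, configured in the driver).
        // Keep the newest record; emit nothing if the newest action is a delete.
        public class IncrementalMergeReducer extends Reducer<Text, Text, Text, Text> {
          private static final String DELETE_MARKER = "DELETE";

          @Override
          protected void reduce(Text recordId, Iterable<Text> values, Context context)
              throws IOException, InterruptedException {
            Iterator<Text> it = values.iterator();
            if (!it.hasNext()) {
              return;
            }
            Text newest = it.next();  // first value is the newest by construction
            if (!DELETE_MARKER.equals(newest.toString())) {
              context.write(recordId, newest);
            }
          }
        }

    Merging changes over time is the same loop, folding each older value into the newer one instead of stopping at the first.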
    12. Job Invocation
        • Jobs are usually triggered based on time, data arrival, a service interface, or an external event
        • Production systems must monitor for successful completion (driver sketch below)
        • Jobs can fail (just like tasks)
        • Build for job atomicity and clean recovery
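    Monitoring for successful completion starts in the driver: waitForCompletion reports whether the job succeeded, and the process exit code is what cron, a workflow engine, or a wrapper script actually sees. A minimal skeleton, with the mapper, reducer, and paths omitted:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.Job;

        // Propagate job success or failure as the process exit code so whatever
        // invoked the job (cron, a workflow engine, a wrapper script) can react.
        public class IngestDriver {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "ingest");
            // ... set jar, mapper, reducer, and input/output paths here ...
            boolean succeeded = job.waitForCompletion(true);
            System.exit(succeeded ? 0 : 1);  // nonzero exit = failed job
          }
        }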
    13. Workflow
        • For complex chains of jobs, use a workflow engine
        • Most systems support different types of steps (e.g. Java MR, Pig, Hive, HDFS commands, shell scripts)
        • Don’t write your own
          - Hadoop-specific: Oozie, Cascading, Azkaban, …
          - General ETL: Spring Batch, Kettle, …
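    As one example, Oozie workflows can be submitted and watched from Java as well as from the CLI. A rough sketch along the lines of the Oozie client documentation; the host names, ports, and HDFS paths are placeholders:

        import java.util.Properties;
        import org.apache.oozie.client.OozieClient;
        import org.apache.oozie.client.WorkflowJob;

        // Submit a packaged workflow application to Oozie and poll until it
        // leaves the RUNNING state.
        public class SubmitWorkflow {
          public static void main(String[] args) throws Exception {
            OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");
            Properties conf = oozie.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/apps/ingest-wf");
            conf.setProperty("jobTracker", "jobtracker:8021");
            conf.setProperty("nameNode", "hdfs://namenode:8020");

            String jobId = oozie.run(conf);
            while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
              Thread.sleep(10 * 1000);  // poll every ten seconds
            }
            System.out.println("Workflow " + jobId + " finished: "
                + oozie.getJobInfo(jobId).getStatus());
          }
        }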
    14. Really, don’t write your own!
        “What was most interesting is that of the people using a homegrown system, only one said they were at all happy with it, and none would recommend their system.”
        Kevin D. Peterson
        http://kdpeterson.net/blog/2009/11/hadoop-workflow-tools-survey.html
    15. Example: Ingestion
        Every N minutes:
        • Process new data from /incoming
        • Move data into a working directory based on timestamp
        • Transform records
        • Split input data by day into separate outputs
    16. Example: Ingestion
        • Same steps. What’s so special here?
    17. Example: Ingestion (moving into a working directory)
        • Allows jobs to run concurrently
        • Input is isolated; no duplicate processing
        • On failure, move back into /incoming (rename sketch after slide 19)
    18. Example: Ingestion (transforming records)
        • Nothing special. This is your logic. OK, your logic is special… you know what I mean.
    19. Example: Ingestion (splitting by day)
        • If we make no assumptions about input, we recover from previous failures
        • Daily rollover no longer matters
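    The “move into a working directory” step in slides 15-19 is just a few FileSystem calls; because rename is a metadata-only operation in HDFS, claiming a batch this way is cheap and leaves /incoming free for new arrivals. A sketch with made-up paths:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Claim whatever is in /incoming by renaming it into a timestamped working
        // directory. The job then reads only the working directory, so files that
        // arrive in /incoming afterwards are untouched and nothing is processed
        // twice. If the job fails, rename the files back into /incoming and retry.
        public class ClaimIncoming {
          public static Path claim(Configuration conf) throws Exception {
            FileSystem fs = FileSystem.get(conf);
            Path incoming = new Path("/incoming");
            Path working = new Path("/working/" + System.currentTimeMillis());
            fs.mkdirs(working);
            for (FileStatus file : fs.listStatus(incoming)) {
              // rename() is a cheap metadata operation in HDFS
              fs.rename(file.getPath(), new Path(working, file.getPath().getName()));
            }
            return working;  // run the transform/split job over this directory
          }
        }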
    20. Recap
        • Data streams are everywhere; use Flume
        • For bulk relational database import and export, use Sqoop
        • Consider asynchronous updates to large data stores
        • Incremental merges are possible
        • Use a workflow system for complex jobs
        • Job atomicity is critical
        • ETL best practices still apply
    21. Questions?
        email: esammer@cloudera.com
        twitter: @esammer / @cloudera
        freenode: #cloudera / #hadoop
        http://www.cloudera.com
