Hadoop - Integration Patterns and Practices (Hadoop Summit 2010)

Hadoop Summit 2010 - Application Track
Hadoop - Integration Patterns and Practices
Eric Sammer, Cloudera

Transcript

  • 1. Hadoop Integration Patterns and Practices
    Eric Sammer
    Cloudera
  • 2. The Problem
    Data Streams
    Databases
    Job Invocation and Workflow
    2
    Session Agenda
  • 3. Tons of data (otherwise you’re in the wrong room)
    Tons of existing systems
    RDBMS
    Caches
    EDW
    Messaging
    Reporting
    Scheduling and job control
    Management and Monitoring
    3
    The Problem
  • 4. Tons of data (otherwise you’re in the wrong room)
    Tons of existing systems
    RDBMS
    Caches
    EDW
    Messaging
    Reporting
    Scheduling and job control
    Management and Monitoring
    We’ll focus on streams, databases, and job control
    4
    The Problem
  • 5. One of the most common integration cases
    Log data, transactional data, sensor data, user activity, …
    Two flavors
    Streams: record boundaries are not important
    Atomic units: boundaries indicate something (e.g., a complete file)
    5
    File Ingestion
  • 6. For data streams use Flume
    Agents collect data from a source
    Collectors aggregate agent data
    Multiple sources: Text, “Tail”, command-line executables
    Multiple sinks, including HDFS
    Configurable durability (speed vs. safety)
    Support for minor record transformation, output formats, multiple tiers of collection, dynamic configuration, and more.
    Open source
    6
    Stream Integration
  • 7. Other uses of streams
    Streams are more than just logs
    JMS, AMQP messages: wire tap and send to Flume (see the sketch after this slide)
    Turn incremental updates into stream data to avoid DBMS “middleman” (or send to both)
    Many existing problems can be turned into asynchronous streams
    7
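
A minimal sketch of the wire-tap idea on the slide above, assuming a JMS broker (ActiveMQ here) and a local spool file that a Flume “tail” source follows; the broker URL, topic name, and file path are illustrative, not from the talk.

```java
import java.io.FileWriter;
import java.io.PrintWriter;

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

public class JmsWiretap {
    public static void main(String[] args) throws Exception {
        // Hypothetical broker URL, topic, and spool file -- adjust for your environment.
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        final PrintWriter spool =
            new PrintWriter(new FileWriter("/var/spool/app/events.log", true), true);

        Connection conn = factory.createConnection();
        Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);

        // Wire tap: an extra consumer on the same topic copies every message
        // to a local file that a Flume "tail" source picks up and ships to HDFS.
        session.createConsumer(session.createTopic("events"))
               .setMessageListener(new MessageListener() {
                   public void onMessage(Message msg) {
                       try {
                           if (msg instanceof TextMessage) {
                               spool.println(((TextMessage) msg).getText());
                           }
                       } catch (Exception e) {
                           e.printStackTrace(); // a real tap would count and report drops
                       }
                   }
               });

        conn.start();
        Thread.sleep(Long.MAX_VALUE); // keep the tap running
    }
}
```

The existing messaging path is untouched; the tap is just one more consumer on the topic, feeding the stream pipeline in parallel.
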
  • 8. One option…
    8
  • 9. Better: Async Writer
    9
  • 10. Basic approach: queries (out), inserts (in)
    Works, but slow
    Beware the Hadoop -> RDBMS DDoS attack (many parallel tasks hammering one database)
    Manage transactions on inserts
    Smarter: lower-level export/import tools
    Go to text formats
    Think in batches (MR to convert text <-> SequenceFiles; see the sketch after this slide)
    Use Sqoop!
    10
    Relational Databases
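
A minimal sketch of the “MR to convert text <-> SequenceFiles” point above, assuming the org.apache.hadoop.mapreduce API: a map-only identity job that rewrites text dumps as SequenceFiles. The class name and paths are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class TextToSequenceFile {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "text-to-seqfile");
        job.setJarByClass(TextToSequenceFile.class);

        job.setMapperClass(Mapper.class);   // identity mapper: pass (offset, line) through untouched
        job.setNumReduceTasks(0);           // map-only; no shuffle or sort needed

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. text exported from the RDBMS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // SequenceFile copy for later jobs
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
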
  • 11. Batch incoming edits
    Perform an MR job to apply updates (see the sketch after this slide)
    Input: Original dataset + incoming batch(es)
    Group by record ID
    Secondary sort on timestamp descending
    Reducer selects the newest record or merges changes over time
    Represent delete operations using a DELETE surrogate record
    i.e. <timestamp>, <record id>, DELETE
    11
    Pattern: Incremental Merge
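
A minimal sketch of the merge described on this slide, assuming the org.apache.hadoop.mapreduce API and comma-separated records of the form <timestamp>,<record id>,<payload or DELETE> (the record format and class names are illustrative). The composite key, partitioner, and grouping comparator give the group-by-record-ID with a descending secondary sort on timestamp, so the reducer sees the newest edit first and can drop records whose latest edit is the DELETE surrogate.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IncrementalMerge {

    /** Composite key: record id (grouping) plus timestamp (secondary sort, newest first). */
    public static class MergeKey implements WritableComparable<MergeKey> {
        Text id = new Text();
        LongWritable ts = new LongWritable();
        public void write(DataOutput out) throws IOException { id.write(out); ts.write(out); }
        public void readFields(DataInput in) throws IOException { id.readFields(in); ts.readFields(in); }
        public int compareTo(MergeKey o) {
            int c = id.compareTo(o.id);
            return c != 0 ? c : -ts.compareTo(o.ts);   // descending timestamp within an id
        }
    }

    /** Partition on record id only, so every edit for an id reaches the same reducer. */
    public static class IdPartitioner extends Partitioner<MergeKey, Text> {
        public int getPartition(MergeKey k, Text v, int numPartitions) {
            return (k.id.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    /** Group on record id only, so one reduce() call sees every edit for that id. */
    public static class IdGroupingComparator extends WritableComparator {
        public IdGroupingComparator() { super(MergeKey.class, true); }
        public int compare(WritableComparable a, WritableComparable b) {
            return ((MergeKey) a).id.compareTo(((MergeKey) b).id);
        }
    }

    public static class ParseMapper extends Mapper<LongWritable, Text, MergeKey, Text> {
        private final MergeKey key = new MergeKey();
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // <timestamp>,<record id>,<payload or DELETE>
            String[] f = line.toString().split(",", 3);
            key.ts.set(Long.parseLong(f[0]));
            key.id.set(f[1]);
            ctx.write(key, line);
        }
    }

    public static class NewestWinsReducer extends Reducer<MergeKey, Text, NullWritable, Text> {
        protected void reduce(MergeKey k, Iterable<Text> edits, Context ctx)
                throws IOException, InterruptedException {
            // The secondary sort guarantees the first value is the newest edit for this id.
            Text newest = edits.iterator().next();
            if (!newest.toString().endsWith(",DELETE")) {   // drop ids whose latest edit is a delete
                ctx.write(NullWritable.get(), newest);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "incremental-merge");
        job.setJarByClass(IncrementalMerge.class);
        job.setMapperClass(ParseMapper.class);
        job.setReducerClass(NewestWinsReducer.class);
        job.setPartitionerClass(IdPartitioner.class);
        job.setGroupingComparatorClass(IdGroupingComparator.class);
        job.setMapOutputKeyClass(MergeKey.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // original dataset + incoming batches
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // merged snapshot
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
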
  • 12. Jobs are usually triggered based on
    Time
    Data arrival
    Service Interface
    An external event
    Production systems must monitor for successful completion
    Jobs can fail (just like tasks)
    Build for job atomicity and clean recovery (see the sketch after this slide)
    12
    Job Invocation
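
A minimal sketch of the atomicity point above, with illustrative paths: the job writes into a temporary working directory, and the output is only renamed into place when the job reports success; on failure, the partial output is discarded and the program exits non-zero so whatever invoked it can see the failure.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AtomicJobRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path work = new Path("/data/reports/_work");        // hypothetical temporary output
        Path published = new Path("/data/reports/latest");  // hypothetical final location

        Job job = Job.getInstance(conf, "nightly-report");
        job.setJarByClass(AtomicJobRunner.class);
        // ... set your mapper, reducer, and formats here; the defaults give an identity copy ...
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, work);

        if (job.waitForCompletion(true)) {
            fs.delete(published, true);   // replace any previous result
            fs.rename(work, published);   // publish the new result in one rename
        } else {
            fs.delete(work, true);        // clean recovery: discard partial output
            System.exit(1);               // non-zero exit so the scheduler or monitor sees the failure
        }
    }
}
```
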
  • 13. For complex chains of jobs use a workflow engine
    Most systems support different types of steps (e.g. Java MR, Pig, Hive, HDFS commands, shell scripts)
    Don’t write your own
    Hadoop specific: Oozie, Cascading, Azkaban, …
    General ETL: Spring Batch, Kettle, …
    13
    Workflow
  • 14. “What was most interesting is that of the people using a homegrown system, only one said they were at all happy with it, and none would recommend their system.”
    Kevin D. Peterson
    http://kdpeterson.net/blog/2009/11/hadoop-workflow-tools-survey.html
    14
    Really, don’t write your own!
  • 15. Every N minutes process new data from /incoming
    Move data into a working directory based on timestamp
    Transform records
    Split input data by day into separate outputs
    15
    Example: Ingestion
  • 16. Every N minutes process new data from /incoming
    Move data into a working directory based on timestamp
    Transform records
    Split input data by day into separate outputs
    What’s so special here?
    16
    Example: Ingestion
  • 17. Every N minutes process new data from /incoming
    Move data into a working directory based on timestamp (see the sketch after this slide)
    Allows jobs to run concurrently
    Input is isolated; no duplicate processing
    On failure, move back into /incoming
    Transform records
    Split input data by day into separate outputs
    17
    Example: Ingestion
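
A minimal sketch of the claim-and-isolate step above, with illustrative paths: rename whatever is currently in /incoming into a per-run working directory (rename is atomic in HDFS, so files that land afterwards stay in /incoming for the next run), run the job against that directory, and on failure rename the files back.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClaimIncoming {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path incoming = new Path("/incoming");   // where upstream systems drop files
        String stamp = new SimpleDateFormat("yyyyMMdd-HHmmss").format(new Date());
        Path work = new Path("/work/" + stamp);  // hypothetical per-run working directory
        fs.mkdirs(work);

        // Claim whatever has arrived so far; files landing after this point
        // stay in /incoming and are picked up by the next run.
        for (FileStatus f : fs.listStatus(incoming)) {
            fs.rename(f.getPath(), new Path(work, f.getPath().getName()));
        }

        // ... run the transform job over `work` ...
        // On success: move or delete `work`. On failure: rename the files
        // back into /incoming so the next run reprocesses them.
    }
}
```
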
  • 18. Every N minutes process new data from /incoming
    Move data into a working directory based on timestamp
    Transform records
    Nothing special. This is your logic. OK, your logic is special… you know what I mean.
    Split input data by day into separate outputs
    18
    Example: Ingestion
  • 19. Every N minutes process new data from /incoming
    Move data into a working directory based on timestamp
    Transform records
    Split input data by day into separate outputs (see the sketch after this slide)
    If we make no assumptions about input, we recover from previous failures
    Daily rollover no longer matters
    19
    Example: Ingestion
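
A minimal sketch of the split-by-day step, assuming Hadoop's MultipleOutputs helper and records whose first comma-separated field is an epoch-millisecond timestamp (the record format and directory layout are illustrative): a map-only job writes each record under a directory named for its day, so the job itself makes no assumptions about which days appear in the input.

```java
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DailySplit {

    public static class DailySplitMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        private MultipleOutputs<NullWritable, Text> out;
        private final SimpleDateFormat day = new SimpleDateFormat("yyyy-MM-dd");

        protected void setup(Context ctx) {
            out = new MultipleOutputs<NullWritable, Text>(ctx);
        }

        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumes records of the form <epoch millis>,<rest of record>
            String[] f = line.toString().split(",", 2);
            String d = day.format(new Date(Long.parseLong(f[0])));
            out.write(NullWritable.get(), line, d + "/part");  // e.g. 2010-06-29/part-m-00000
        }

        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            out.close();
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "daily-split");
        job.setJarByClass(DailySplit.class);
        job.setMapperClass(DailySplitMapper.class);
        job.setNumReduceTasks(0);                                    // map-only split
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class); // skip empty default outputs
        FileInputFormat.addInputPath(job, new Path(args[0]));       // the working directory from above
        FileOutputFormat.setOutputPath(job, new Path(args[1]));     // parent directory for per-day outputs
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```
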
  • 20. Data streams are everywhere; use Flume
    For bulk relational database import and export, use Sqoop
    Consider asynchronous updates to large data stores
    Incremental merges are possible
    Use a workflow system for complex jobs
    Job atomicity is critical
    ETL best practices still apply
    20
    Recap
  • 21. Questions?
    email: esammer@cloudera.com
    twitter: @esammer / @cloudera
    freenode: #cloudera / #hadoop
    http://www.cloudera.com