Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume


Apache Flume is a highly scalable, distributed, fault-tolerant data collection framework for Apache Hadoop and Apache HBase. Flume is designed to transfer massive volumes of event data into HDFS or HBase in a highly scalable way. Flume is declarative and easy to configure, and can easily be deployed to a large number of machines using configuration management systems like Puppet or Cloudera Manager. In this talk, we will cover the basic components of Flume and how to configure and deploy it. We will also briefly talk about the metrics Flume exposes, and the various ways in which these can be collected.

Apache Flume is a Top Level Project (TLP) at the Apache Software Foundation, and has made several releases since entering incubation in June 2011. Flume graduated to become a TLP in July 2012. The current release of Flume is Flume 1.3.1.

Presenter: Hari Shreedharan, PMC Member and Committer, Apache Flume, Software Engineer, Cloudera

Published in: Technology

  • If you have a server farm that emits log data at GB/min, you could hack together a very simple aggregator, but chances are it won't provide reliability, manageability, or scalability. This is why many use Flume: an open-source, high-performing, reliable, and scalable out-of-the-box aggregator for streaming data. You don't want to risk outages, or failing scripts overloading the spindles. Flume is declarative in that you don't have to write code. Flume is extensible in that you can write your own components on top of Flume to modify its out-of-the-box behavior and feature set. Flume guarantees one-hop delivery; if you want end-to-end reliability, use the file channel, which we'll talk about later. There are no acknowledgements from the terminal destination to the client, because the client would then be forced to hold all events until the ack is received. You want these systems to have a small disk footprint. Set up redundant flows if you're concerned about hardware failures; Flume doesn't support splicing or RAID out of the box.
  • With Flume NG, there is built-in buffering capacity at every hop, so data and events are preserved. With regard to single-hop reliability, the degree of reliability depends on the channel: memory channel and recoverable memory channel are best-effort, whereas file channel and JDBC channel are reliable because they write to disk. OG: a garden hose connected from faucet to sprinkler; the flow is contiguous except when you pinch the hose in the middle. NG: the hose connects multiple water tanks (i.e. channels/passive buffers) between faucet and sprinkler; if you pinch the hose, the flow doesn't stop. This gives: 1. decoupled impedance between producers and consumers; 2. dynamic routing capabilities (can shut down one tank to re-route traffic); 3. unrestricted capacity (a consumer's input is no longer restricted by the producer's output, as one tank can feed into multiple downstream tanks).
  • Flume flow. The simplest individual component is the agent; agents can talk to each other and to HDFS, HBase, etc. Clients talk to agents.
  • Clientless operation: the agent loads data itself using specialized sources. An agent is a collection of sources, channels, and sinks. A source captures events from an external system; only the exec source can generate events on its own. A channel is the buffer between source and sink. A sink is responsible for draining the channel out to another agent or to a terminal point like HDFS. You can't have a source with no place to write events.
  • In the upper diagram, the flow across the 3 agents is healthy. In the lower diagram, a sink fails to communicate with the downstream source, so its reservoir (the channel) fills up and the backlog cascades upstream, buffering against downstream hardware failures. No events are lost until all channels in that flow fill up, at which point the sources report failure to the client. Steady-state flow is restored when the link becomes active again.
  • WHAT MAKES IT ACTIVE? Src2 is inactive because it's not in the active set. Define multiple sources for the same agent as a space-separated list. Fan-out: a source can write to multiple channels. Multiple sinks can drain the same channel for increased throughput. A channel is implemented as a queue: the source appends data to the end of the queue and the sink drains from the head. The config file is read at startup and checked for changes every 30 seconds, so you don't have to restart agents when the config file changes. What use case would need multiple sinks draining the same channel? Sources are multi-threaded and greedily implemented (for improved throughput), while sinks are single-threaded and have a fixed capacity for what they can drain, so there is an impedance mismatch between sources and sinks. Sources expand to accommodate load and bursty traffic so that downstream isn't affected; sinks drain steadily. Add another sink to the same channel to meet the steady-state requirement.
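    The last point, adding a second sink to the same channel, could look roughly like the fragment below. This is a sketch only: the agent and component names (agent1, src1, ch1, sink1, sink2), the collector host names, and the port numbers are placeholders and do not come from the talk.

    ```properties
    # Hypothetical names, hosts and ports; only the wiring matters here.
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1 sink2

    agent1.sources.src1.type = avro
    agent1.sources.src1.bind = 0.0.0.0
    agent1.sources.src1.port = 4141
    agent1.sources.src1.channels = ch1

    agent1.channels.ch1.type = file

    # Both sinks name the same channel, so two drain loops
    # take events from ch1 instead of one.
    agent1.sinks.sink1.type = avro
    agent1.sinks.sink1.channel = ch1
    agent1.sinks.sink1.hostname = collector1.example.com
    agent1.sinks.sink1.port = 4545

    agent1.sinks.sink2.type = avro
    agent1.sinks.sink2.channel = ch1
    agent1.sinks.sink2.hostname = collector2.example.com
    agent1.sinks.sink2.port = 4545
    ```

    Each event is taken by only one of the two sinks, so this raises drain throughput rather than duplicating data.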
  • Four tier-1 agents drain into one tier-2 agent, which then distributes its load over two tier-3 agents. You can have a single config file for the 3 agents, pass it around your deployment, and you're done. At any node, the ingest rate must equal the exit rate.
  • Avro is the standard. Channels support transactions. Flume sources: avro, exec, syslog, spooling directory, http, embedded agent, JMS.
  • Transactional semantics for storing data: if a sink takes data out, it commits only once the source on the next hop has committed that data.
  • Use cases: you want the same data to go into HDFS and into HBase; priority-based routing; any contextual routing.
  • JMS – client talks to broker, which handles failures
  • With Avro, once the source commits the events to its channel via a put transaction, it sends a success message to the previous hop, and the sink on the previous hop deletes those events once it commits its take transaction.
  • The exec source takes a command as a config parameter and executes it; whatever the command writes to stdout is written to the channel as events. If the channel is full, data is dropped and lost. During file rotation, if an event fails, the data is lost.
  • An interceptor is a transparent component applied to the flow. It can filter events and make minor modifications to them, but it cannot multiply events; e.g. it can't decompress an event, because batching and compression are framework-level concerns that Flume should address. The number of events an interceptor emits cannot exceed the number that came in: you can drop events but can't add them (which could exceed the transaction capacity).
  • Interceptor never returns null b/c it’s passed to next interceptor or channel
  • File channel is the recommended channel: it is reliable (no data loss in an outage), scales linearly with additional spindles (more disks, better performance), and has better durability guarantees than memory channel. Memory channel can't scale to large capacity because it is bound by memory. JDBC channel is not recommended due to slow performance (don't mention deadlock).
  • It is recommended to use three disks: one for checkpointing and two for data. Keep-alive: wait 3 seconds for blocks to free up; usually only relevant in high-stress environments.
  • Three files: the checkpoint file (memory-mapped by the Flume Event Queue), log1, and log2. Checkpoint file = FEQ. If you lose the FEQ, you don't lose data, since it's in the log files, but it takes a long time to remap the data into memory. The channel's main operations are done on top of the Flume Event Queue, a queue of pointers to different locations in different log files. The FEQ is the queue of active data within the file channel and holds reference counts for the files. Each log file contains its own metadata; it is a write-ahead log, not a direct serialization of the data. The FEQ doesn't store data, so the size of your events doesn't impact the FEQ.
  • Polling semantics: the sink continually polls to see if events are available. The AsyncHBase sink is recommended over the HBase sink (synchronous HBase API) for better performance. The null sink drops events on the floor.
  • Groups active sinks together and then adds a processor. Load_balance ships with round-robin and random distribution and with back-off, but you can write your own selection algorithm and plug it into the sink processor. Failover supports round robin, random, and back-off (it won't try a failed sink until the back-off period is over).
  • Interface that exposes it. isActive can be used for testing. This is a way of getting data into Flume. The client can talk to Flume's Avro/Thrift source.

    1. 1. Large Scale Data Ingest Using Apache Flume • Hari Shreedharan, Software Engineer, Cloudera; Apache Flume PMC member / committer • February 2013
    2. 2. Why event streaming with Flume is awesome • Couldn’t I just do this with a shell script? • What year is this, 2001? There is a better way! • Scalable collection, aggregation of event data (i.e. logs) • Dynamic, contextual event routing • Low latency, high throughput • Declarative configuration • Productive out of the box, yet powerfully extensible • Open source software
    3. 3. Lessons learned from Flume OG • Hard to get predictable performance without decoupling tier impedance • Hard to scale out without multiple threads at the sink level • A lot of functionality doesn’t work well as a decorator • People need a system that keeps the data flowing when there is a network partition (or downed host in the critical path)
    4. 4. Inside a Flume NG agent
    5. 5. Topology: Connecting agents together • [Client]+ → Agent [→ Agent]* → Destination
    6. 6. Basic Concepts • Client: Log4j Appender, Client SDK, Clientless Operation • Agent: Source, Channel, Sink • Valid Configuration: must have at least one Channel; must have at least one source or sink; any number of sources; any number of channels; any number of sinks
    7. 7. Concepts in Action • Source: Puts events into the Channel • Sink: Drains events from the Channel • Channel: Stores the events until drained
    8. 8. Flow Reliability • Reliability based on: • Transactional Exchange between Agents • Persistence Characteristics of Channels in the Flow • Also Available: • Built-in Load balancing Support • Built-in Failover Support
    9. 9. Reliability • Transactional guarantees from channel • External client needs to handle retry • Built-in avro-client to read streams • Avro source for multi-hop flows • Use Flume Client SDK for customization
    10. 10. Configuration Tree
    11. 11. Hierarchical Namespace • agent1.properties:
          # Active components
          agent1.sources = src1
          agent1.channels = ch1
          agent1.sinks = sink1
          # Define and configure src1
          agent1.sources.src1.type = netcat
          agent1.sources.src1.channels = ch1
          agent1.sources.src1.bind =
          agent1.sources.src1.port = 10112
          # Define and configure sink1
          agent1.sinks.sink1.type = logger
          agent1.sinks.sink1.channel = ch1
          # Define and configure ch1
          agent1.channels.ch1.type = memory
    12. 12. Basic Configuration Rules
        • Only the named agents’ configuration loaded
        • Only active components’ configuration loaded within the agents’ configuration
        • Every Agent must have at least one channel
        • Every Source must have at least one channel
        • Every Sink must have exactly one channel
        • Every component must have a type
          # Active components
          agent1.sources = src1
          agent1.channels = ch1
          agent1.sinks = sink1
          # Define and configure src1
          agent1.sources.src1.type = netcat
          agent1.sources.src1.channels = ch1
          agent1.sources.src1.bind =
          agent1.sources.src1.port = 10112
          # Define and configure sink1
          agent1.sinks.sink1.type = logger
          agent1.sinks.sink1.channel = ch1
          # Define and configure ch1
          agent1.channels.ch1.type = memory
          # Some other Agents’ configuration
          agent2.sources = src1 src2
    13. 13. Deployment • Steady state: inflow == outflow • 4 Tier 1 agents at 100 events/sec (batch-size) → 1 Tier 2 agent at 400 eps
    14. 14. Source • Event Driven • Supports Batch Processing • Source Types: • AVRO – RPC source – other Flume agents can send data to this source port • THRIFT – RPC source (available in next Flume release) • SPOOLDIR – pick up rotated log files • HTTP – post to a REST service (extensible) • JMS – ingest from Java Message Service • SYSLOGTCP, SYSLOGUDP • NETCAT • EXEC
    15. 15. How Does a Source Work? • Reads data from external clients/other sinks • Stores events in configured channel(s) • Asynchronous to the other end of channel • Transactional semantics for storing data
    16. 16. [Diagram: the source begins a transaction, puts a batch of events on the channel, then commits the transaction]
    17. 17. Source Features • Event driven or Pollable • Supports Batching • Fanout of flow • Interceptors
    18. 18. [Diagram: Source → Channel Processor (interceptors, channel selector; transaction handling and fanout processing) → Channel1 (Flow 1) and Channel2 (Flow 2)]
    19. 19. Channel Selector
        • Replicating selector: replicate events to all channels
        • Multiplexing selector: contextual routing
          agent1.sources.sr1.selector.type = multiplexing
          agent1.sources.sr1.selector.mapping.foo = channel1
          agent1.sources.sr1.selector.mapping.bar = channel2
          agent1.sources.sr1.selector.default = channel1
          agent1.sources.sr1.selector.header = yourHeader
    20. 20. Built-in Sources in Flume • Asynchronous sources • Clients don’t handle failures • Exec, Syslog • Synchronous sources • Client handles failures • Avro, Scribe, HTTP, JMS • Flume 0.9x Source • AvroLegacy, ThriftLegacy
    21. 21. RPC Sources – Avro and Thrift
        • Reading events from external client
        • Only TCP
        • Connecting two agents in a distributed flow
        • Based on IPC thus failure notification is enabled
        • Configuration:
          agent_foo.sources.rpcsource-1.type = avro/thrift
          agent_foo.sources.rpcsource-1.bind = <host>
          agent_foo.sources.rpcsource-1.port = <port>
    22. 22. Spooling Directory Source
        • Parses rotated log files out of a “spool” directory
        • Watches for new files, renames or deletes them when done
        • The files must be immutable before being placed into the watched directory
          agent.sources.spool.type = spooldir
          agent.sources.spool.spoolDir = /var/log/spooled-files
          agent.sources.spool.deletePolicy = never OR immediate
    23. 23. HTTP Source
        • Runs a web server that handles HTTP requests
        • The handler is pluggable (can roll your own)
        • Out of the box, an HTTP client posts a JSON array of events to the server. Server parses the events and puts them on the channel.
          agent.sources.http.type = http
          agent.sources.http.port = 8081
    24. 24. HTTP Source, cont’d.
        • Default handler supports events that look like this:
          [{
            "headers" : { "timestamp" : "434324343", "host" : "host1.example.com" },
            "body" : "arbitrary data in body string"
          },
          {
            "headers" : { "namenode" : "nn01.example.com", "datanode" : "dn102.example.com" },
            "body" : "some other arbitrary data in body string"
          }]
    25. 25. Exec Source
        • Reads data from the output of a command
        • Can be used for 'tail -F ..'
        • Doesn’t handle failures
        • Configuration:
          agent_foo.sources.execSource.type = exec
          agent_foo.sources.execSource.command = tail -F /var/log/weblog.out
    26. 26. JMS Source
        • Reads messages from a JMS queue or topic, converts them to Flume events and puts those events onto the channel.
        • Pluggable Converter that by default converts Bytes, Text, and Object messages into Flume Events.
        • So far, tested with ActiveMQ. We’d like to hear about experiences with any other JMS implementations.
          agent.sources.jms.type = jms
          agent.sources.jms.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
          agent.sources.jms.providerURL = tcp://mqserver:61616
          agent.sources.jms.destinationName = BUSINESS_DATA
          agent.sources.jms.destinationType = QUEUE
    27. 27. Interceptor • Applied to Source configuration element • One source can have many interceptors • Chain-of-responsibility • Can be used for tagging, filtering, routing* • Built-in interceptors: • TIMESTAMP • HOST • STATIC • REGEX EXTRACTOR
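        As an illustration of chaining the built-in interceptors on one source, a fragment along these lines could be used. It is a sketch under assumptions: the agent and source names (agent1, src1), the interceptor names (ts, host, dc), and the header key and value are made up for the example; `timestamp`, `host`, and `static` are the built-in type aliases.

        ```properties
        # Interceptors run in the order listed here.
        agent1.sources.src1.interceptors = ts host dc

        # Adds a "timestamp" header with the processing time in millis.
        agent1.sources.src1.interceptors.ts.type = timestamp

        # Adds the agent's host under the header name given by hostHeader.
        agent1.sources.src1.interceptors.host.type = host
        agent1.sources.src1.interceptors.host.hostHeader = hostname

        # Adds a fixed header to every event (key and value are examples).
        agent1.sources.src1.interceptors.dc.type = static
        agent1.sources.src1.interceptors.dc.key = datacenter
        agent1.sources.src1.interceptors.dc.value = dc1
        ```

        Headers set this way can then drive a multiplexing channel selector or HDFS path escapes further down the flow.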
    28. 28. Writing a custom interceptor
        • Configuration:
          # Declare interceptors
          agent1.sources.src1.interceptors = int1 int2 …
          # Define each interceptor
          agent1.sources.src1.interceptors.int1.type = <type>
          agent1.sources.src1.interceptors.int1.foo = bar
        • Custom Interceptors:
          org.apache.flume.interceptor.Interceptor:
            void close()
            void initialize()
            Event intercept(Event)
            List<Event> intercept(List<Event> events)
          org.apache.flume.interceptor.Interceptor.Builder:
            Interceptor build()
            void configure(Context)
    29. 29. Channel Selector
        • Applied to Source, at most one.
        • Not a Named Component
        • Built-in Channel Selectors: REPLICATING (Default), MULTIPLEXING
        • Multiplexing Channel Selector: contextual routing; must have a default set of channels
          agent1.sources.src1.selector.type = MULTIPLEXING
          agent1.sources.src1.selector.mapping.foo = ch1
          agent1.sources.src1.selector.mapping.bar = ch2
          agent1.sources.src1.selector.mapping.baz = ch1 ch2
          agent1.sources.src1.selector.default = ch5 ch6
    30. 30. Custom Channel Selector
        • Configuration:
          agent1.sources.src1.selector.type = <type>
          agent1.sources.src1.selector.prop1 = value1
          agent1.sources.src1.selector.prop2 = value2
        • Interface:
          org.apache.flume.ChannelSelector:
            void setChannels(List<Channel>)
            List<Channel> getRequiredChannels(Event)
            List<Channel> getOptionalChannels(Event)
            List<Channel> getAllChannels()
            void configure(Context)
    31. 31. Channel • Passive Component • Determines the reliability of a flow • “Stock” channels that ship with Flume • FILE – provides durability; most people use this • MEMORY – lower latency for small writes, but not durable • JDBC – provides full ACID support, but has performance issues
    32. 32. File Channel
        • Write Ahead Log implementation
        • Configuration:
          agent1.channels.ch1.type = FILE
          agent1.channels.ch1.checkpointDir = <dir>
          agent1.channels.ch1.dataDirs = <dir1> <dir2>…
          agent1.channels.ch1.capacity = N (100k)
          agent1.channels.ch1.transactionCapacity = n
          agent1.channels.ch1.checkpointInterval = n (30000)
          agent1.channels.ch1.maxFileSize = N (1.52G)
          agent1.channels.ch1.write-timeout = n (10s)
          agent1.channels.ch1.checkpoint-timeout = n (600s)
    33. 33. File Channel
        • Flume Event Queue: in-memory representation of the channel; maintains a queue of pointers to the data on disk in various log files; reference counts log files; is memory mapped to a checkpoint file
        • Log Files: on-disk representation of actions (Puts/Takes/Commits/Rollbacks); maintain the actual data; log files with 0 refs get deleted
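        A file channel laid out over separate disks, one for the checkpoint and two for the log files, might be configured roughly as follows. This is a sketch: the mount points and the channel name are hypothetical, and `dataDirs` is written comma-separated here, which is how the Flume User Guide documents it.

        ```properties
        agent1.channels = ch1
        agent1.channels.ch1.type = file

        # Checkpoint (the memory-mapped Flume Event Queue) on its own disk.
        agent1.channels.ch1.checkpointDir = /disk1/flume/checkpoint

        # Log files (the actual event data) spread over two more disks.
        agent1.channels.ch1.dataDirs = /disk2/flume/data,/disk3/flume/data

        # Seconds a put waits for space to free up before failing.
        agent1.channels.ch1.keep-alive = 3
        ```

        Keeping the checkpoint and data directories on different spindles avoids the two kinds of writes competing for the same disk head.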
    34. 34. Sink • Polling Semantics • Supports Batch Processing • Specialized Sinks • HDFS (Write to HDFS – highly configurable) • HBASE, ASYNCHBASE (Write to HBase) • AVRO (IPC Sink – Avro Source as IPC source at next hop) • THRIFT (IPC Sink – Thrift Source as IPC source at next hop) • FILE_ROLL (Local disk, roll files based on size, # of events etc) • NULL, LOGGER (For Testing Purposes) • ElasticSearch • IRC
    35. 35. HDFS Sink • Writes events to HDFS (what!) • Configuring (taken from Flume User Guide): [configuration table not captured in the transcript]
    36. 36. HDFS Sink • Supports dynamic directory naming using tags • Use event headers: %{header} • E.g.: hdfs://namenode/flume/%{header} • Use the timestamp from the event header • Various escape sequences are available for this. • E.g.: hdfs://namenode/flume/%{header}/%Y-%m-%d/ • Use roundValue and roundUnit to round down the timestamp to use separate directories. • Within a directory, files are rolled based on: • rollInterval – seconds to wait before rolling the current file • rollSize – max size of the file • rollCount – max # of events per file
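        Put together, the path escapes, rounding, and roll settings from this slide could be combined as in the fragment below. Treat it as a sketch: the sink and channel names, the `host` header, and all numeric values are illustrative assumptions. The time escapes rely on a `timestamp` header being present on each event, for example one added by the timestamp interceptor.

        ```properties
        agent1.sinks.hdfs1.type = hdfs
        agent1.sinks.hdfs1.channel = ch1

        # %{host} is replaced by the event's "host" header; the %-escapes
        # are filled in from the event's "timestamp" header.
        agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/%{host}/%Y-%m-%d/%H%M

        # Round the timestamp down to 10-minute buckets for the path.
        agent1.sinks.hdfs1.hdfs.round = true
        agent1.sinks.hdfs1.hdfs.roundValue = 10
        agent1.sinks.hdfs1.hdfs.roundUnit = minute

        # Roll a file after 300 s or 128 MB, whichever comes first;
        # 0 disables rolling by event count.
        agent1.sinks.hdfs1.hdfs.rollInterval = 300
        agent1.sinks.hdfs1.hdfs.rollSize = 134217728
        agent1.sinks.hdfs1.hdfs.rollCount = 0
        ```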
    37. 37. AsyncHBase Sink • Insert events and increments into HBase • Writes events asynchronously at very high rate. • Easy to configure: • table • columnFamily • batchSize – # events per txn. • timeout – how long to wait for success callback • serializer/serializer.* – Custom serializer can decide how and where the events are written out.
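        The parameters listed on this slide map onto a configuration along these lines. It is a sketch only: the sink, channel, table, and column family names and the numeric values are placeholders; `SimpleAsyncHbaseEventSerializer` is the simple serializer that ships with Flume.

        ```properties
        agent1.sinks.hbase1.type = asynchbase
        agent1.sinks.hbase1.channel = ch1

        # Target table and column family (example names).
        agent1.sinks.hbase1.table = flume_events
        agent1.sinks.hbase1.columnFamily = cf

        # Events per transaction, and how long (ms) to wait for the
        # success callbacks before the transaction is rolled back.
        agent1.sinks.hbase1.batchSize = 100
        agent1.sinks.hbase1.timeout = 60000

        # Decides how each event becomes puts and increments.
        agent1.sinks.hbase1.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
        ```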
    38. 38. IPC Sinks (Avro/Thrift) • Sends events to the next hop’s IPC Source • Configuring: • hostname • port • batch-size – # events per txn/batch sent to next hop • request-timeout – how long to wait for success of batch
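        A single hop between two agents, an Avro sink on one side and an Avro source on the other, could be wired roughly like this. The agent names (tier1, tier2), component names, host name, and port are assumptions for the example, and the usual `sources`/`channels`/`sinks` declarations and the channel definitions are left out to keep the fragment short.

        ```properties
        # Tier 1 agent: the sink sends batches to the next hop.
        tier1.sinks.fwd.type = avro
        tier1.sinks.fwd.channel = ch1
        tier1.sinks.fwd.hostname = collector1.example.com
        tier1.sinks.fwd.port = 4545
        tier1.sinks.fwd.batch-size = 100
        tier1.sinks.fwd.request-timeout = 20000

        # Tier 2 agent: the source listens on the port the sink points at.
        tier2.sources.in.type = avro
        tier2.sources.in.bind = 0.0.0.0
        tier2.sources.in.port = 4545
        tier2.sources.in.channels = ch1
        ```

        The sink's `hostname` and `port` must match where the next hop's source is bound; the batch is only removed from the tier 1 channel after the tier 2 source has committed it.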
    39. 39. Serializers • Supported by HDFS, HBase and File_Roll sinks • Convert the event into a format of the user’s choice. • In case of HBase, convert an event into Puts and Increments.
    40. 40. Sink Group • Top-level element, needed to declare sink processors • A sink can be in at most one group at any time • By default all sinks are in their individual default sink group • Default sink group is a pass-through • Deactivating a sink group does not deactivate the sink!!
    41. 41. Sink Processor • Acts as a Sink Proxy • Can work with multiple Sinks • Built-in Sink Processors: • DEFAULT • FAILOVER • LOAD_BALANCE • Applied via Groups! • A Top-Level Component
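        Since sink processors are applied via groups, a failover setup could be declared as in the sketch below. The group and sink names and the numeric values are illustrative; both sinks are assumed to be defined elsewhere in the same agent.

        ```properties
        # Declare the group and the sinks it manages.
        agent1.sinkgroups = g1
        agent1.sinkgroups.g1.sinks = sink1 sink2

        # Failover: the highest-priority live sink gets the events.
        agent1.sinkgroups.g1.processor.type = failover
        agent1.sinkgroups.g1.processor.priority.sink1 = 10
        agent1.sinkgroups.g1.processor.priority.sink2 = 5
        # Upper bound (ms) on the back-off applied to a failed sink.
        agent1.sinkgroups.g1.processor.maxpenalty = 10000

        # Alternative: spread load over both sinks instead.
        # agent1.sinkgroups.g1.processor.type = load_balance
        # agent1.sinkgroups.g1.processor.backoff = true
        # agent1.sinkgroups.g1.processor.selector = round_robin
        ```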
    42. 42. Application integration: Client SDK
        • Factory:
          org.apache.flume.api.RpcClientFactory:
            RpcClient getInstance(Properties)
          org.apache.flume.api.RpcClient:
            void append(Event)
            void appendBatch(List<Event>)
            boolean isActive()
        • Supports: • Failover client • Load balancing client with ROUND_ROBIN, RANDOM, and custom selectors. • Avro • Thrift
    43. 43. Clients: Embedded agent
        • More advanced RPC client. Integrates a channel.
        • Minimal example:
          properties.put("channel.type", "memory");
          properties.put("channel.capacity", "200");
          properties.put("sinks", "sink1");
          properties.put("sink1.type", "avro");
          properties.put("sink1.hostname", "collector1.example.com");
          properties.put("sink1.port", "5564");
          EmbeddedAgent agent = new EmbeddedAgent("myagent");
          agent.configure(properties);
          agent.start();
          List<Event> events = new ArrayList<Event>();
          events.add(event);
          agent.putAll(events);
          agent.stop();
        • See Flume Developer Guide for more details and examples.
    44. 44. General Caveats • Reliability = function of channel type, capacity, and system redundancy • Carefully size the channels for needed capacity • Set batch sizes based on projected drain requirements • Number of cores should be ½ total # of sources & sinks combined in an agent
    45. 45. A common topology [Diagram: App tier (App-1, App-2, App-3, each using the Flume SDK) → Flume agent tier 1 (agent11, agent12, agent13: avro source, file channel, avro sinks, with load balancing + failover) → Flume agent tier 2 (agent21, agent22: avro source, file channel, hdfs sink) → Storage tier (HDFS)]
    46. 46. Summary • Clients send Events to Agents • Each agent hosts Flume components: Source, Interceptors, Channel Selectors, Channels, Sink Processors & Sinks • Sources & Sinks are active components, Channels are passive • Source accepts Events, passes them through Interceptor(s), and if not filtered, puts them on channel(s) selected by the configured Channel Selector • Sink Processor identifies a sink to invoke, which takes Events from a Channel and sends them to its next-hop destination • Channel operations are transactional to guarantee one-hop delivery semantics • Channel persistence provides end-to-end reliability
    47. 47. Reference docs (1.3.1 release) • User Guide: flume.apache.org/FlumeUserGuide.html • Dev Guide: flume.apache.org/FlumeDeveloperGuide.html
    48. 48. Blog posts • Flume performance tuning: https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1 • Flume and HBase: https://blogs.apache.org/flume/entry/streaming_data_into_apache_hbase • File Channel Innards: https://blogs.apache.org/flume/entry/apache_flume_filechannel • Architecture of Flume NG: https://blogs.apache.org/flume/entry/flume_ng_architecture
    49. 49. Contributing: How to get involved! • Join the mailing lists: • user-subscribe@flume.apache.org • dev-subscribe@flume.apache.org • Look at the code • github.com/apache/flume – Mirror of the Apache Flume git repo • File or fix a JIRA • issues.apache.org/jira/browse/FLUME • More on how to contribute: • cwiki.apache.org/confluence/display/FLUME/How+to+Contribute
    50. 50. Questions?
    51. 51. Thank you • Reach out on the mailing lists! • Follow me on Twitter: @harisr1234