Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume


Apache Flume is a highly scalable, distributed, fault-tolerant data collection framework for Apache Hadoop and Apache HBase. Flume is designed to transfer massive volumes of event data into HDFS or HBase in a highly scalable way. Flume is declarative and easy to configure, and can be deployed to a large number of machines using configuration management systems like Puppet or Cloudera Manager. In this talk, we will cover the basic components of Flume and how to configure and deploy it. We will also briefly discuss the metrics Flume exposes and the various ways in which these can be collected.

Apache Flume is a Top Level Project (TLP) at the Apache Software Foundation, and has made several releases since entering incubation in June 2011. Flume graduated to become a TLP in July 2012. The current release of Flume is 1.3.1.

Presenter: Hari Shreedharan, PMC Member and Committer, Apache Flume, Software Engineer, Cloudera



  1. Large Scale Data Ingest Using Apache Flume. Hari Shreedharan, Software Engineer, Cloudera; Apache Flume PMC member / committer. February 2013
  2. Why event streaming with Flume is awesome
  • Couldn't I just do this with a shell script? What year is this, 2001? There is a better way!
  • Scalable collection and aggregation of event data (i.e., logs)
  • Dynamic, contextual event routing
  • Low latency, high throughput
  • Declarative configuration
  • Productive out of the box, yet powerfully extensible
  • Open source software
  3. Lessons learned from Flume OG
  • Hard to get predictable performance without decoupling tier impedance
  • Hard to scale out without multiple threads at the sink level
  • A lot of functionality doesn't work well as a decorator
  • People need a system that keeps the data flowing when there is a network partition (or a downed host in the critical path)
  4. Inside a Flume NG agent (architecture diagram)
  5. Topology: Connecting agents together: [Client]+ → Agent → [Agent]* → Destination
  6. Basic Concepts
  • Client: Log4j Appender, Client SDK, clientless operation
  • Agent: hosts Sources, Channels, and Sinks
  • Valid configuration: must have at least one channel; must have at least one source or sink; any number of sources, channels, and sinks
  7. Concepts in Action
  • Source: puts events into the Channel
  • Sink: drains events from the Channel
  • Channel: stores the events until drained
  8. Flow Reliability
  Reliability is based on:
  • Transactional exchange between agents
  • Persistence characteristics of the channels in the flow
  Also available:
  • Built-in load balancing support
  • Built-in failover support
  9. Reliability
  • Transactional guarantees from the channel
  • External clients need to handle retry
  • Built-in avro-client to read streams
  • Avro source for multi-hop flows
  • Use the Flume Client SDK for customization
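The built-in avro-client mentioned above can be exercised from the command line; a sketch, assuming a Flume installation and an Avro source listening on localhost:41414 (host, port, and file path are illustrative):

```shell
# Send each line of a local file as one event to an Avro source (assumed host/port)
bin/flume-ng avro-client -H localhost -p 41414 -F /var/log/weblog.out
```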
  10. Configuration Tree (diagram)
  11. Hierarchical Namespace
  agent1.properties:
  # Active components
  agent1.sources = src1
  agent1.channels = ch1
  agent1.sinks = sink1
  # Define and configure src1
  agent1.sources.src1.type = netcat
  agent1.sources.src1.channels = ch1
  agent1.sources.src1.bind = 127.0.0.1
  agent1.sources.src1.port = 10112
  # Define and configure sink1
  agent1.sinks.sink1.type = logger
  agent1.sinks.sink1.channel = ch1
  # Define and configure ch1
  agent1.channels.ch1.type = memory
  12. Basic Configuration Rules
  • Only the named agent's configuration is loaded (e.g. a stray agent2.sources = src1 src2 is ignored when starting agent1)
  • Only active components' configuration is loaded within the agent's configuration
  • Every Agent must have at least one channel
  • Every Source must have at least one channel
  • Every Sink must have exactly one channel
  • Every component must have a type
  13. Deployment
  • Steady state: inflow == outflow
  • Example: 4 Tier 1 agents at 100 events/sec (batch size) → 1 Tier 2 agent at 400 events/sec
  14. Source
  • Event driven
  • Supports batch processing
  • Source types:
    • AVRO: RPC source; other Flume agents can send data to this source's port
    • THRIFT: RPC source (available in the next Flume release)
    • SPOOLDIR: pick up rotated log files
    • HTTP: post to a REST service (extensible)
    • JMS: ingest from a Java Message Service
    • SYSLOGTCP, SYSLOGUDP
    • NETCAT
    • EXEC
  15. How Does a Source Work?
  • Reads data from external clients / other sinks
  • Stores events in the configured channel(s)
  • Asynchronous to the other end of the channel
  • Transactional semantics for storing data
  16. Source transaction (diagram): the source begins a transaction on the channel, puts a batch of events, then commits the transaction
  17. Source Features
  • Event driven or pollable
  • Supports batching
  • Fanout of flow
  • Interceptors
  18. Fanout (diagram): the source passes events through its interceptors to the channel processor; the channel selector routes each event to Channel1 (Flow 1) and/or Channel2 (Flow 2), with per-channel transaction handling
  19. Channel Selector
  • Replicating selector: replicates events to all channels
  • Multiplexing selector: contextual routing
  agent1.sources.sr1.selector.type = multiplexing
  agent1.sources.sr1.selector.mapping.foo = channel1
  agent1.sources.sr1.selector.mapping.bar = channel2
  agent1.sources.sr1.selector.default = channel1
  agent1.sources.sr1.selector.header = yourHeader
  20. Built-in Sources in Flume
  • Asynchronous sources (client does not handle failures): Exec, Syslog
  • Synchronous sources (client handles failures): Avro, Scribe, HTTP, JMS
  • Flume 0.9x sources: AvroLegacy, ThriftLegacy
  21. RPC Sources: Avro and Thrift
  • Read events from an external client (TCP only)
  • Connect two agents in a distributed flow
  • Based on IPC, so failure notification is enabled
  • Configuration:
  agent_foo.sources.rpcsource-1.type = avro (or thrift)
  agent_foo.sources.rpcsource-1.bind = <host>
  agent_foo.sources.rpcsource-1.port = <port>
  22. Spooling Directory Source
  • Parses rotated log files out of a "spool" directory
  • Watches for new files; renames or deletes them when done
  • The files must be immutable before being placed into the watched directory
  agent.sources.spool.type = spooldir
  agent.sources.spool.spoolDir = /var/log/spooled-files
  agent.sources.spool.deletePolicy = never OR immediate
  23. HTTP Source
  • Runs a web server that handles HTTP requests
  • The handler is pluggable (you can roll your own)
  • Out of the box, an HTTP client posts a JSON array of events to the server; the server parses the events and puts them on the channel
  agent.sources.http.type = http
  agent.sources.http.port = 8081
  24. HTTP Source, cont'd.
  • The default handler supports events that look like this:
  [{
    "headers" : { "timestamp" : "434324343", "host" : "host1.example.com" },
    "body" : "arbitrary data in body string"
  }, {
    "headers" : { "namenode" : "nn01.example.com", "datanode" : "dn102.example.com" },
    "body" : "some other arbitrary data in body string"
  }]
  25. Exec Source
  • Reads data from the output of a command
  • Can be used for 'tail -F ...'
  • Doesn't handle failures
  Configuration:
  agent_foo.sources.execSource.type = exec
  agent_foo.sources.execSource.command = tail -F /var/log/weblog.out
  26. JMS Source
  • Reads messages from a JMS queue or topic, converts them to Flume events, and puts those events onto the channel
  • Pluggable converter that by default converts Bytes, Text, and Object messages into Flume events
  • So far, tested with ActiveMQ. We'd like to hear about experiences with any other JMS implementations.
  agent.sources.jms.type = jms
  agent.sources.jms.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
  agent.sources.jms.providerURL = tcp://mqserver:61616
  agent.sources.jms.destinationName = BUSINESS_DATA
  agent.sources.jms.destinationType = QUEUE
  27. Interceptor
  • Applied to the Source configuration element
  • One source can have many interceptors (chain of responsibility)
  • Can be used for tagging, filtering, routing*
  • Built-in interceptors: TIMESTAMP, HOST, STATIC, REGEX EXTRACTOR
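The built-in interceptors above are wired in through the source configuration; a minimal sketch chaining two of them (the interceptor names ts and hostint are arbitrary labels, while the types timestamp and host are the built-ins listed above):

```properties
# Chain a TIMESTAMP and a HOST interceptor on src1; each event gets
# a timestamp header and a header carrying the agent's host
agent1.sources.src1.interceptors = ts hostint
agent1.sources.src1.interceptors.ts.type = timestamp
agent1.sources.src1.interceptors.hostint.type = host
```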
  28. Writing a custom interceptor
  • Configuration:
  # Declare interceptors
  agent1.sources.src1.interceptors = int1 int2 …
  # Define each interceptor
  agent1.sources.src1.interceptors.int1.type = <type>
  agent1.sources.src1.interceptors.int1.foo = bar
  • Custom interceptors implement:
  org.apache.flume.interceptor.Interceptor:
    void initialize()
    Event intercept(Event)
    List<Event> intercept(List<Event> events)
    void close()
  org.apache.flume.interceptor.Interceptor.Builder:
    Interceptor build()
    void configure(Context)
  29. Channel Selector
  • Applied to a Source; at most one per source
  • Not a named component
  • Built-in channel selectors: REPLICATING (default), MULTIPLEXING
  • Multiplexing channel selector: contextual routing; must have a default set of channels
  agent1.sources.src1.selector.type = MULTIPLEXING
  agent1.sources.src1.selector.mapping.foo = ch1
  agent1.sources.src1.selector.mapping.bar = ch2
  agent1.sources.src1.selector.mapping.baz = ch1 ch2
  agent1.sources.src1.selector.default = ch5 ch6
  30. Custom Channel Selector
  • Configuration:
  agent1.sources.src1.selector.type = <type>
  agent1.sources.src1.selector.prop1 = value1
  agent1.sources.src1.selector.prop2 = value2
  • Interface: org.apache.flume.ChannelSelector
    void setChannels(List<Channel>)
    List<Channel> getRequiredChannels(Event)
    List<Channel> getOptionalChannels(Event)
    List<Channel> getAllChannels()
    void configure(Context)
  31. Channel
  • Passive component
  • Determines the reliability of a flow
  • "Stock" channels that ship with Flume:
    • FILE: provides durability; most people use this
    • MEMORY: lower latency for small writes, but not durable
    • JDBC: provides full ACID support, but has performance issues
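As a sketch of the latency/durability trade-off, a MEMORY channel is configured with an overall capacity and a per-transaction capacity; the sizes below are illustrative, not recommendations:

```properties
# Memory channel: low latency, but events are lost if the agent dies
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000
agent1.channels.ch1.transactionCapacity = 100
```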
  32. File Channel
  • Write-ahead log implementation
  • Configuration (defaults in parentheses):
  agent1.channels.ch1.type = FILE
  agent1.channels.ch1.checkpointDir = <dir>
  agent1.channels.ch1.dataDirs = <dir1> <dir2> …
  agent1.channels.ch1.capacity = N (100k)
  agent1.channels.ch1.transactionCapacity = n
  agent1.channels.ch1.checkpointInterval = n (30000)
  agent1.channels.ch1.maxFileSize = N (1.52G)
  agent1.channels.ch1.write-timeout = n (10s)
  agent1.channels.ch1.checkpoint-timeout = n (600s)
  33. File Channel
  Flume Event Queue:
  • In-memory representation of the channel
  • Maintains a queue of pointers to the data on disk in various log files; reference-counts log files
  • Memory-mapped to a checkpoint file
  Log files:
  • On-disk representation of actions (puts/takes/commits/rollbacks)
  • Maintain the actual data
  • Log files with 0 references get deleted
  34. Sink
  • Polling semantics
  • Supports batch processing
  • Specialized sinks:
    • HDFS (write to HDFS; highly configurable)
    • HBASE, ASYNCHBASE (write to HBase)
    • AVRO (IPC sink; Avro source as the IPC source at the next hop)
    • THRIFT (IPC sink; Thrift source as the IPC source at the next hop)
    • FILE_ROLL (local disk; roll files based on size, number of events, etc.)
    • NULL, LOGGER (for testing purposes)
    • ElasticSearch
    • IRC
  35. HDFS Sink
  • Writes events to HDFS (what!)
  • Configuration table taken from the Flume User Guide (not reproduced here)
  36. HDFS Sink
  • Supports dynamic directory naming using tags
    • Use event headers: %{header}, e.g. hdfs://namenode/flume/%{header}
    • Use the timestamp from the event header via escape sequences, e.g. hdfs://namenode/flume/%{header}/%Y-%m-%d/
    • Use roundValue and roundUnit to round the timestamp down into separate directories
  • Within a directory, files are rolled based on:
    • rollInterval: time since the last event was written
    • rollSize: max size of the file
    • rollCount: max number of events per file
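Putting the directory tags and roll settings together, a hedged sketch of an HDFS sink configuration (the path, roll thresholds, and channel name are illustrative; the time escapes require a timestamp header on each event, e.g. from the TIMESTAMP interceptor):

```properties
# Bucket events by date and hour, rounded down to 10-minute directories
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = ch1
agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/%Y-%m-%d/%H%M
agent1.sinks.sink1.hdfs.round = true
agent1.sinks.sink1.hdfs.roundValue = 10
agent1.sinks.sink1.hdfs.roundUnit = minute
# Roll every 5 minutes or 64 MB, whichever comes first; disable count-based rolling
agent1.sinks.sink1.hdfs.rollInterval = 300
agent1.sinks.sink1.hdfs.rollSize = 67108864
agent1.sinks.sink1.hdfs.rollCount = 0
```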
  37. AsyncHBase Sink
  • Inserts events and increments into HBase
  • Writes events asynchronously at a very high rate
  • Easy to configure:
    • table
    • columnFamily
    • batchSize: number of events per transaction
    • timeout: how long to wait for the success callback
    • serializer / serializer.*: a custom serializer can decide how and where the events are written out
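A sketch using the options listed above (the table and column family names are assumptions for illustration):

```properties
# AsyncHBase sink draining ch1 into an HBase table
agent1.sinks.hb1.type = asynchbase
agent1.sinks.hb1.channel = ch1
agent1.sinks.hb1.table = flume_events
agent1.sinks.hb1.columnFamily = cf1
agent1.sinks.hb1.batchSize = 100
```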
  38. IPC Sinks (Avro/Thrift)
  • Send events to the next hop's IPC source
  • Configuration:
    • hostname
    • port
    • batch-size: number of events per transaction/batch sent to the next hop
    • request-timeout: how long to wait for the batch to succeed
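As a sketch, an Avro sink pointed at the next hop's Avro source (the collector host name and port are assumptions):

```properties
# Forward events from ch1 to the Avro source on the next-hop agent
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.channel = ch1
agent1.sinks.avroSink.hostname = collector1.example.com
agent1.sinks.avroSink.port = 4141
agent1.sinks.avroSink.batch-size = 100
```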
  39. Serializers
  • Supported by the HDFS, HBase, and FILE_ROLL sinks
  • Convert the event into a format of the user's choice
  • In the case of HBase, convert an event into puts and increments
  40. Sink Group
  • Top-level element, needed to declare sink processors
  • A sink can be in at most one group at any time
  • By default, all sinks are in their own individual sink groups
  • The default sink group is a pass-through
  • Deactivating a sink group does not deactivate the sink!
  41. Sink Processor
  • Acts as a sink proxy
  • Can work with multiple sinks
  • Built-in sink processors: DEFAULT, FAILOVER, LOAD_BALANCE
  • Applied via sink groups
  • A top-level component
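Tying sink groups and sink processors together, a sketch of a load-balancing group over two sinks (the group and sink names are arbitrary labels):

```properties
# Round-robin events across sink1 and sink2, backing off failed sinks
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = sink1 sink2
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.selector = round_robin
agent1.sinkgroups.g1.processor.backoff = true
```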
  42. Application integration: Client SDK
  • Factory: org.apache.flume.api.RpcClientFactory:
    RpcClient getInstance(Properties)
  • org.apache.flume.api.RpcClient:
    void append(Event)
    void appendBatch(List<Event>)
    boolean isActive()
  • Supports:
    • Failover client
    • Load-balancing client with ROUND_ROBIN, RANDOM, and custom selectors
    • Avro
    • Thrift
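The factory reads its settings from a Properties object; a sketch of the properties for a load-balancing client (the host names and ports are assumptions):

```properties
# Properties passed to RpcClientFactory.getInstance(...)
client.type = default_loadbalance
hosts = h1 h2
hosts.h1 = collector1.example.com:4141
hosts.h2 = collector2.example.com:4141
host-selector = round_robin
```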
  43. Clients: Embedded agent
  • A more advanced RPC client; integrates a channel
  • Minimal example:
  properties.put("channel.type", "memory");
  properties.put("channel.capacity", "200");
  properties.put("sinks", "sink1");
  properties.put("sink1.type", "avro");
  properties.put("sink1.hostname", "collector1.example.com");
  properties.put("sink1.port", "5564");
  EmbeddedAgent agent = new EmbeddedAgent("myagent");
  agent.configure(properties);
  agent.start();
  List<Event> events = new ArrayList<Event>();
  events.add(event);
  agent.putAll(events);
  agent.stop();
  • See the Flume Developer Guide for more details and examples.
  44. General Caveats
  • Reliability is a function of channel type, capacity, and system redundancy
  • Carefully size the channels for the needed capacity
  • Set batch sizes based on projected drain requirements
  • The number of cores should be half the total number of sources and sinks combined in an agent
  45. A common topology (diagram): applications in the App Tier use the Flume Client SDK to send Avro events to Tier 1 Flume agents (Avro source, file channel, Avro sinks); Tier 1 agents load-balance, with failover, across Tier 2 agents (Avro source, file channel, HDFS sink), which write into HDFS in the Storage Tier
  46. Summary
  • Clients send events to agents
  • Each agent hosts Flume components: Sources, Interceptors, Channel Selectors, Channels, Sink Processors, and Sinks
  • Sources and Sinks are active components; Channels are passive
  • A Source accepts events, passes them through its Interceptor(s), and, if they are not filtered out, puts them on the channel(s) selected by the configured Channel Selector
  • The Sink Processor identifies a Sink to invoke, which takes events from a Channel and sends them to its next-hop destination
  • Channel operations are transactional to guarantee one-hop delivery semantics
  • Channel persistence provides end-to-end reliability
  47. Reference docs (1.3.1 release)
  • User Guide: flume.apache.org/FlumeUserGuide.html
  • Dev Guide: flume.apache.org/FlumeDeveloperGuide.html
  48. Blog posts
  • Flume performance tuning: https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1
  • Flume and HBase: https://blogs.apache.org/flume/entry/streaming_data_into_apache_hbase
  • File Channel innards: https://blogs.apache.org/flume/entry/apache_flume_filechannel
  • Architecture of Flume NG: https://blogs.apache.org/flume/entry/flume_ng_architecture
  49. Contributing: How to get involved!
  • Join the mailing lists:
    • user-subscribe@flume.apache.org
    • dev-subscribe@flume.apache.org
  • Look at the code: github.com/apache/flume (mirror of the Apache Flume git repo)
  • File or fix a JIRA: issues.apache.org/jira/browse/FLUME
  • More on how to contribute: cwiki.apache.org/confluence/display/FLUME/How+to+Contribute
  50. Questions?
  51. Thank you. Reach out on the mailing lists! Follow me on Twitter: @harisr1234
