Your SlideShare is downloading. ×
Apache Flume (NG)
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×
Saving this for later? Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime – even offline.
Text the download link to your phone
Standard text messaging rates apply

Apache Flume (NG)

12,652

Published on

Apache FlumeNG presentation, held in Stuttgart. Includes coding and configuration examples

Apache FlumeNG presentation, held in Stuttgart. Includes coding and configuration examples

Published in: Technology
0 Comments
22 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
12,652
On Slideshare
0
From Embeds
0
Number of Embeds
22
Actions
Shares
0
Downloads
354
Comments
0
Likes
22
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. March 2012Apache Flume (NG)Alexander Lorenz | Customer Operations Engineer
  • 2. Overview • Stream data (events, not files) from clients to sinks • Clients: files, syslog, avro, … • Sinks: HDFS files, HBase, … • Configurable reliability levels – Best effort: “Fast and loose” – Guaranteed delivery: “Deliver no matter what” • Configurable routing / topology2 ©2012 Cloudera, Inc. All Rights Reserved.
  • 3. Architecture Component Function Agent The JVM running Flume. One per machine. Runs many sources and sinks. Client Produces data in the form of events. Runs in a separate thread. Sink Receives events from a channel. Runs in a separate thread. Channel Connects sources to sinks (like a queue). Implements the reliability semantics. Event A single datum; a log record, an avro object, etc. Normally around ~4KB.3 ©2012 Cloudera, Inc. All Rights Reserved.
  • 4. Agent • Runs many clients and sinks • Java properties-based configuration • Low overhead (-Xmx20m) – But adding RAM increases performance4 ©2012 Cloudera, Inc. All Rights Reserved.
  • 5. Sources • Plugin interface • Managed by a SourceRunner that controls threading and execution model (e.g. polling vs. event-based) • Included: exec, avro, syslog, …5 ©2012 Cloudera, Inc. All Rights Reserved.
  • 6. Sources public class MySource implements PollableSource { public Status process() { // Do something to create an Event.. Event e = EventBuilder.withBody(…).build(); // A channel instance is injected by Flume. Transaction tx = channel.getTransaction(); tx.begin(); try { channel.put(e); tx.commit(); } catch (ChannelException ex) { tx.rollback(); return Status.BACKOFF; } finally { tx.close(); } return Status.READY; } }6 ©2012 Cloudera, Inc. All Rights Reserved.
  • 7. Channel • Plugin interface • Transactional • Provide queuing between source / sink • Provide reliability semantics – MemoryChannel: Basically a Java BlockingQueue. – JDBC – WAL7 ©2012 Cloudera, Inc. All Rights Reserved.
  • 8. Sinks • Plugin interface • Managed by a SinkRunner that controls threading and execution model • Included: HDFS files (various formats)8 ©2012 Cloudera, Inc. All Rights Reserved.
  • 9. Sinks Code public class MySink implements PollableSink { public Status process() { Transaction tx = channel.getTransaction(); tx.begin(); try { Event e = channel.take(); if (e != null) { // … tx.commit(); } else { return Status.BACKOFF; } } catch (ChannelException ex) { tx.rollback(); return Status.BACKOFF; } finally { tx.close(); } return Status.READY; } }9 ©2012 Cloudera, Inc. All Rights Reserved.
  • 10. Tiered collection • Send events from agents to another tier of agents to aggregate • Use an Avro sink (really just a client) to send events to an Avro source (really just a server) in another machine • Failover supported • Load balancing (soon) • Transactions guarantee handoff10 ©2012 Cloudera, Inc. All Rights Reserved.
  • 11. Tiered collection – The handoff • Agent 1: Tx begin • Agent 1: Channel take event • Agent 1: Sink send • Agent 2: Tx begin • Agent 2: Channel put • Agent 2: Tx commit, respond OK • Agent 1: Tx commit (or rollback)11 ©2012 Cloudera, Inc. All Rights Reserved.
  • 12. Configuration • done in a single file • identifier for flows • <identifier>.type.subtype.parameter.config, where <identifier> is the name of the agent • flow has a client, type, channel, sink • can have multiple channels in one flow12 ©2012 Cloudera, Inc. All Rights Reserved.
  • 13. Simple Configuration Example syslog-agent.sources = Syslogsyslog-agent.channels = MemoryChannel-1syslog-agent.sinks = Consolesyslog-agent.sources.Syslog.type = syslogTcpsyslog-agent.sources.Syslog.port = 5140syslog-agent.sources.Syslog.channels =MemoryChannel-1syslog-agent.sinks.Console.channel = MemoryChannel-1syslog-agent.sinks.Console.type = loggersyslog-agent.channels.MemoryChannel-1.type = memory13 ©2012 Cloudera, Inc. All Rights Reserved.
  • 14. HDFS Configuration Example syslog-agent.sources = Syslogsyslog-agent.channels = MemoryChannel-1syslog-agent.sinks = HDFS-LABsyslog-agent.sources.Syslog.type = syslogTcpsyslog-agent.sources.Syslog.port = 5140syslog-agent.sources.Syslog.channels = MemoryChannel-1syslog-agent.sinks.HDFS-LAB.channel = MemoryChannel-1syslog-agent.sinks.HDFS-LAB.type = hdfssyslog-agent.sinks.HDFS-LAB.hdfs.path = hdfs://NN.URI:PORT/flumetest/%{host}syslog-agent.sinks.HDFS-LAB.hdfs.file.Prefix = syslogfilessyslog-agent.sinks.HDFS-LAB.hdfs.file.rollInterval = 60syslog-agent.sinks.HDFS-LAB.hdfs.file.Type = SequenceFilesyslog-agent.channels.MemoryChannel-1.type = memory 14 ©2012 Cloudera, Inc. All Rights Reserved.
  • 15. Features • Fan out: One source, many channels • Fan in: Many sources, one channel • Processors (aka: decorators) • Auto-batching of events in RPCs… • Multiplexing channels for datamining • Avro implementation in both ways15 ©2012 Cloudera, Inc. All Rights Reserved.
  • 16. Thank You • Web: https://cwiki.apache.org/FLUME/ getting-started.html • ML: flume-user@incubator.apache.org • Mail: alexander@cloudera.com • Blog: mapredit.blogspot.com • Twitter: @mapredit16 ©2012 Cloudera, Inc. All Rights Reserved.

×