Chicago Data Summit: Flume: An Introduction

Flume is an open-source, distributed, streaming log collection system designed for ingesting large quantities of data into large-scale data storage and analytics platforms such as Apache Hadoop. It was designed with four goals in mind: reliability, scalability, extensibility, and manageability. Its horizontally scalable architecture offers fault-tolerant end-to-end delivery guarantees, supports low-latency event processing, provides a centralized management interface, and exposes metrics for ingest monitoring and reporting. It natively supports writing data to Hadoop's HDFS, but also has a simple extension interface that allows it to write to other scalable data systems such as low-latency datastores or incremental search indexers.

Transcript of "Chicago Data Summit: Flume: An Introduction"

  2. Flume: Logging for the Enterprise
     Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer
     Cloudera, Inc
     Chicago Data Summit, 4/26/11
  3. Who Am I?
     - Cloudera: Software Engineer on the Platform Team; Flume Project Lead / Designer / Architect
     - U of Washington: "on leave" from the PhD program; research in Systems and Programming Languages
     - Previously: Computer Security, Embedded Systems
  4. An Enterprise Scenario
     - You have a bunch of departments with servers generating log files.
     - You are required to keep the logs, and you want to analyze and profit from them.
     - Because of the volume of uncooked data, you've started using Cloudera's Distribution including Apache Hadoop.
     - ... and you've got several ad-hoc, legacy scripts/systems that copy data from servers/filers to HDFS.
     "It's log, log ... everyone wants a log!"
  5. Ad-hoc gets complicated
     - Black box? What happens if the person who wrote it leaves?
     - Unextensible? Is it one-off, or flexible enough to handle future needs?
     - Unmanageable? Do you know when something goes wrong?
     - Unreliable? If something goes wrong, will it recover?
     - Unscalable? Have you hit an ingestion rate limit?
  6. Cloudera Flume
     Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing.
     Project goals: Scalability, Reliability, Extensibility, Manageability, Openness
  7-10. The Canonical Use Case
     [Diagram, built up over four slides: an agent tier of Flume agents running on the log-generating servers sends events to a collector tier, which writes into HDFS; a Flume master controls the agent and collector nodes.]
  11-12. Flume's Key Abstractions
     - Data path and control path.
     - Nodes are in the data path. Nodes have a source and a sink, and can take different roles: a typical topology has agent nodes and collector nodes, and optionally processor nodes.
     - Masters are in the control path: a centralized point of configuration that specifies sources and sinks and can control flows of data between nodes. Use one master, or use many with a ZooKeeper-backed quorum.
     [Diagram: agent and collector nodes, each with a source and a sink, under the control of a master.]
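     As a concrete illustration of the node abstraction, a minimal sketch in Flume's dataflow configuration language (the tail source and console sink follow the configuration examples later in the deck):

         node1 : tail("/var/log/app/app.log") | console ;

     Here node1's source produces events by tailing a file and its sink consumes them by printing to the console; swapping either side changes the node's role without touching the other.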
  13. Outline
     - What is Flume?
     - Scalability: horizontal scalability of all nodes and masters
     - Reliability: fault tolerance and high availability
     - Extensibility: Unix principle; all kinds of data, sources, and sinks
     - Manageability: centralized management supporting dynamic reconfiguration
     - Openness: Apache v2.0 license and an active, growing community
  14. Scalability
  15. The Canonical Use Case (revisited)
     [Diagram: the agent tier on the servers feeding the collector tier, which writes to HDFS.]
  16. Data path is horizontally scalable
     - Add collectors to increase availability and to handle more data. Assumes a single agent will not dominate a collector.
     - Fewer connections to HDFS, which would otherwise tax the resource-constrained NameNode.
     - Larger, more efficient writes to HDFS; fewer files avoids the "small file problem".
     - Simplifies the security story when supporting Kerberized HDFS or protected production servers.
     - Agents have mechanisms for machine-resource tradeoffs (see the sketch below):
       - Write logs locally to avoid collector disk I/O bottlenecks and catastrophic failures.
       - Compression and batching (trade CPU for network).
       - Push computation into the event collection pipeline (balance I/O, memory, and CPU bottlenecks).
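     A sketch of the batching/compression tradeoff in the dataflow language. The { decorator => sink } form follows the configuration syntax shown later in the deck; the batch, gzip, gunzip, and unbatch decorator names are taken from the Flume (OG) decorator set, and their exact names and arguments should be treated as assumptions:

         agent1 : tail("/var/log/app/app.log") | { batch(100) => { gzip => agentE2ESink("collector1", 35853) } } ;
         collector1 : collectorSource(35853) | { gunzip => { unbatch => collectorSink("hdfs://namenode/logs/app/", "app") } } ;

     Batching amortizes per-event RPC overhead and compression shrinks the highly compressible log data, spending agent CPU to relieve the network.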
  17. Node scalability limits and optimization plans
     - In most deployments today, a single collector is not saturated.
     - The current implementation can write at about 20 MB/s over 1GbE (~1.75 TB/day) due to unoptimized network usage.
     - Assuming 1GbE with aggregate disk able to write at close to the GbE rate, we can probably reach:
       - 3-5x by batching to get to the wire/disk limit (trade latency for throughput)
       - 5-10x by compression, trading CPU for throughput (logs are highly compressible)
     - The limit is probably in the ballpark of 40 effective TB/day per collector.
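     To connect the numbers: 20 MB/s is roughly 1.7 TB/day; batching up toward the 1GbE wire limit (~125 MB/s) gives on the order of 10 TB/day of raw throughput, and compressing highly compressible logs by a further few-fold puts the effective rate in the ~40 TB/day ballpark cited above.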
  18. Control plane is horizontally scalable
     - A master controls dynamic configurations of nodes.
     - Uses a consensus protocol to keep state consistent.
     - Scales well for configuration reads; allows for adaptive repartitioning in the future.
     - Nodes can talk to any master.
     - Masters can talk to an existing ZooKeeper ensemble.
     [Diagram: three masters backed by a three-node ZooKeeper ensemble, with nodes connecting to any master.]
  19. Reliability
  20. Failures
     Faults can happen at many levels:
     - Software applications can fail.
     - Machines can fail.
     - Networking gear can fail.
     - Excessive network congestion or machine load.
     - A node goes down for maintenance.
     How do we make sure that events make it to a permanent store?
  21. Tunable failure recovery modes
     - Best effort: fire and forget.
     - Store on failure + retry: writes to disk on detected failure, one-hop TCP acks, failover when faults are detected.
     - End-to-end reliability: write-ahead log on the agent, checksums and end-to-end acks; data survives compound failures and may be retried multiple times.
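     A sketch of selecting these modes per agent in the dataflow language. The agentBESink, agentDFOSink, and agentE2ESink names follow the best-effort, disk-failover, and end-to-end agent sinks documented for Flume (OG); treat the exact arguments as assumptions:

         agent1 : tail("/var/log/app/app.log") | agentBESink("collector1", 35853) ;
         agent2 : tail("/var/log/app/app.log") | agentDFOSink("collector1", 35853) ;
         agent3 : tail("/var/log/app/app.log") | agentE2ESink("collector1", 35853) ;

     agent1 fires and forgets, agent2 spools to local disk and retries on failure, and agent3 keeps a write-ahead log until end-to-end acks confirm delivery.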
  22-23. Load balancing
     - Agents are logically partitioned and send to different collectors.
     - Use randomization to pre-specify failovers when many collectors exist: spread load if a collector goes down, and spread load if new collectors are added to the system.
     [Diagram: agents fanning out across a set of collectors.]
  24-25. Load balancing and collector failover
     - Agents are logically partitioned and send to different collectors.
     - Use randomization to pre-specify failovers when many collectors exist: spread load if a collector goes down, and spread load if new collectors are added to the system (see the sketch below).
     [Diagram: a collector fails and its agents fail over to the remaining collectors.]
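     A sketch of pre-specified failover in the dataflow language. The agentE2EChain sink (an ordered list of collectors to fail over across) and the autoE2EChain variant (which lets the master pick and randomize the chain) are taken from the Flume (OG) user guide; treat the names and argument forms as assumptions:

         agent1 : tail("/var/log/app/app.log") | agentE2EChain("collectorA:35853", "collectorB:35853") ;
         agent2 : tail("/var/log/app/app.log") | agentE2EChain("collectorB:35853", "collectorA:35853") ;
         agent3 : tail("/var/log/app/app.log") | autoE2EChain ;

     Giving each agent a differently ordered chain is what spreads a failed collector's load across the survivors instead of piling it all onto one backup.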
  26-28. Control plane is fault tolerant
     - A master controls dynamic configurations of nodes.
     - Uses a consensus protocol to keep state consistent.
     - Scales well for configuration reads; allows for adaptive repartitioning in the future.
     - Nodes can talk to any master; masters can talk to an existing ZooKeeper ensemble.
     [Diagram, built up over three slides: nodes remapping across the three ZooKeeper-backed masters as one master becomes unavailable and later returns.]
  29. Extensibility
  30. Flume is easy to extend
     - Simple source and sink APIs.
     - An event-streaming design.
     - Many simple operations compose into complex behavior (see the sketch below).
     - Plug-in architecture, so you can add your own sources, sinks, and decorators.
     [Diagram: a source feeding decorators and a fanout to multiple sinks.]
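     For example, composition in the dataflow language: the [ sink, sink ] fanout and { decorator => sink } forms follow the configuration syntax shown later in the deck, while the intervalSampler decorator name is taken from the Flume (OG) decorator set and should be treated as an assumption:

         node1 : collectorSource(35853) | [ console, { intervalSampler(10) => collectorSink("hdfs://namenode/samples/", "sample") } ] ;

     One copy of the stream goes to the console for debugging while a 1-in-10 sample is written to HDFS, all by composing built-in pieces.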
  31. Variety of Connectors
     - Sources produce data: Console, Exec, Syslog, Scribe, IRC, Twitter. In the works: JMS, AMQP, pubsubhubbub/RSS/Atom.
     - Sinks consume data: Console, local files, HDFS, S3. Contributed: Hive (Mozilla), HBase (Sematext), Cassandra (Riptano/DataStax), Voldemort, Elastic Search. In the works: JMS, AMQP.
     - Decorators modify data sent to sinks: wire batching, compression, sampling, projection, extraction, throughput throttling. Custom near-real-time processing (Meebo), JRuby event modifiers (InfoChimps), cryptographic extensions (Rearden), streaming SQL in-stream analytics (FlumeBase, Aaron Kimball).
  32. Migrating a previous enterprise architecture
     [Diagram: existing sources (a filer read by a poller agent, a message bus consumed via an AMQP agent, and a custom app sending Avro) feed Flume agents and collectors, which write to HDFS.]
  33. Data ingestion pipeline pattern
     [Diagram: agents feed a collector whose fanout writes to HDFS (for Hive and Pig queries), HBase (for key lookups and range queries), and an incremental search index (for search and faceted queries).]
  34. Manageability
     Wheeeeee!
  35. Configuring Flume
     node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;
     - A concise and precise configuration language for specifying dataflows in a node.
     - Dynamic updates of configurations: allows for live failover changes, for handling newly provisioned machines, and for changing analytics.
     [Diagram: the tail source flowing through the filter into a fanout of the console and a rolling HDFS sink.]
  36. Output bucketing
     - Automatic output file management: write HDFS files into buckets based on time-based tags.
     node : collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
     Example output:
     /logs/web/2010/0715/1200/data-xxx.txt
     /logs/web/2010/0715/1200/data-xxy.txt
     /logs/web/2010/0715/1300/data-xxx.txt
     /logs/web/2010/0715/1300/data-xxy.txt
     /logs/web/2010/0715/1400/data-xxx.txt
     ...
  37. Configuration is straightforward
     node001: tail("/var/log/app/log") | autoE2ESink;
     node002: tail("/var/log/app/log") | autoE2ESink;
     ...
     node100: tail("/var/log/app/log") | autoE2ESink;
     collector1: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
     collector2: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
     collector3: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
  38. Centralized dataflow management interfaces
     - One place to specify node sources, sinks, and data flows.
     - Basic web interface.
     - Flume shell: command-line interface, scriptable.
     - Cloudera Enterprise: Flume Monitor App, a graphical web interface.
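     As a sketch of scripting the same configuration through the Flume shell instead of the web UI (the exec config and getconfigs commands follow the Flume (OG) shell documentation; the -c connect flag and prompt-less form shown here are assumptions):

         flume shell -c masterhost
         exec config node001 'tail("/var/log/app/log")' 'autoE2ESink'
         getconfigs

     Because the shell is scriptable, provisioning a hundred agents can be a loop that emits one exec config line per node.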
  39. Enterprise friendly
     - Integrated as part of CDH3 and Cloudera Enterprise.
     - RPM and DEB packaging for enterprise Linux; Flume Node for Windows (beta).
     - Cloudera Enterprise support: 24x7 support SLAs, professional services.
     - Cloudera Flume features for enterprises:
       - Kerberos authentication support for writing to "secure" HDFS
       - Detailed JSON-exposed metrics for monitoring integration (beta)
       - Log4j collection (beta)
       - High availability via multiple masters (alpha)
       - Encrypted SSL/TLS data path and control path support (in development)
  40. An enterprise story
     [Diagram: department servers (Windows and Linux) run agents feeding a Flume collector tier that writes to a Kerberized HDFS cluster, with Active Directory / LDAP providing authentication.]
  41. Openness and Community
  42. Flume is open source
     - Apache v2.0 open source license; independent from the Apache Software Foundation.
     - You have the right to fork or modify the software.
     - GitHub source code repository: http://github.com/cloudera/flume
     - Regular tarball update versions every 2-3 months; regular CDH packaging updates every 3-4 months.
     - Always looking for contributors and committers.
  43-47. Growing user and developer community
     - Steady growth in users and interest.
     - Lots of innovation comes from the community.
     - Community folks are willing to try incomplete features, giving early feedback and community fixes.
     - Many interesting topologies in the community.
  48-49. Community topology: Multi Datacenter
     [Diagram, over two slides: API servers and processor servers run agents feeding a per-datacenter collector tier; in the second frame a relay forwards events from the remote datacenter into the central HDFS cluster.]
  50. Community topology: Near Real-time Aggregator
     [Diagram: ad servers run agents feeding a collector that writes both to HDFS, where a Hive job produces verified reports, and to a tracker/DB for quick reports.]
  51. Community support
     - Community-based mailing lists for support ("an answer in a few days"):
       User: https://groups.google.com/a/cloudera.org/group/flume-user
       Dev:  https://groups.google.com/a/cloudera.org/group/flume-dev
     - Community-based IRC chat room ("quick questions, quick answers"): #flume on irc.freenode.net
  52. Conclusions
  53. Summary
     - Flume is a distributed, reliable, scalable, extensible system for collecting and delivering high-volume continuous event data such as logs.
     - It is centrally managed, which allows for automated and adaptive configurations; this design also allows for near-real-time processing.
     - Apache v2.0 license with an active and growing community.
     - Part of Cloudera's Distribution including Apache Hadoop (updated for CDH3u0) and Cloudera Enterprise.
     - Several CDH users in the community run it in production; several Cloudera Enterprise customers are evaluating it for production use.
  54. Related systems
     - Remote syslog-ng / rsyslog / syslog: best effort; if the server is down, messages are lost.
     - Chukwa (Yahoo! / Apache Incubator): designed as a monitoring system for Hadoop; mini-batches require MapReduce batch processing to demultiplex data; new HBase-dependent path. One of the core contributors (Ari) currently works at Cloudera (not on Chukwa).
     - Scribe (Facebook): only durable-on-failure reliability mechanisms; the collector disk is the bottleneck; little visibility into system performance; little support or documentation; most Scribe deployments have been replaced by "Data Freeway".
     - Kafka (LinkedIn): new system by LinkedIn; pull model; interesting, written in Scala.
  55. Questions?
     Contact: jon@cloudera.com, Twitter @jmhsieh
  57-58. Flow Isolation
     - Isolate different kinds of data when and where it is generated.
     - Have multiple logical nodes on a machine; each has its own data source and its own data sink (see the sketch below).
     [Diagram, over two slides: several independent agent-to-collector flows, with multiple logical nodes hosted on the same physical machines.]
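     A sketch of flow isolation with logical nodes: two dataflows are configured under different logical node names and both are then mapped onto the same physical machine. The agent sinks reuse names from the reliability sketch earlier; the shell's exec map command for binding logical nodes to a physical node is taken from the Flume (OG) multi-flow documentation, and its exact syntax is an assumption:

         appflow : tail("/var/log/app/app.log") | agentE2ESink("collectorA", 35853) ;
         sysflow : tail("/var/log/messages") | agentBESink("collectorB", 35853) ;
         exec map server01 appflow
         exec map server01 sysflow

     Each logical node keeps its own source, sink, and delivery guarantees, so the two kinds of data never share a queue even though they share a box.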
  59. Image credits
     http://www.flickr.com/photos/victorvonsalza/3327750057/
     http://www.flickr.com/photos/victorvonsalza/3207639929/
     http://www.flickr.com/photos/victorvonsalza/3327750059/
     http://www.emvergeoning.com/?m=200811
     http://www.flickr.com/photos/juse/188960076/
     http://www.flickr.com/photos/23720661@N08/3186507302/
     http://clarksoutdoorchairs.com/log_adirondack_chairs.html
     http://www.flickr.com/photos/dboo/3314299591/
  60. Master service failures
     - A master machine should not be a single point of failure.
     - Masters keep two kinds of information:
       - Configuration information (node/flow configuration): kept in a ZooKeeper ensemble for a persistent, highly available metadata store; failures are easily recovered from.
       - Ephemeral information (heartbeat info, acks, metric reports): kept in memory; failures will lose this data, but it can be lazily replicated.
  61. Dealing with agent failures
     - We do not want to lose data, so make events durable at the generation point.
     - If a log generator goes down, it is not generating logs; if the event generation point fails and recovers, the data will still reach the end point.
     - Data is durable and survives machine crashes and reboots.
     - Allows for synchronous writes in log-generating applications.
     - A watchdog program restarts the agent if it fails.
