Chicago Data Summit: Flume: An Introduction


Flume is an open-source, distributed, streaming log collection system designed for ingesting large quantities of data into large-scale data storage and analytics platforms such as Apache Hadoop. It was designed with four goals in mind: reliability, scalability, extensibility, and manageability. Its horizontally scalable architecture offers fault-tolerant end-to-end delivery guarantees, supports low-latency event processing, provides a centralized management interface, and exposes metrics for ingest monitoring and reporting. It natively supports writing data to Hadoop's HDFS, but it also has a simple extension interface that allows it to write to other scalable data systems such as low-latency datastores or incremental search indexers.

Transcript

  • 1.
  • 2. Flume: Logging for the Enterprise
    Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer
    Cloudera, Inc.
    Chicago Data Summit, 4/26/11
  • 3. Who Am I?
    Cloudera: Software Engineer on the Platform Team; Flume project lead / designer / architect.
    U of Washington: "on leave" from the PhD program; research in systems and programming languages.
    Previously: computer security, embedded systems.
  • 4. An Enterprise Scenario
    You have a bunch of departments with servers generating log files.
    You are required to keep logs, and you want to analyze and profit from them.
    Because of the volume of raw data, you've started using Cloudera's Distribution including Apache Hadoop.
    ... and you've got several ad-hoc, legacy scripts/systems that copy data from servers/filers and then into HDFS.
    "It's log, log... everyone wants a log!"
  • 5. Ad-hoc gets complicated
    Black box? What happens if the person who wrote it leaves?
    Unextensible? Is it a one-off, or flexible enough to handle future needs?
    Unmanageable? Do you know when something goes wrong?
    Unreliable? If something goes wrong, will it recover?
    Unscalable? Have you hit an ingestion rate limit?
  • 6. Cloudera Flume
    Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing.
    Project goals: Scalability, Reliability, Extensibility, Manageability, Openness.
  • 7. The Canonical Use Case
    [Diagram: many servers, each running a Flume agent, form an agent tier that feeds a collector tier; the collectors write into HDFS. The next three builds of this diagram add the Flume master that controls the nodes.]
  • 11. Flume's Key Abstractions
    Data path and control path.
    Nodes are in the data path. Nodes have a source and a sink, and they can take different roles. A typical topology has agent nodes and collector nodes, and optionally processor nodes.
    Masters are in the control path: a centralized point of configuration. They specify sources and sinks and can control flows of data between nodes. Use one master, or use many with a ZK-backed quorum.
    [Diagram: an agent node (source → sink) feeds a collector node (source → sink); a master configures both. A second build repeats the diagram with generic nodes.]
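    As an illustration of these abstractions, a minimal agent/collector pair might be configured roughly as follows. This is a sketch only: collectorSource and collectorSink appear later in the deck, while agentSink and the default collector port 35853 are assumptions in the style of Flume OG and may differ by version.

      agent01: tail("/var/log/app/log") | agentSink("collector01", 35853);
      collector01: collectorSource(35853) | collectorSink("hdfs://namenode/logs/app/", "data");

    Here agent01's source tails a local log file and its sink ships events to collector01; the collector's source listens on the collector port and its sink writes aggregated files into HDFS.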
  • 13. Outline
    What is Flume?
    Scalability: horizontal scalability of all nodes and masters.
    Reliability: fault tolerance and high availability.
    Extensibility: Unix principle; all kinds of data, all kinds of sources, all kinds of sinks.
    Manageability: centralized management supporting dynamic reconfiguration.
    Openness: Apache v2.0 license and an active, growing community.
  • 14. Scalability
  • 15. The Canonical Use Case
    [Diagram repeated from slide 7 (agent tier → collector tier → HDFS) as context for the scalability discussion.]
  • 16. Data path is horizontally scalable
    Add collectors to increase availability and to handle more data (assumes a single agent will not dominate a collector).
    Fewer connections to HDFS, which would otherwise tax the resource-constrained NameNode.
    Larger, more efficient writes to HDFS; fewer files avoids the "small file problem."
    Simplifies the security story when supporting Kerberized HDFS or protected production servers.
    Agents have mechanisms for trading off machine resources (see the sketch below):
    - Write the log locally to avoid a collector disk I/O bottleneck and catastrophic failures.
    - Compression and batching (trade CPU for network).
    - Push computation into the event collection pipeline (balance I/O, memory, and CPU bottlenecks).
    [Diagram: agents on servers → collector → HDFS.]
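    For example, batching and compression can be applied as decorators wrapped around the agent's sink. A rough sketch only; the decorator names batch() and gzip and the agentE2ESink/port follow Flume OG conventions and are assumptions that may differ by version:

      agent01: tail("/var/log/app/log") | { batch(100) => { gzip => agentE2ESink("collector01", 35853) } };

    Each group of 100 events is batched and compressed before crossing the wire, trading a little CPU and latency for network throughput.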
  • 17. Node scalability limits and optimization plans
    In most deployments today, a single collector is not saturated.
    The current implementation can write at about 20 MB/s over 1GbE (~1.75 TB/day) due to unoptimized network usage.
    Assuming 1GbE, with aggregate disk able to write at close to the GbE rate, we can probably reach:
    - 3-5x by batching to get to the wire/disk limit (trade latency for throughput)
    - 5-10x by compression, trading CPU for throughput (logs are highly compressible)
    The limit is probably in the ballpark of 40 effective TB/day per collector.
    [Diagram: agents on servers → collector → HDFS.]
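    As a rough check on those numbers: 20 MB/s × 86,400 s/day ≈ 1.73 TB/day, which matches the ~1.75 TB/day figure above. If a ~5x batching gain and a ~5x compression gain roughly multiply, the effective ingest rate is about 1.73 × 25 ≈ 43 TB/day per collector, in line with the ~40 TB/day ballpark (the factors are the deck's estimates, not measurements).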
  • 18. Control plane is horizontally scalable
    A master controls the dynamic configuration of nodes.
    Uses a consensus protocol to keep state consistent; scales well for configuration reads; allows for adaptive repartitioning in the future.
    Nodes can talk to any master.
    Masters can talk to an existing ZK ensemble.
    [Diagram: three masters backed by a three-node ZooKeeper ensemble (ZK1-ZK3); nodes connect to any master.]
  • 19. Reliability
  • 20. Failures
    Faults can happen at many levels: software applications can fail, machines can fail, networking gear can fail, there can be excessive network congestion or machine load, or a node goes down for maintenance.
    How do we make sure that events make it to a permanent store?
  • 21. Tunable failure recovery modes
    Best effort: fire and forget.
    Store on failure + retry: writes to disk on detected failure, one-hop TCP acks, failover when faults are detected.
    End-to-end reliability: write-ahead log on the agent, checksums and end-to-end acks; data survives compound failures and may be retried multiple times.
    [Diagram: agent → collector → HDFS shown for each of the three modes.]
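    In the configuration language these modes correspond to different agent sinks. A sketch only; the sink names agentBESink, agentDFOSink, and agentE2ESink follow Flume OG, and the collector host/port are placeholders:

      agentBE:  tail("/var/log/app/log") | agentBESink("collector01", 35853);
      agentDFO: tail("/var/log/app/log") | agentDFOSink("collector01", 35853);
      agentE2E: tail("/var/log/app/log") | agentE2ESink("collector01", 35853);

    The first sends best-effort, the second stores to local disk and retries on detected failure, and the third uses the write-ahead log with end-to-end acknowledgements.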
  • 22. Load balancing
    Agents are logically partitioned and send to different collectors.
    Use randomization to pre-specify failovers when many collectors exist: spread load if a collector goes down, and spread load if new collectors are added to the system.
    [Diagram: several agents fanning out across three collectors.]
  • 24. Load balancing and collector failover
    [Repeats the previous slide's points, with the diagram illustrating failover when a collector becomes unavailable.]
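    Explicit failover between collectors can also be written directly in the dataflow language. A sketch only, assuming Flume OG's failover construct < primary ? backup >; the syntax and the nesting of agent sinks shown here are illustrative and may differ between versions:

      agent01: tail("/var/log/app/log") | < agentSink("collector01", 35853) ? agentSink("collector02", 35853) >;

    In practice the auto sinks shown later (autoE2ESink with autoCollectorSource) let the master manage and randomize these failover chains for you.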
  • 26. Control plane is fault tolerant
    Same properties as the control-plane scalability slide: a master controls the dynamic configuration of nodes, a consensus protocol keeps state consistent and scales well for configuration reads, nodes can talk to any master, and masters can talk to an existing ZK ensemble.
    [Diagram, repeated with small variations across three builds: nodes reconnect to another master when one becomes unavailable.]
  • 29. Extensibility
  • 30. Flume is easy to extend
    Simple source and sink APIs.
    An event-streaming design: many simple operations compose into complex behavior.
    Plug-in architecture, so you can add your own sources, sinks, and decorators.
    [Diagram: a source feeding a decorator chain that fans out to multiple decorated sinks.]
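    As a flavor of how these pieces compose in the dataflow language (a sketch only; the gzip decorator and agentE2ESink/port are assumptions in the style of Flume OG, while fanout brackets and the console sink appear later in the deck):

      node01: tail("/var/log/app/log") | [ { gzip => agentE2ESink("collector01", 35853) }, console ];

    A single source fans out to two sinks, one of which is wrapped in a compression decorator; custom plug-in sources, sinks, and decorators slot into the same grammar.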
  • 31. Variety of Connectors
    Sources produce data: console, exec, syslog, Scribe, IRC, Twitter. In the works: JMS, AMQP, pubsubhubbub/RSS/Atom.
    Sinks consume data: console, local files, HDFS, S3. Contributed: Hive (Mozilla), HBase (Sematext), Cassandra (Riptano/DataStax), Voldemort, ElasticSearch. In the works: JMS, AMQP.
    Decorators modify data sent to sinks: wire batching, compression, sampling, projection, extraction, throughput throttling. Also: custom near-real-time processing (Meebo), JRuby event modifiers (InfoChimps), cryptographic extensions (Rearden), and a streaming SQL in-stream analytics system, FlumeBase (Aaron Kimball).
  • 32. Migrating previous enterprise architecture
    [Diagram: existing pipelines (a filer read by a poller, a message bus via AMQP, and a custom app via Avro) are fronted by Flume agents, which feed Flume collectors writing to HDFS.]
  • 33. Data ingestion pipeline pattern
    [Diagram: agents feed a collector whose fanout writes simultaneously to HDFS (for Hive and Pig queries), to HBase (for key lookups and range queries), and to an incremental search index (for search and faceted queries).]
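    A collector implementing this pattern might fan out to several sinks at once. A sketch only: collectorSource, collectorSink, and the fanout brackets are from the deck, while hbaseSink and indexSink below are hypothetical placeholders for the contributed plug-ins:

      collector01: collectorSource | [ collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data"), hbaseSink("events"), indexSink("search01") ];

    Each event is delivered to all three stores, so batch queries, low-latency lookups, and search stay in sync off the same stream.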
  • 34. Manageability
    Wheeeeee!
  • 35. Configuring Flume
      node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ];
    A concise and precise configuration language for specifying dataflows in a node.
    Dynamic updates of configurations: allows for live failover changes, for handling newly provisioned machines, and for changing analytics.
    [Diagram: tail → filter → fanout → { console, roll → HDFS }.]
  • 36. Output bucketing
    Automatic output file management: HDFS files are written into buckets based on time-based tags.
      node: collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
    Producing, for example:
      /logs/web/2010/0715/1200/data-xxx.txt
      /logs/web/2010/0715/1200/data-xxy.txt
      /logs/web/2010/0715/1300/data-xxx.txt
      /logs/web/2010/0715/1300/data-xxy.txt
      /logs/web/2010/0715/1400/data-xxx.txt
      ...
  • 37. Configuration is straightforward
      node001: tail("/var/log/app/log") | autoE2ESink;
      node002: tail("/var/log/app/log") | autoE2ESink;
      ...
      node100: tail("/var/log/app/log") | autoE2ESink;
      collector1: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs");
      collector2: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs");
      collector3: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs");
  • 38. Centralized Dataflow Management Interfaces
    One place to specify node sources, sinks, and data flows.
    Basic web interface.
    Flume shell: command-line interface, scriptable.
    Cloudera Enterprise Flume Monitor App: graphical web interface.
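    A rough sketch of a scripted session with the Flume shell; the command names follow the Flume OG shell, and the exact invocation, prompt, and master host are assumptions:

      $ flume shell -c masterhost
      flume> getnodestatus
      flume> exec config node001 'tail("/var/log/app/log")' 'autoE2ESink'
      flume> getconfigs

    This connects to the master, lists node liveness, pushes a configuration to node001, and reads back the active configurations.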
  • 39. Enterprise Friendly
    Integrated as part of CDH3 and Cloudera Enterprise; RPM and DEB packaging for enterprise Linux; Flume Node for Windows (beta).
    Cloudera Enterprise support: 24/7 support SLAs, professional services.
    Cloudera Flume features for enterprises:
    - Kerberos authentication support for writing to "secure" HDFS
    - Detailed JSON-exposed metrics for monitoring integration (beta)
    - Log4j collection (beta)
    - High availability via multiple masters (alpha)
    - Encrypted SSL/TLS data path and control path support (in development)
  • 40. An enterprise story
    [Diagram: department servers (Windows and Linux) run Flume agents behind API endpoints; the agents feed a collector tier writing to Kerberized HDFS, with Active Directory / LDAP providing authentication.]
  • 41. Openness and Community
  • 42. Flume is Open Source
    Apache v2.0 open source license; independent from the Apache Software Foundation. You have the right to fork or modify the software.
    GitHub source code repository: http://github.com/cloudera/flume
    Regular tarball updates every 2-3 months; regular CDH packaging updates every 3-4 months.
    Always looking for contributors and committers.
  • 43. Growing user and developer community
    Steady growth in users and interest.
    Lots of innovation comes from the community.
    Community folks are willing to try incomplete features, providing early feedback and community fixes.
    Many interesting topologies in the community.
  • 48. [logo]: Multi Datacenter
    [Diagram: many API servers and processor servers, each with a Flume agent, feed a collector tier that writes to HDFS.]
  • 49. [logo]: Multi Datacenter (continued)
    [Same diagram as the previous slide, with a relay added to bridge the datacenters.]
  • 50. [logo]: Near Real-time Aggregator
    [Diagram: ad servers with Flume agents feed a collector; a tracker produces quick reports into a DB, while data landing in HDFS is checked by a Hive job that produces the verified reports.]
  • 51. Community Support
    Community-based mailing lists ("an answer in a few days"):
    - User: https://groups.google.com/a/cloudera.org/group/flume-user
    - Dev: https://groups.google.com/a/cloudera.org/group/flume-dev
    Community-based IRC chat room ("quick questions, quick answers"): #flume on irc.freenode.net
  • 52. Conclusions
  • 53. Summary
    Flume is a distributed, reliable, scalable, extensible system for collecting and delivering high-volume continuous event data such as logs.
    It is centrally managed, which allows for automated and adaptive configurations; this design also allows for near-real-time processing.
    Apache v2.0 license with an active and growing community.
    Part of Cloudera's Distribution including Apache Hadoop, updated for CDH3u0 and Cloudera Enterprise.
    Several CDH users in the community have it in production use, and several Cloudera Enterprise customers are evaluating it for production use.
  • 54. Related systems
    Remote syslog / rsyslog / syslog-ng: best effort; if the server is down, messages are lost.
    Chukwa (Yahoo! / Apache Incubator): designed as a monitoring system for Hadoop; mini-batches require MapReduce batch processing to demultiplex data; new HBase-dependent path; one of the core contributors (Ari) currently works at Cloudera (not on Chukwa).
    Scribe (Facebook): only durable-on-failure reliability mechanisms; the collector disk is the bottleneck; little visibility into system performance; little support or documentation; most Scribe deployments have been replaced by "Data Freeway."
    Kafka (LinkedIn): new system by LinkedIn; pull model; interesting, written in Scala.
  • 55. Questions?
    Contact: jon@cloudera.com, Twitter @jmhsieh
  • 56.
  • 57. Flow Isolation
    Isolate different kinds of data when and where it is generated.
    Have multiple logical nodes on a machine; each has its own data source and its own data sink.
    [Diagram: several agent machines, each hosting multiple logical agents, feeding separate collectors per flow.]
  • 58. Flow Isolation (continued)
    [Repeats the previous slide's points with a second view of the diagram.]
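    A sketch of what flow isolation might look like in the configuration language: two logical nodes hosted on the same machine, each with its own source, sink, and reliability level. The node names, paths, and agent sink variants here are illustrative (the sinks follow Flume OG naming and are assumptions):

      app-flow-01:    tail("/var/log/app/app.log") | agentE2ESink("collector-app", 35853);
      access-flow-01: tail("/var/log/httpd/access.log") | agentBESink("collector-web", 35853);

    The two flows never mix, even though both logical nodes run on the same physical host.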
  • 60. Master Service Failures
    A master machine should not be a single point of failure!
    Masters keep two kinds of information:
    - Configuration information (node/flow configuration): kept in the ZooKeeper ensemble as a persistent, highly available metadata store; failures are easily recovered from.
    - Ephemeral information (heartbeat info, acks, metrics reports): kept in memory; failures will lose this data, but it can be lazily replicated.
  • 61. Dealing with Agent failures
    We do not want to lose data, so make events durable at the generation point.
    If a log generator goes down, it is not generating logs; if the event generation point fails and recovers, data will still reach the end point.
    Data is durable and survives machine crashes and reboots.
    This allows for synchronous writes in log-generating applications.
    A watchdog program restarts the agent if it fails.
