Chicago Data Summit: Flume: An Introduction
Flume is an open-source, distributed, streaming log collection system designed for ingesting large quantities of data into large-scale storage and analytics platforms such as Apache Hadoop. It was designed with four goals in mind: reliability, scalability, extensibility, and manageability. Its horizontally scalable architecture offers fault-tolerant end-to-end delivery guarantees, supports low-latency event processing, provides a centralized management interface, and exposes metrics for ingest monitoring and reporting. It natively supports writing data to Hadoop's HDFS, but also has a simple extension interface that allows it to write to other scalable data systems such as low-latency datastores or incremental search indexers.


    Presentation Transcript

    • Flume
      Logging for the Enterprise
      Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer
      Cloudera, Inc.
      Chicago Data Summit, 4/26/11
    • Who Am I?
      Cloudera:
      Software Engineer on the Platform Team
      Flume Project Lead / Designer / Architect
      U of Washington:
      “On Leave” from PhD program
      Research in Systems and Programming Languages
      Previously:
      Computer Security, Embedded Systems.
    • An Enterprise Scenario
      You have a bunch of departments with servers generating log files.
      You are required to keep logs and want to analyze and profit from them.
      Because of the volume of uncooked data, you’ve started using Cloudera’s Distribution including Apache Hadoop.
      … and you’ve got several ad-hoc, legacy scripts/systems that copy data from servers/filers and then to HDFS.
      It’s log, log .. Everyone wants a log!
    • Ad-hoc gets complicated
      Black box?
      What happens if the person who wrote it leaves?
      Unextensible?
      Is it one-off or flexible enough to handle future needs?
      Unmanageable?
      Do you know when something goes wrong?
      Unreliable?
      If something goes wrong, will it recover?
      Unscalable?
      Hit an ingestion rate limit?
    • Cloudera Flume
      Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing.
      Project Goals:
      Scalability
      Reliability
      Extensibility
      Manageability
      Openness
    • The Canonical Use Case
      [Diagram, built up over four slides: an agent tier of Flume agents on application servers feeds a collector tier; the collectors write to HDFS; a Flume master controls the nodes in both tiers. A configuration sketch follows.]
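      As a sketch in Flume’s configuration language, using only source and sink names that appear later in this deck, one agent and one collector from this topology might be configured as:
      node001: tail("/var/log/app/log") | autoE2ESink;
      collector1: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")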
    • Flume’s Key Abstractions
      Data path and control path
      Nodes are in the data path
      Nodes have a source and a sink
      They can take different roles
      A typical topology has agent nodes and collector nodes; optionally it has processor nodes.
      Masters are in the control path.
      Centralized point of configuration.
      Specify sources and sinks
      Can control flows of data between nodes
      Use one master, or use many with a ZooKeeper-backed quorum
      [Diagram, shown twice: an agent node (source | sink) sends events to a collector node (source | sink); a master configures both.]
    • Outline
      What is Flume?
      Scalability
      Horizontal scalability of all nodes and masters
      Reliability
      Fault-tolerance and High availability
      Extensibility
      Unix principle, all kinds of data, all kinds of sources, all kinds of sinks
      Manageability
      Centralized management supporting dynamic reconfiguration
      Openness
      Apache v2.0 License and an active and growing community
    • Scalability
    • The Canonical Use Case
      [Diagram repeated: agent tier on servers → collector tier → HDFS.]
    • Data path is horizontally scalable
      Add collectors to increase availability and to handle more data
      Assumes a single agent will not dominate a collector
      Fewer connections to HDFS, which eases load on the resource-constrained NameNode
      Larger, more efficient writes to HDFS and fewer files avoid the “small file problem”
      Simplifies the security story when supporting Kerberized HDFS or protected production servers.
      • Agents have mechanisms for machine resource tradeoffs
      Write logs locally to avoid the collector disk I/O bottleneck and catastrophic failures
      Compression and batching (trade CPU for network), as sketched below
      Push computation into the event collection pipeline (balance I/O, memory, and CPU resource bottlenecks)
      [Diagram: multiple agents on servers → collector → HDFS.]
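      A minimal sketch of the compression-and-batching tradeoff in the configuration language, assuming the batch(n) and gzip decorator names and the { decorator => sink } form from the 0.9.x user guide:
      node001: tail("/var/log/app/log") | { batch(100) => { gzip => autoE2ESink } };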
    • Node scalability limits and optimization plans
      In most deployments today, a single collector is not saturated.
      The current implementation can write at 20 MB/s over 1GbE (~1.75 TB/day) due to unoptimized network usage.
      Assuming 1GbE and aggregate disk bandwidth close to the GbE rate, we can probably reach:
      3-5x by batching to get to the wire/disk limit (trade latency for throughput)
      5-10x by compression to trade CPU for throughput (logs are highly compressible)
      The limit is probably in the ballpark of 40 effective TB/day per collector.
    • Control plane is horizontally scalable
      A master controls dynamic configurations of nodes
      Uses a consensus protocol to keep state consistent
      Scales well for configuration reads
      Allows for adaptive repartitioning in the future
      Nodes can talk to any master.
      Masters can talk to an existing ZK ensemble
      [Diagram: three masters backed by a ZooKeeper ensemble (ZK1, ZK2, ZK3); nodes can connect to any master.]
    • Reliability
    • Failures
      Faults can happen at many levels
      Software applications can fail
      Machines can fail
      Networking gear can fail
      Excessive networking congestion or machine load
      A node goes down for maintenance.
      How do we make sure that events make it to a permanent store?
    • Tunable failure recovery modes
      Best effort
      Fire and forget
      Store on failure + retry
      Writes to disk on detected failure; one-hop TCP acks; failover when faults are detected.
      End-to-end reliability
      Write-ahead log on the agent; checksums and end-to-end acks.
      Data survives compound failures, and may be retried multiple times.
      [Diagram: each mode shown as an agent → collector → HDFS pipeline; a configuration sketch follows.]
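      In the configuration language, the three modes correspond to three agent sinks; a hedged sketch, assuming the agentBESink / agentDFOSink / agentE2ESink names and default collector port from the 0.9.x documentation:
      nodeBE: tail("/var/log/app/log") | agentBESink("collector01", 35853);
      nodeDFO: tail("/var/log/app/log") | agentDFOSink("collector01", 35853);
      nodeE2E: tail("/var/log/app/log") | agentE2ESink("collector01", 35853);
      The first is fire-and-forget, the second spools to local disk and retries, and the third adds the write-ahead log and end-to-end acks.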
    • Load balancing and collector failover
      • Agents are logically partitioned and send to different collectors
      • Use randomization to pre-specify failovers when many collectors exist
      Spread load if a collector goes down.
      Spread load if new collectors are added to the system.
      [Diagram, shown twice: agents spread across several collectors; when a collector fails, its agents fail over to the remaining collectors. A failover-chain sketch follows.]
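      A sketch of a pre-specified failover chain, assuming the agentE2EChain form from the 0.9.x user guide (host names are illustrative):
      node001: tail("/var/log/app/log") | agentE2EChain("collector01:35853", "collector02:35853");
      If collector01 is unreachable, events flow to collector02; the auto*Sink forms seen elsewhere in this deck let the master compute such chains automatically.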
    • Control plane is Fault Tolerant
      A master controls dynamic configurations of nodes
      Uses a consensus protocol to keep state consistent
      Scales well for configuration reads
      Allows for adaptive repartitioning in the future
      Nodes can talk to any master.
      Masters can talk to an existing ZK ensemble
      [Diagram, built up over three slides: masters backed by ZK1/ZK2/ZK3; when a master fails, its nodes reconnect to a surviving master.]
    • Extensibility
    • Flume is easy to extend
      Simple source and sink APIs
      An event streaming design
      Many simple operations compose into complex behavior
      Plug-in architecture so you can add your own sources, sinks, and decorators
      [Diagram: source → decorator → fanout → decorators → sinks; a configuration sketch follows.]
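      The source → decorator → fanout pipeline above can be written directly in the configuration language; a sketch combining the fanout brackets and dfs sink shown later in this deck with the { decorator => sink } form (an assumption from the 0.9.x syntax):
      node: collectorSource | { batch(10) => [ console, { gzip => dfs("hdfs://namenode/user/flume") } ] };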
    • Variety of Connectors
      Sources produce data
      Console, Exec, Syslog, Scribe, IRC, Twitter
      In the works: JMS, AMQP, PubSubHubbub/RSS/Atom
      Sinks consume data
      Console, local files, HDFS, S3
      Contributed: Hive (Mozilla), HBase (Sematext), Cassandra (Riptano/DataStax), Voldemort, ElasticSearch
      In the works: JMS, AMQP
      Decorators modify data sent to sinks
      Wire batching, compression, sampling, projection, extraction, throughput throttling
      Custom near-real-time processing (Meebo)
      JRuby event modifiers (InfoChimps)
      Cryptographic extensions (Rearden)
      Streaming SQL in-stream analytics system: FlumeBase (Aaron Kimball)
      [Diagram: source → decorator → sink.]
    • Migrating previous enterprise architecture
      [Diagram: legacy ingestion paths feeding Flume: an agent with a poller source reads from a filer, an agent with an amqp source consumes from a message bus, and an agent with an avro source receives from a custom app; all feed Flume collectors writing to HDFS.]
    • Data ingestion pipeline pattern
      [Diagram: agents on servers feed a collector whose fanout writes to HDFS (Hive and Pig queries), HBase (key lookups and range queries), and an incremental search index (search and faceted queries). A configuration sketch follows.]
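      A hedged sketch of this fanout pattern at the collector: collectorSource, collectorSink, and the %Y/%m%d/%H00 bucketing escapes are from this deck, while hbase("events") and index("logs") are hypothetical names standing in for the contributed HBase and search-index sinks:
      collector1: collectorSource | [ collectorSink("hdfs://namenode/logs/app/%Y/%m%d/%H00", "data"), hbase("events"), index("logs") ]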
    • Manageability
      Wheeeeee!
    • Configuring Flume
      Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ];
      A concise and precise configuration language for specifying dataflows in a node.
      Dynamic updates of configurations
      Allows for live failover changes
      Allows for handling newly provisioned machines
      Allows for changing analytics
      [Dataflow diagram: tail → filter → fanout → { console, roll → hdfs }.]
    • Output bucketing
      Automatic output file management
      Writes HDFS files into time-bucketed directories based on event time tags
      [Diagram: the collector writes bucketed files to HDFS:]
      /logs/web/2010/0715/1200/data-xxx.txt
      /logs/web/2010/0715/1200/data-xxy.txt
      /logs/web/2010/0715/1300/data-xxx.txt
      /logs/web/2010/0715/1300/data-xxy.txt
      /logs/web/2010/0715/1400/data-xxx.txt

      node: collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
    • Configuration is straightforward
      node001: tail("/var/log/app/log") | autoE2ESink;
      node002: tail("/var/log/app/log") | autoE2ESink;

      node100: tail("/var/log/app/log") | autoE2ESink;
      collector1: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
      collector2: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
      collector3: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
    • Centralized Dataflow Management Interfaces
      One place to specify node sources, sinks and data flows.
      Basic Web interface
      Flume Shell
      Command line interface
      Scriptable (see the shell sketch below)
      Cloudera Enterprise
      Flume Monitor App
      Graphical web interface
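      A sketch of driving the same configuration from the Flume shell; the command form follows the 0.9.x shell, so treat the exact syntax as an assumption:
      $ flume shell -c masterhost
      exec config node001 'tail("/var/log/app/log")' 'autoE2ESink'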
    • Enterprise Friendly
      Integrated as part of CDH3 and Cloudera Enterprise
      RPM and DEB packaging for enterprise Linux
      Flume Node for Windows (beta)
      Cloudera Enterprise Support
      24-7 Support SLAs
      Professional Services
      Cloudera Flume Features for Enterprises
      Kerberos Authentication support for writing to “secure” HDFS
      Detailed JSON-exposed metrics for monitoring integration (beta)
      Log4J collection (beta)
      High Availability via Multiple Masters (alpha)
      Encrypted SSL / TLS data path and control path support (dev)
    • An enterprise story
      [Diagram: department servers (Windows and Linux) run Flume agents over an api source; collectors authenticate against Active Directory / LDAP and write to Kerberized HDFS.]
    • Openness And Community
    • Flume is Open Source
      Apache v2.0 open source license
      Independent of the Apache Software Foundation
      You have the right to fork or modify the software
      GitHub source code repository
      http://github.com/cloudera/flume
      Regular tarball releases every 2-3 months.
      Regular CDH packaging updates every 3-4 months.
      Always looking for contributors and committers
    • Growing user and developer community
      • Steady growth in users and interest.
      • Lots of innovation comes from the community.
      • Community folks are willing to try incomplete features.
      • Early feedback and community fixes
      • Many interesting topologies in the community
    • Multi Datacenter
      [Diagram, two slides: in each datacenter, agents on API servers and processor servers feed a collector tier writing to HDFS; the second slide adds a relay that forwards a remote datacenter’s agent traffic to the collector tier.]
    • Near Real-time Aggregator
      [Diagram: agents on ad servers feed a collector; a tracker writes quick reports to a DB, while a Hive job over HDFS verifies the reports.]
    • Community Support
      Community-based mailing lists for support
      “an answer in a few days”
      User: https://groups.google.com/a/cloudera.org/group/flume-user
      Dev: https://groups.google.com/a/cloudera.org/group/flume-dev
      Community-based IRC chat room
      “quick questions, quick answers”
      #flume in irc.freenode.net
    • Conclusions
    • Summary
      Flume is a distributed, reliable, scalable, extensible system for collecting and delivering high-volume continuous event data such as logs.
      It is centrally managed, which allows for automated and adaptive configurations.
      This design allows for near-real time processing.
      Apache v2.0 License with active and growing community.
      Part of Cloudera’s Distribution including Apache Hadoop (updated for CDH3u0) and Cloudera Enterprise.
      Several CDH users in the community have it in production use.
      Several Cloudera Enterprise customers evaluating for production use.
    • Related systems
      Remote syslog-ng / rsyslog / syslog
      Best effort: if the server is down, messages are lost.
      Chukwa – Yahoo! / Apache Incubator
      Designed as a monitoring system for Hadoop.
      Mini-batches; requires MapReduce batch processing to demultiplex data.
      New HBase-dependent path
      One of the core contributors (Ari) currently works at Cloudera (not on Chukwa)
      Scribe – Facebook
      Only durable-on-failure reliability mechanisms.
      Collector disk is the bottleneck.
      Little visibility into system performance.
      Little support or documentation.
      Most Scribe deploys replaced by “Data Freeway”
      Kafka – LinkedIn
      New system by LinkedIn.
      Pull model.
      Interesting; written in Scala.
    • Questions?
      Contact info:
      jon@cloudera.com
      Twitter @jmhsieh
    • Flow Isolation
      Isolate different kinds of data when and where it is generated
      Have multiple logical nodes on a machine
      Each has its own data source
      Each has its own data sink
      [Diagram, two slides: multiple logical agents on each machine, each sending its own flow to a different collector.]
    • Image credits
      http://www.flickr.com/photos/victorvonsalza/3327750057/
      http://www.flickr.com/photos/victorvonsalza/3207639929/
      http://www.flickr.com/photos/victorvonsalza/3327750059/
      http://www.emvergeoning.com/?m=200811
      http://www.flickr.com/photos/juse/188960076/
      http://www.flickr.com/photos/23720661@N08/3186507302/
      http://clarksoutdoorchairs.com/log_adirondack_chairs.html
      http://www.flickr.com/photos/dboo/3314299591/
    • Master Service Failures
      A master machine should not be a single point of failure!
      Masters keep two kinds of information:
      Configuration information (node/flow configuration)
      Kept in a ZooKeeper ensemble: a persistent, highly available metadata store
      Failures are easily recovered from
      Ephemeral information (heartbeat info, acks, metrics reports)
      Kept in memory
      Failures will lose data
      This information can be lazily replicated
    • Dealing with Agent failures
      We do not want to lose data
      Make events durable at the generation point.
      If a log generator goes down, it is not generating logs.
      If the event generation point fails and recovers, data will reach the end point
      Data is durable and survives machine crashes and reboots
      Allows for synchronous writes in log-generating applications.
      A watchdog program restarts the agent if it fails.