Chicago Data Summit: Flume: An Introduction

Flume is an open-source, distributed, streaming log collection system designed for ingesting large quantities of data into large-scale storage and analytics platforms such as Apache Hadoop. It was designed with four goals in mind: reliability, scalability, extensibility, and manageability. Its horizontally scalable architecture offers fault-tolerant end-to-end delivery guarantees, supports low-latency event processing, provides a centralized management interface, and exposes metrics for ingest monitoring and reporting. It natively supports writing data to Hadoop's HDFS, and a simple extension interface allows it to write to other scalable data systems such as low-latency datastores or incremental search indexers.

Transcript

  • 1.
  • 2. Flume
    Logging for the Enterprise
    Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer
    Cloudera, Inc
    Chicago Data Summit, 4/26/11
  • 3. Who Am I?
    Cloudera:
    Software Engineer on the Platform Team
    Flume Project Lead / Designer / Architect
    U of Washington:
    “On Leave” from PhD program
    Research in Systems and Programming Languages
    Previously:
    Computer Security, Embedded Systems.
    3
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 4. An Enterprise Scenario
    You have a bunch of departments with servers generating log files.
    You are required to keep logs and want to analyze and profit from them.
    Because of the volume of uncooked data, you’ve started using Cloudera’s Distribution including Apache Hadoop.
    … and you’ve got several ad-hoc, legacy scripts/systems that copy data from servers/filers into HDFS.
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
    4
    It’s log, log .. Everyone wants a log!
  • 5. Ad-hoc gets complicated
    Black box?
    What happens if the person who wrote it leaves?
    Unextensible?
    Is it one-off or flexible enough to handle future needs?
    Unmanageable?
    Do you know when something goes wrong?
    Unreliable?
    If something goes wrong, will it recover?
    Unscalable?
    Hit an ingestion rate limit?
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
    5
  • 6. Cloudera Flume
    Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing.
    Project Goals:
    Scalability
    Reliability
    Extensibility
    Manageability
    Openness
    6
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 7. The Canonical Use Case
    (Diagram: an agent tier of per-server agents feeding a collector tier, which writes to HDFS.)
    7
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 8. The Canonical Use Case
    (Same diagram, with the agent and collector tiers identified as the Flume data path.)
    8
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 9. The Canonical Use Case
    (Same diagram, adding the Flume Master that controls the agent and collector tiers.)
    9
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 10. The Canonical Use Case
    (Same diagram: Master, agent tier, collector tier, and HDFS together.)
    10
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 11. Flume’s Key Abstractions
    Data path and control path
    Nodes are in the data path
    Nodes have a source and a sink
    They can take different roles
    A typical topology has agent nodes and collector nodes.
    Optionally it has processor nodes.
    Masters are in the control path.
    Centralized point of configuration.
    Specify sources and sinks
    Can control flows of data between nodes
    Use one master or use many with a ZK-backed quorum
    11
    (Diagram: an agent node and a collector node, each with a source and a sink, controlled by the Master. A configuration sketch follows this slide.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
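    To make the source/sink abstraction concrete, here is a minimal two-node flow in Flume's configuration language. This is only a sketch using the 0.9.x built-in source and sink names (agentSink, collectorSource, collectorSink) and the default collector port; exact names and ports may differ in your version.
    agent1 : tail("/var/log/app/app.log") | agentSink("collector1", 35853) ;
    collector1 : collectorSource(35853) | collectorSink("hdfs://namenode/flume/logs", "app-") ;
    The agent node tails a local file and forwards events to the collector; the collector listens for agent traffic and writes the events into HDFS.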
  • 12. Flume’s Key Abstractions
    Data path and control path
    Nodes are in the data path
    Nodes have a source and a sink
    They can take different roles
    A typical topology has agent nodes and collector nodes.
    Optionally it has processor nodes.
    Masters are in the control path.
    Centralized point of configuration.
    Specify sources and sinks
    Can control flows of data between nodes
    Use one master or use many with a ZK-backed quorum
    12
    (Diagram: two generic nodes, each with a source and a sink, controlled by the Master.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 13. Outline
    What is Flume?
    Scalability
    Horizontal scalability of all nodes and masters
    Reliability
    Fault-tolerance and High availability
    Extensibility
    Unix principle, all kinds of data, all kinds of sources, all kinds of sinks
    Manageability
    Centralized management supporting dynamic reconfiguration
    Openness
    Apache v2.0 License and an active and growing community
    13
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 14. Scalability
    14
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 15. The Canonical Use Case
    (Diagram repeated: the Flume agent tier feeding the collector tier, which writes to HDFS.)
    15
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 16. Data path is horizontally scalable
    Add collectors to increase availability and to handle more data
    Assumes a single agent will not dominate a collector
    Fewer connections to HDFS, which would otherwise tax the resource-constrained NameNode
    Larger, more efficient writes to HDFS and fewer files avoid the “small file problem”
    Simplifies the security story when supporting Kerberized HDFS or protected production servers.
    • Agents have mechanisms for machine resource tradeoffs
    Write the log locally to avoid collector disk IO bottlenecks and catastrophic failures
    Compression and batching (trade CPU for network)
    Push computation into the event collection pipeline (balance IO, memory, and CPU resource bottlenecks)
    16
    (Diagram: agents on servers feeding a collector that writes to HDFS. A decorator sketch of batching and compression follows this slide.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
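    As a sketch of the CPU-for-network tradeoff above, batching and compression can be expressed as decorators wrapped around the agent's sink. This assumes the batch() and gzip decorators and the agentDFOSink from Flume 0.9.x; names may vary by release.
    agent1 : tail("/var/log/app/app.log") | { batch(100) => { gzip => agentDFOSink("collector1", 35853) } } ;
    Each group of 100 events is compressed before crossing the network, and the disk-failover sink keeps a local copy if the collector is temporarily unreachable.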
  • 17. Node scalability limits and optimization plans
    17
    (Diagram: agents on servers feeding a single collector that writes to HDFS.)
    In most deployments today, a single collector is not saturated.
    The current implementation writes at about 20MB/s over 1GbE (~1.75 TB/day), limited by unoptimized network usage.
    Assuming 1GbE with aggregate disk able to write at close to the GbE rate, we can probably reach:
    3-5x by batching to get to the wire/disk limit (trade latency for throughput)
    5-10x by compression, trading CPU for throughput (logs are highly compressible)
    The limit is probably in the ballpark of 40 effective TB/day/collector.
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
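    As a rough check on these numbers: 20 MB/s × 86,400 s/day ≈ 1.7 TB/day, matching the ~1.75 TB/day figure; batching up toward the ~100 MB/s wire limit gives on the order of 8-9 TB/day, and a further ~5x from compressing highly compressible logs puts the effective rate in the ~40 TB/day range quoted above.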
  • 18. Control plane is horizontally scalable
    A master controls dynamic configurations of nodes
    Uses consensus protocol to keep state consistent
    Scales well for configuration reads
    Allows for adaptive repartitioning in the future
    Nodes can talk to any master.
    Masters can talk to an existing ZK ensemble
    (Diagram: three masters backed by a ZooKeeper ensemble (ZK1-ZK3), with nodes connected to the masters.)
    18
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 19. Reliability
    19
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 20. Failures
    Faults can happen at many levels
    Software applications can fail
    Machines can fail
    Networking gear can fail
    Excessive networking congestion or machine load
    A node goes down for maintenance.
    How do we make sure that events make it to a permanent store?
    20
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 21. Tunable failure recovery modes
    Best effort: fire and forget.
    Store on failure + retry: writes to disk on detected failure, one-hop TCP acks, failover when faults are detected.
    End-to-end reliability: write-ahead log on the agent, checksums and end-to-end acks; data survives compound failures and may be retried multiple times.
    (Diagram: an agent-to-collector-to-HDFS path for each of the three modes. A configuration sketch follows this slide.)
    21
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
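    In the configuration language, the three levels correspond to different agent sinks. This is a sketch using the 0.9.x names agentBESink (best effort), agentDFOSink (disk failover / store-on-failure), and agentE2ESink (end-to-end); exact names may differ by version.
    agentBE : tail("/var/log/app/app.log") | agentBESink("collector1", 35853) ;
    agentDFO : tail("/var/log/app/app.log") | agentDFOSink("collector1", 35853) ;
    agentE2E : tail("/var/log/app/app.log") | agentE2ESink("collector1", 35853) ;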
  • 22. Load balancing
    22
    • Agents are logically partitioned and send to different collectors
    • 23. Use randomization to pre-specify failovers when many collectors exist
    Spread load if a collector goes down.
    Spread load if new collectors are added to the system.
    (Diagram: agents partitioned across multiple collectors.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 24. Load balancing and collector failover
    • Agents are logically partitioned and send to different collectors
    • 25. Use randomization to pre-specify failovers when many collectors exist
    Spread load if a collector goes down.
    Spread load if new collectors are added to the system.
    23
    (Diagram: an agent failing over from a downed collector to another collector. A failover configuration sketch follows this slide.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
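    Failover can also be specified explicitly per agent. The sketch below uses the 0.9.x failover syntax < primary ? backup > and the agentSink name; exact syntax may differ by version.
    agent1 : tail("/var/log/app/app.log") | < agentSink("collectorA", 35853) ? agentSink("collectorB", 35853) > ;
    The auto sinks used later (autoE2ESink on the configuration slide) are intended to let the master generate this kind of randomized failover chain automatically from the set of registered collectors.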
  • 26. Control plane is Fault Tolerant
    A master controls dynamic configurations of nodes
    Uses consensus protocol to keep state consistent
    Scales well for configuration reads
    Allows for adaptive repartitioning in the future
    Nodes can talk to any master.
    Masters can talk to an existing ZK ensemble
    (Diagram: three masters backed by a ZooKeeper ensemble (ZK1-ZK3), with nodes connected to the masters.)
    24
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 27. Control plane is Fault Tolerant
    A master controls dynamic configurations of nodes
    Uses consensus protocol to keep state consistent
    Scales well for configuration reads
    Allows for adaptive repartitioning in the future
    Nodes can talk to any master.
    Masters can talk to an existing ZK ensemble
    (Diagram repeated: masters, the ZK1-ZK3 ensemble, and nodes; nodes can connect to any master.)
    25
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 28. Control plane is Fault Tolerant
    A master controls dynamic configurations of nodes
    Uses consensus protocol to keep state consistent
    Scales well for configuration reads
    Allows for adaptive repartitioning in the future
    Nodes can talk to any master.
    Masters can talk to an existing ZK ensemble
    (Diagram repeated: masters, the ZK1-ZK3 ensemble, and nodes connected to the masters.)
    26
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 29. Extensibility
    27
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 30. Flume is easy to extend
    Simple source and sink APIs
    An event streaming design
    Many simple operations compose to produce complex behavior
    Plug-in architecture so you can add your own sources, sinks, and decorators
    28
    (Diagram: a source feeding decorators and a fanout that splits the stream into multiple decorated sinks.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 31. Variety of Connectors
    Sources produce data
    Console, Exec, Syslog, Scribe, IRC, Twitter
    In the works: JMS, AMQP, pubsubhubbub/RSS/Atom
    Sinks consume data
    Console, Local files, HDFS, S3
    Contributed: Hive (Mozilla), HBase (Sematext), Cassandra (Riptano/DataStax), Voldemort, Elastic Search
    In the works: JMS, AMQP
    Decorators modify data sent to sinks
    Wire batching, compression, sampling, projection, extraction, throughput throttling
    Custom near real-time processing (Meebo)
    JRuby event modifiers (InfoChimps)
    Cryptographic extensions (Rearden)
    Streaming SQL in-stream analytics system:
    FlumeBase (Aaron Kimball)
    29
    (Diagram: source → decorator → sink. A syslog-to-HDFS sketch follows this slide.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
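    As one concrete pairing of these connectors, a collector node can ingest syslog traffic directly and land it in HDFS. The sketch assumes the syslogTcp source from 0.9.x; the port and paths are illustrative.
    syslogCollector : syslogTcp(5140) | collectorSink("hdfs://namenode/flume/syslog/%Y/%m%d", "syslog-") ;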
  • 32. Migrating previous enterprise architecture
    30
    (Diagram: legacy paths wrapped by Flume: a poller agent reads from a filer, an amqp agent consumes a message bus, and an avro agent receives from a custom app; each path flows through a collector into HDFS.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 33. Data ingestion pipeline pattern
    31
    (Diagram: agents on servers feed a collector whose fanout writes to an incremental search index, HBase, and HDFS; those stores then serve search and faceted queries, key and range lookups, and Hive or Pig queries. A fanout configuration sketch follows this slide.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
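    The fanout in this pattern can be written directly in the configuration language. In this sketch the collector fans each event out to a time-bucketed HDFS path and to the console; in a real deployment the second branch would be one of the contributed HBase or search-index sinks listed on the connectors slide.
    collector1 : collectorSource(35853) | [ collectorSink("hdfs://namenode/flume/data/%Y/%m%d", "data-"), console ] ;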
  • 34. Manageability
    Wheeeeee!
    32
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 35. Configuring Flume
    Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;
    A concise and precise configuration language for specifying dataflows in a node.
    Dynamic updates of configurations
    Allows for live failover changes
    Allows for handling newly provisioned machines
    Allows for changing analytics
    33
    (Diagram: the dataflow for this spec: tail → filter → fanout → console, and roll → hdfs.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 36. Output bucketing
    Automatic output file management
    Writes HDFS files into time-based buckets using escape sequences in the output path
    34
    (Diagram: a collector writing time-bucketed files into HDFS, for example:)
    /logs/web/2010/0715/1200/data-xxx.txt
    /logs/web/2010/0715/1200/data-xxy.txt
    /logs/web/2010/0715/1300/data-xxx.txt
    /logs/web/2010/0715/1300/data-xxy.txt
    /logs/web/2010/0715/1400/data-xxx.txt

    node : collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
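    For example, with the %Y/%m%d/%H00 escape sequences in this spec, an event stamped 2010-07-15 13:02 lands under /logs/web/2010/0715/1300/, which is how the hourly directories in the listing above are produced.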
  • 37. Configuration is straightforward
    node001: tail("/var/log/app/log") | autoE2ESink;
    node002: tail("/var/log/app/log") | autoE2ESink;

    node100: tail("/var/log/app/log") | autoE2ESink;
    collector1: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs");
    collector2: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs");
    collector3: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs");
    35
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 38. Centralized Dataflow Management Interfaces
    One place to specify node sources, sinks and data flows.
    Basic Web interface
    Flume Shell
    Command line interface
    Scriptable
    Cloudera Enterprise
    Flume Monitor App
    Graphical web interface
    36
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
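    As a sketch of the scriptable path, the same configurations can be pushed from the Flume shell. The command names below follow the 0.9.x shell (connect, exec config) and the default master admin port; exact syntax may differ by version.
    flume shell
    connect masterhost:35873
    exec config node001 'tail("/var/log/app/log")' 'autoE2ESink'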
  • 39. Enterprise Friendly
    Integrated as part of CDH3 and Cloudera Enterprise
    RPM and DEB packaging for enterprise Linux
    Flume Node for Windows (beta)
    Cloudera Enterprise Support
    24-7 Support SLAs
    Professional Services
    Cloudera Flume Features for Enterprises
    Kerberos Authentication support for writing to “secure” HDFS
    Detailed JSON-exposed metrics for monitoring integration (beta)
    Log4J collection (beta)
    High Availability via Multiple Master (alpha)
    Encrypted SSL / TLS data path and control path support (dev)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
    37
  • 40. An enterprise story
    38
    (Diagram: department servers running Windows and Linux host agents next to their applications; a Flume collector tier, authenticating against Active Directory / LDAP, writes to a Kerberized HDFS.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 41. Openness And Community
    39
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 42. Flume is Open Source
    Apache v2.0 Open Source License
    Independent from the Apache Software Foundation
    You have the right to fork or modify the software
    GitHub source code repository
    http://github.com/cloudera/flume
    Regular tarball update versions every 2-3 months.
    Regular CDH packaging updates every 3-4 months.
    Always looking for contributors and committers
    40
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 43. Growing user and developer community
    41
    • Steady growth in users and interest.
    • 44. Lots of innovation comes from community
    • 45. Community folks are willing to try incomplete features.
    • 46. Early feedback and community fixes
    • 47. Many interesting topologies in the community
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 48. Multi Datacenter
    42
    (Diagram: a community multi-datacenter deployment: agents on API and processor servers feed local collector tiers, which forward into a central collector tier that writes to HDFS.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 49. Multi Datacenter
    43
    (Same deployment, with a relay forwarding events from the remote datacenter toward the central collector tier and HDFS.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 50. Near Real-time Aggregator
    44
    (Diagram: agents on ad servers feed a Flume collector writing to HDFS; a tracker path produces quick reports into a DB, and a Hive job later verifies those reports.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 51. Community Support
    Community-based mailing lists for support
    “an answer in a few days”
    User: https://groups.google.com/a/cloudera.org/group/flume-user
    Dev: https://groups.google.com/a/cloudera.org/group/flume-dev
    Community-based IRC chat room
    “quick questions, quick answers”
    #flume in irc.freenode.net
    45
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 52. Conclusions
    46
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 53. Summary
    Flume is a distributed, reliable, scalable, extensible system for collecting and delivering high-volume continuous event data such as logs.
    It is centrally managed, which allows for automated and adaptive configurations.
    This design allows for near-real time processing.
    Apache v2.0 License with active and growing community.
    Part of Cloudera’s Distribution including Apache Hadoop updated for CDH3u0 and Cloudera Enterprise.
    Several CDH users in the community have it in production use.
    Several Cloudera Enterprise customers are evaluating it for production use.
    47
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 54. Related systems
    Remote syslog-ng / rsyslog / syslog
    Best effort. If the server is down, messages are lost.
    Chukwa – Yahoo! / Apache Incubator
    Designed as a monitoring system for Hadoop.
    Minibatches, requires MapReduce batch processing to demultiplex data.
    New HBase dependent path
    One of the core contributors (Ari) currently works at Cloudera (not on Chukwa)
    Scribe - Facebook
    Only durable-on-failure reliability mechanisms.
    Collector disk is the bottleneck.
    Little visibility into system performance.
    Little support or documentation.
    Most Scribe deployments have been replaced by “Data Freeway”
    Kafka - LinkedIn
    New system by LinkedIn.
    Pull model.
    Interesting, written in Scala
    48
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 55. Questions?
    Contact info:
    jon@cloudera.com
    Twitter @jmhsieh
    49
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 56.
  • 57. Flow Isolation
    Isolate different kinds of data when and where it is generated
    Have multiple logical nodes on a machine
    Each has its own data source
    Each has its own data sink
    51
    (Diagram: several machines, each running multiple agents that send to separate collectors.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 58. Flow Isolation
    Isolate different kinds of data when and where it is generated
    Have multiple logical nodes on a machine
    Each has its own data source
    Each has its own data sink
    52
    (Diagram: one machine hosting multiple logical agent nodes, each with its own flow to a different collector. A configuration sketch follows this slide.)
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
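    In configuration terms, flow isolation just means giving each logical node on the machine its own spec. A sketch follows; the logical node names are illustrative, and mapping several logical nodes onto one physical machine is done through the master.
    webLogs : tail("/var/log/httpd/access.log") | agentE2ESink("collectorWeb", 35853) ;
    appLogs : tail("/var/log/app/app.log") | agentE2ESink("collectorApp", 35853) ;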
  • 59. Image credits
    http://www.flickr.com/photos/victorvonsalza/3327750057/
    http://www.flickr.com/photos/victorvonsalza/3207639929/
    http://www.flickr.com/photos/victorvonsalza/3327750059/
    http://www.emvergeoning.com/?m=200811
    http://www.flickr.com/photos/juse/188960076/
    http://www.flickr.com/photos/23720661@N08/3186507302/
    http://clarksoutdoorchairs.com/log_adirondack_chairs.html
    http://www.flickr.com/photos/dboo/3314299591/
    53
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 60. Master Service Failures
    A master machine should not be a single point of failure!
    Masters keep two kinds of information:
    Configuration information (node/flow configuration)
    Kept in a ZooKeeper ensemble as a persistent, highly available metadata store
    Failures are easily recovered from
    Ephemeral information (heartbeat info, acks, metrics reports)
    Kept in memory
    Failures will lose data
    This information can be lazily replicated
    54
    Jonathan Hsieh, Chicago Data Summit 4/26/2011
  • 61. Dealing with Agent failures
    We do not want to lose data
    Make events durable at the generation point.
    If a log generator goes down, it is not generating logs.
    If the event generation point fails and recovers, data will reach the end point
    Data is durable and survives machine crashes and reboots
    Allows for synchronous writes in log generating applications.
    Watchdog program to restart agent if it fails.
    55
    Jonathan Hsieh, Chicago Data Summit 4/26/2011