Inside Flume

                            Henry Robinson
                          henry@cloudera.com
                               @henryr




Tuesday, 17 August 2010
Who am I?

  • Distributed systems guy

  • Apache ZooKeeper committer

  • I work at Cloudera on Flume, ZooKeeper, Hue, more...

  • p.s. Cloudera is hiring!




About Cloudera

  • Software, services and support for Hadoop
  • Built around an open core
        • All our patches get contributed upstream
        • Flume and Hue are open-source
        • We just started the Whirr project
  • We maintain, package and support Cloudera’s Distribution
    for Hadoop
        • Smoothing off a lot of the rough edges around Hadoop
        • Includes MapReduce, HDFS, HBase, ZooKeeper, Oozie, Hive,
          Pig, Hue, Flume and more.


What’s the problem?

  • Data collection is currently a priori and ad hoc

  • A priori - decide what you want to collect ahead of time

  • Ad hoc - each kind of data source goes through its own
    collection path
        • Usually a collection of fragile, custom scripts




What is Flume? (and how can it help?)

  • Flume is:
        •   A distributed data collection service
        •   Scalable
        •   Configurable
        •   Extensible
        •   Manageable
        •   Open source
  • How can it help?
        • One-stop solution for data collection of all formats
        • Flexible reliability guarantees allow careful performance tuning
        • Enables quick iteration on new collection strategies


The Flume Model

  • Built around the concept of flows
  • A single flow corresponds to a type of data source
        • Like web server logs
        • Or machine monitoring metrics
  • Different flows might have different compression,
    batching or reliability setups
        • Flume multiplexes many flows onto one service instance
  • Flows are composed of nodes chained together
        • Each Flume process can run many nodes, so resources are
          shared
        • Each node receives data at its source, and sends it to its sink


Flume Flows

  • Three typical flows, all on the same Flume service


  [Diagram: Flow 1 - web clicks (reliable delivery, compressed, batched);
   Flow 2 - process monitoring (best-effort delivery); Flow 3 - advert
   impressions (reliable delivery). Data enters each flow on the left and
   emerges as events on the right.]

Anatomy of a Flume node

  • Data come in through a source...
  • ... are optionally processed by one or more decorators...
  • ... and then are transmitted out via a sink
  • Each of these components is (re-)configurable at run-time
  • Each has a very simple API, and a plugin interface that makes
    customizing Flume very easy (see the sketch below)
  • These simple abstractions are sufficient to build more
    complex features like acknowledged delivery, filtering,
    compression
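
  To make the source -> decorator -> sink pipeline concrete, here is a
  minimal Java sketch. The interfaces are hypothetical stand-ins, not
  Flume's actual plugin API; a real decorator would transform Flume events
  rather than plain strings.

      // Hypothetical stand-ins for Flume's plugin interfaces (illustration only).
      interface EventSource { String next(); }          // produces raw event bodies
      interface EventSink   { void append(String e); }  // consumes event bodies

      // A decorator wraps a sink and transforms events on the way through.
      class UpperCaseDecorator implements EventSink {
          private final EventSink downstream;
          UpperCaseDecorator(EventSink downstream) { this.downstream = downstream; }
          public void append(String e) { downstream.append(e.toUpperCase()); }
      }

      public class NodeSketch {
          public static void main(String[] args) {
              EventSource source = () -> "a log line";                  // stand-in source
              EventSink sink = e -> System.out.println("sink: " + e);   // stand-in sink
              EventSink node = new UpperCaseDecorator(sink);            // src | { deco => sink }
              node.append(source.next());
          }
      }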

Agents and Collectors

  • Nodes that receive data from an application are called
    agents
  • Flume supports many sources for agents, including:
        •   Syslog
        •   Tailing a file (sketched below)
        •   Unix processes
        •   Scribe API
        •   Twitter
  • Nodes that write data to permanent storage are called
    collectors
        • Most often they write to HDFS
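
  As a sketch of what a tail-style agent source does, the Java below polls
  a file and emits newly appended lines. It is illustrative only - not
  Flume's implementation - and the log path is just an example.

      import java.io.IOException;
      import java.io.RandomAccessFile;

      // Illustrative tail-style source: poll a file, emit newly appended lines.
      public class TailSketch {
          public static void main(String[] args) throws IOException, InterruptedException {
              RandomAccessFile f = new RandomAccessFile("/var/log/httpd.log", "r");
              long offset = f.length();               // start at the end, like tail -f
              while (true) {                          // a sketch: loops forever
                  if (f.length() > offset) {
                      f.seek(offset);
                      String line;
                      while ((line = f.readLine()) != null) {
                          System.out.println("event: " + line);  // hand off to the node's sink
                      }
                      offset = f.getFilePointer();
                  }
                  Thread.sleep(1000);                 // poll interval
              }
          }
      }
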
Flume Nodes

  [Diagram: the three node roles in a flow.
   Agent - source: tail Apache HTTPD logs; sink: forward to a downstream
   processor node.
   Processor - source: receive from the upstream agent node; decorator:
   extract the browser name from the log string and attach it to the event;
   sink: forward to a downstream collector node.
   Collector - source: receive from the upstream processor node; sink:
   write to HDFS at hdfs://namenode/weblogs/%{browser}/.]

  • Each role may be played by many different nodes
  • Usually require substantially fewer collectors than agents

Flume Events

  • All data are transformed into a series of events

  • Events are a pair (body, metadata)

  • Body is a string of bytes

  • Metadata is a table mapping keys to values
        • Flume can use this to inform processing
        • Or simply write it with the event
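
  Concretely, an event can be pictured as a byte-array body plus a
  key/value metadata table. The Java below is a minimal illustrative
  sketch of that shape, not Flume's actual Event class; the field types
  are assumptions, and the "browser" key echoes the collector example
  later in this deck.

      import java.nio.charset.StandardCharsets;
      import java.util.HashMap;
      import java.util.Map;

      // Illustrative event shape: a byte-string body plus a metadata table.
      public class EventSketch {
          final byte[] body;                    // the raw payload
          final Map<String, String> metadata;   // e.g. {"browser" -> "Firefox"}

          EventSketch(byte[] body, Map<String, String> metadata) {
              this.body = body;
              this.metadata = metadata;
          }

          public static void main(String[] args) {
              Map<String, String> meta = new HashMap<>();
              meta.put("browser", "Firefox");   // a decorator might attach this
              EventSketch e = new EventSketch(
                  "GET /index.html HTTP/1.1".getBytes(StandardCharsets.UTF_8), meta);
              // A collector sink could use %{browser} from the metadata to pick a bucket.
              System.out.println(new String(e.body, StandardCharsets.UTF_8) + " " + e.metadata);
          }
      }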


The Flume Configuration Language

  • Node configurations are written in a simple language
        • my-flume-node : src | { decorator => sink }
  • For example: a configuration to read HTTP log data from
    a file and send it to a collector:
        • web-log-agent : tail("/var/log/httpd.log") | agentBESink
  • On the collector, receive data and bucket it according to
    browser:
        • web-log-collector : autoCollectorSource
          | { regex("(Firefox|Internet Explorer)", "browser") =>
          collectorSink("hdfs://namenode/flume-logs/%{browser}") }
  • Two lines to set up an entire flow


Keeping Track of Nodes

  • The master service monitors all Flume nodes
        • A single port of call for checking on the health of your Flume
          service
  • Send commands to the master, and it will forward them
    to the nodes
  • The Flume Shell is a convenient, scriptable command-line
    tool
  • Web-based UIs are also available



Flume as a Distributed System

  • Fundamental principle: Keep state out of the data path
    where possible
        •   Replication is costly
        •   Consistency is problematic
        •   Global knowledge is impractical
        •   Follow the end-to-end principle - put smarts at the edges (see
            the sketch below)
  • Advantages
        • Failures become much cheaper
        • Performance is better
  • Disadvantages
        • Have to weaken some delivery guarantees
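
  To illustrate the trade-off, here is a sketch of end-to-end acknowledged
  delivery: the agent at the edge keeps each event buffered until the
  collector confirms a durable write, so the hops in between can stay
  stateless. This is an illustration of the principle, not Flume's actual
  protocol.

      import java.util.ArrayDeque;
      import java.util.Queue;

      // Illustrative end-to-end acknowledgement: the agent retains events
      // until the collector acks a durable write; intermediate hops hold no state.
      public class EndToEndSketch {
          static class Collector {
              boolean write(String event) {
                  System.out.println("durably wrote: " + event);
                  return true;                      // ack only after the write succeeds
              }
          }

          public static void main(String[] args) {
              Queue<String> unacked = new ArrayDeque<>();  // agent-side retry buffer
              Collector collector = new Collector();

              unacked.add("event-1");
              unacked.add("event-2");
              while (!unacked.isEmpty()) {
                  String e = unacked.peek();
                  if (collector.write(e)) {
                      unacked.poll();               // drop only once acknowledged
                  }                                 // else: leave it queued and retry
              }
          }
      }
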
Scalability and reliability in Flume

  • The data path is ‘horizontally scalable’
        • Add more machines, get more performance
        • Typically the bottleneck is write performance at the collector
        • If machines fail, others automatically take their place
  • The master only requires a few machines
        • Consistency and replication handled by ZooKeeper + gossip
        • A cluster of five or seven machines can handle thousands of
          nodes
        • Can add more if you manage to hit the limit



Flume as Open Source

  • http://github.com/cloudera/flume
  • Already a vibrant contributor community
  • Flume 0.9.1 is at release candidate 0 right now

  • Cloudera provides
        • Packages
        • Standardisation
        • Support



