Flume
Reliable Distributed
Streaming Log Collection

Jonathan Hsieh, Henry Robinson, Patrick Hunt
Cloudera, Inc
7/15/2010
Scenario
• Situation:
      – You have hundreds of services producing logs in a datacenter.
      – They produce a lot of logs that you want to analyze.
      – You have Hadoop, a system for processing large volumes of data.


• Problem:
      – How do you reliably ship all your logs to a place where Hadoop can
        analyze them?




Use cases
• Collecting logs from nodes in a
  Hadoop cluster
• Collecting logs from services such
  as httpd, mail, etc.
• Collecting impressions from
  custom apps for an ad network

• But wait, there’s more!
      – Basic metrics availability
      – Basic online in-stream analysis

It’s log, log… Everyone wants a log!

A sample topology
[Diagram: an agent tier of many agents feeds a collector tier of three
collectors; the collectors write to HDFS under paths such as
/logs/web/2010/0715/1200, /logs/web/2010/0715/1300, and
/logs/web/2010/0715/1400; a master coordinates both tiers]
You need a “Flume”
• Flume is a distributed system that gets
  your logs from their source and
  aggregates them to where you want to
  process them.
• Open source, Apache v2.0 License
• Goals:
      – Reliability
      – Scalability
      – Extensibility
      – Manageability
                                      Columbia Gorge, Broughton Log Flume
Key abstractions
• Data path and control path
• Nodes are in the data path
      – Nodes have a source and a sink
      – They can take different roles
            • A typical topology has agent nodes and collector nodes.
            • Optionally it has processor nodes.
• Masters are in the control path.
      – Centralized point of configuration
      – Specify sources and sinks
      – Can control flows of data between nodes
      – Use one master, or use many backed by a ZK quorum

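The source | sink pairing above can be written in Flume’s dataflow language. A minimal sketch (host names, file path, and port number are hypothetical; the source and sink names follow the Flume user guide):

```
# An agent node tails a local log and forwards events to a collector.
web01 : tail("/var/log/httpd/access_log") | agentSink("collector01", 35853) ;

# A collector node listens for agent traffic and writes it to HDFS.
collector01 : collectorSource(35853) | collectorSink("hdfs://namenode/logs/web", "data") ;
```

Both lines would be pushed out to the nodes by the master, which is what makes it the centralized point of configuration.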
Masters
[Diagram: the same topology, highlighting the master in the control path and
the HDFS storage tier that the collectors write into]
Outline
• What is Flume?
      – Goals and architecture
• Reliability
      – Fault-tolerance and High availability
• Scalability
      – Horizontal scalability of all nodes and masters
• Extensibility
      – Unix principle, all kinds of data, all kinds of sources, all kinds of sinks
• Manageability
      – Centralized management supporting dynamic reconfiguration

RELIABILITY


                   The logs will still get there.
Failures
• Faults can happen at many levels
      – Software applications can fail
      – Machines can fail
      – Networking gear can fail
      – Network congestion or machine load can become excessive
      – Nodes can go down for maintenance


• How do we make sure that events make it to a permanent store?



Tunable data reliability levels
• Best effort
      – Fire and forget
• Store on failure + retry
      – Local acks; local errors are detectable
      – Failover when faults are detected
• End-to-end reliability
      – End-to-end acks
      – Data survives compound failures, and may be retried multiple times

[Diagram: in each mode, events flow Agent → Collector → HDFS]

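In the dataflow language the three levels correspond to three families of agent sinks; a sketch (collector name and log path hypothetical; the sink names are from the Flume user guide):

```
node-be  : tail("/var/log/app.log") | agentBESink("collector01") ;   # best effort: fire and forget
node-dfo : tail("/var/log/app.log") | agentDFOSink("collector01") ;  # store on disk on failure, then retry
node-e2e : tail("/var/log/app.log") | agentE2ESink("collector01") ;  # end-to-end acks
```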
Dealing with Agent failures
• We do not want to lose data
• Make events durable at the generation point.
      – If a log generator goes down, it is not generating logs.
      – If the event generation point fails and recovers, data will still reach
        the end point.
            • Data is durable and survives machine crashes and reboots.
      – Allows for synchronous writes in log-generating applications.


• Watchdog program to restart agent if it fails.


Dealing with Collector Failures
• Data is durable at the agent:
      – Minimizes the amount of state and possible data loss
      – Not necessary to durably keep intermediate state at the collector
      – Retry if a collector goes down


• Use hot failover so agents can use alternate paths:
      – Master predetermines failovers to load balance when collectors go down.




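Hot failover can be expressed directly in the sink spec; a sketch using the `< primary ? backup >` failover construct from the Flume user guide (host and collector names hypothetical):

```
# If collector01 is unreachable, the agent fails over to collector02.
web01 : tail("/var/log/httpd/access_log") |
        < agentSink("collector01") ? agentSink("collector02") > ;
```

In practice the master computes such failover chains for the agents, so load is spread automatically when a collector goes down.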
Master Service Failures
• A master machine should not be a single point of failure!
• Masters keep two kinds of information:

• Configuration information (node/flow configuration)
      – Kept in a ZooKeeper ensemble, a persistent, highly available metadata store
      – Failures easily recovered from

• Ephemeral information (heartbeat info, acks, metrics reports)
      – Kept in memory
      – Failures will lose data
      – This information can be lazily replicated

SCALABILITY



Logs jamming the Kemi River
Data path is horizontally scalable
[Diagram: many agents → one collector → HDFS]

• Add collectors to increase availability and to handle more data
      – Assumes a single agent will not dominate a collector
      – Fewer connections to HDFS
      – Larger, more efficient writes to HDFS
• Agents have mechanisms for trading off machine resources
      – Write logs locally to avoid collector disk I/O bottlenecks and
        catastrophic failures
      – Compression and batching (trade CPU for network)
      – Push computation into the event collection pipeline (balance I/O,
        memory, and CPU resource bottlenecks)


Load balancing
[Diagram: six agents partitioned across three collectors]

 • Agents are logically partitioned and send to different collectors
 • Use randomization to pre-specify failovers when many collectors exist
       – Spreads load if a collector goes down
       – Spreads load if new collectors are added to the system


Control plane is horizontally scalable
[Diagram: nodes talk to any of three masters; the masters coordinate through
a ZooKeeper ensemble (ZK1, ZK2, ZK3)]

• A master controls dynamic configurations of nodes
      – Uses a consensus protocol to keep state consistent
      – Scales well for configuration reads
      – Allows for adaptive repartitioning in the future
• Nodes can talk to any master.
• Masters can talk to any ZK member.
EXTENSIBILITY


                     Turn raw logs into something useful…
Flume is easy to extend
• Simple source and sink APIs
      – Event-granularity streaming design
      – Many simple operations that compose into complex behavior
• End-to-end principle
      – Put smarts and state at the endpoints. Keep the middle simple.
• Flume deals with reliability.
      – Just add a new source or a new sink; Flume has primitives to handle
        reliability.




Variety of Data sources
• Can deal with push and pull sources.

• Supports many legacy event sources
      – Tailing a file
      – Output from a periodically exec’ed program
      – Syslog, syslog-ng
      – Experimental: IRC / Twitter / Scribe / AMQP

[Diagram: an app can push to an agent, be polled by an agent, or embed an
agent directly]


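A couple of these sources in the dataflow language; a sketch (file path, UDP port, and names hypothetical; `tail` and `syslogUdp` are sources from the Flume user guide):

```
# Pull: follow a growing log file.
web01 : tail("/var/log/httpd/access_log") | agentSink("collector01") ;

# Push: accept syslog datagrams on a UDP port.
sys01 : syslogUdp(5140) | agentSink("collector01") ;
```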
Variety of Data output
• Send data to many sinks
      – Files, HDFS, console, RPC
      – Experimental: HBase, Voldemort, S3, etc.
• Supports an extensible variety of output formats and destinations
      – Output to language-neutral, open data formats (JSON, Avro, text)
      – Compressed output files in development
• Uses decorators to process event data in flight.
      – Sampling, attribute extraction, filtering, projection, checksumming,
        batching, wire compression, etc.


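Decorators wrap a sink in `{ decorator => sink }` form and can be chained; a sketch (batch size, port, and paths hypothetical; `batch` and `gzip` are decorators from the Flume user guide):

```
# Batch events 100 at a time, gzip each batch, then write to HDFS.
collector01 : collectorSource(35853) |
              { batch(100) => { gzip => collectorSink("hdfs://namenode/logs/web", "data") } } ;
```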
MANAGEABILITY



7/15/2010
                     Wheeeeee!   29
Centralized data flow management
• One place to specify node sources, sinks and data flows.
      – Simply specify the role of the node: collector, agent
      – Or specify a custom configuration for a node


• Control Interfaces:
      – Flume Shell
      – Basic web
      – HUE + Flume Manager App (Enterprise users)



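From the Flume shell, configuring a node is a one-liner; a sketch (host names, port, and paths hypothetical; the `connect` and `exec config` commands follow the Flume shell, but treat the exact syntax as an assumption):

```
connect master01:35873
exec config web01 'tail("/var/log/httpd/access_log")' 'agentSink("collector01")'
exec config collector01 'collectorSource' 'collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")'
```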
Output bucketing
[Diagram: collectors write bucketed files to HDFS, e.g.
/logs/web/2010/0715/1200/data-xxx.txt
/logs/web/2010/0715/1200/data-xxy.txt
/logs/web/2010/0715/1300/data-xxx.txt
/logs/web/2010/0715/1300/data-xxy.txt
/logs/web/2010/0715/1400/data-xxx.txt
…]


node : collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")


• Automatic output file management
      – Writes HDFS files into time-based bucket directories using escape
        sequences such as %Y, %m, %d, and %H



Simplified configurations
• To configure Flume nodes at a higher level, we use logical nodes.
      – The Flume node process is a physical node
      – Each Flume node process can host multiple logical nodes

• Allows for:
      – Less detail required in configurations
      – Less process-centric management overhead
      – Finer-grained resource control and isolation between flows


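Two logical nodes hosted by one physical Flume process might be configured like this; a sketch (the names, paths, and the idea that both run on host web01 are hypothetical):

```
# Two independent flows, each with its own source, sink, and reliability level.
web01-access : tail("/var/log/httpd/access_log") | agentE2ESink("collector01") ;
web01-syslog : syslogUdp(5140)                   | agentBESink("collector02") ;
```

The master can then map both logical nodes onto the single physical node, so one process carries two isolated flows.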
Flow Isolation
[Diagram: agents feed three separate collectors, one per data flow]

• Isolate different kinds of data when and where they are generated
      – Have multiple logical nodes on a machine
      – Each has its own data source
      – Each has its own data sink

For advanced users
• A concise and precise configuration language for specifying
  arbitrary data paths.
      – Dataflows are essentially DAGs
      – Control specific event flows
            • Enable durability and failover mechanisms
            • Tune the parameters of these mechanisms
      – Dynamic updates of configurations
            • Allows for live failover changes
            • Allows for handling newly provisioned machines
            • Allows for changing analytics



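The configuration language can express fan-out and per-flow tuning in one spec; a sketch using the `[ sink1, sink2 ]` fan-out form from the Flume user guide (names and paths hypothetical):

```
# Fan out each event to a local debugging console and to a reliable,
# failover-protected collector path.
app01 : tail("/var/log/app/app.log") |
        [ console, < agentE2ESink("collector01") ? agentE2ESink("collector02") > ] ;
```

Pushing a new version of such a line through the master reconfigures the node live, which is how failover changes, newly provisioned machines, and changed analytics are handled without restarts.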
CONCLUSIONS



Summary
• Flume is a distributed, reliable, scalable system for collecting and
  delivering high-volume continuous event data such as logs
      – Tunable data reliability levels
      – Reliable master backed by ZK
      – Writes data to HDFS into buckets ready for batch processing
      – Dynamically configurable nodes
      – Simplified, automated management for agent+collector topologies


• Open Source Apache v2.0.


Contribute!
• GitHub source repo
      – http://github.com/cloudera/flume
• Mailing lists
      – User: https://groups.google.com/a/cloudera.org/group/flume-user
      – Dev: https://groups.google.com/a/cloudera.org/group/flume-dev
• Development trackers
      – JIRA (bugs/ formal feature requests):
            • https://issues.cloudera.org/browse/FLUME
      – Review board (code reviews):
            • http://review.hbase.org -> http://review.cloudera.org
• IRC Channels
      – #flume @ irc.freenode.net


Image credits
•   http://www.flickr.com/photos/victorvonsalza/3327750057/
•   http://www.flickr.com/photos/victorvonsalza/3207639929/
•   http://www.flickr.com/photos/victorvonsalza/3327750059/
•   http://www.emvergeoning.com/?m=200811
•   http://www.flickr.com/photos/juse/188960076/
•   http://www.flickr.com/photos/23720661@N08/3186507302/
•   http://clarksoutdoorchairs.com/log_adirondack_chairs.html
•   http://www.flickr.com/photos/dboo/3314299591/
Flume intro-100717

More Related Content

Similar to Flume intro-100717

Hadoop for carrier
Hadoop for carrierHadoop for carrier
Hadoop for carrierFlytxt
 
Building Tungsten Clusters with PostgreSQL Hot Standby and Streaming Replication
Building Tungsten Clusters with PostgreSQL Hot Standby and Streaming ReplicationBuilding Tungsten Clusters with PostgreSQL Hot Standby and Streaming Replication
Building Tungsten Clusters with PostgreSQL Hot Standby and Streaming ReplicationLinas Virbalas
 
Living the Easy Life with Rules-Based Autonomic Database Clusters
Living the Easy Life with Rules-Based Autonomic Database ClustersLiving the Easy Life with Rules-Based Autonomic Database Clusters
Living the Easy Life with Rules-Based Autonomic Database ClustersLinas Virbalas
 
Überwachung virtueller Umgebungen
Überwachung virtueller UmgebungenÜberwachung virtueller Umgebungen
Überwachung virtueller UmgebungenStefan Bergstein
 
Introducing Apache Geode and Spring Data GemFire
Introducing Apache Geode and Spring Data GemFireIntroducing Apache Geode and Spring Data GemFire
Introducing Apache Geode and Spring Data GemFireJohn Blum
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutesdwmclary
 
AppResponse Xpert SaaS Edition
AppResponse Xpert SaaS EditionAppResponse Xpert SaaS Edition
AppResponse Xpert SaaS EditionGeneXus
 
Flume and Flive Introduction
Flume and Flive IntroductionFlume and Flive Introduction
Flume and Flive IntroductionHanborq Inc.
 
Data Virtualization and ETL
Data Virtualization and ETLData Virtualization and ETL
Data Virtualization and ETLLily Luo
 
Alluxio: Unify Data at Memory Speed
Alluxio: Unify Data at Memory SpeedAlluxio: Unify Data at Memory Speed
Alluxio: Unify Data at Memory SpeedAlluxio, Inc.
 
Apache MXNet for IoT with Apache NiFi
Apache MXNet for IoT with Apache NiFiApache MXNet for IoT with Apache NiFi
Apache MXNet for IoT with Apache NiFiTimothy Spann
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureJianfeng Zhang
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and FutureRajesh Balamohan
 
IoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFiIoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFiDataWorks Summit
 
Building Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFiBuilding Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFiBryan Bende
 
ICEflo Implementation Management Solution V1d1
ICEflo Implementation Management Solution V1d1ICEflo Implementation Management Solution V1d1
ICEflo Implementation Management Solution V1d1Agenor Technology Ltd
 
1 sysadmin vs 250 clusters de stockage
1 sysadmin vs 250 clusters de stockage1 sysadmin vs 250 clusters de stockage
1 sysadmin vs 250 clusters de stockageOVHcloud
 

Similar to Flume intro-100717 (20)

Hadoop for carrier
Hadoop for carrierHadoop for carrier
Hadoop for carrier
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Building Tungsten Clusters with PostgreSQL Hot Standby and Streaming Replication
Building Tungsten Clusters with PostgreSQL Hot Standby and Streaming ReplicationBuilding Tungsten Clusters with PostgreSQL Hot Standby and Streaming Replication
Building Tungsten Clusters with PostgreSQL Hot Standby and Streaming Replication
 
Living the Easy Life with Rules-Based Autonomic Database Clusters
Living the Easy Life with Rules-Based Autonomic Database ClustersLiving the Easy Life with Rules-Based Autonomic Database Clusters
Living the Easy Life with Rules-Based Autonomic Database Clusters
 
Überwachung virtueller Umgebungen
Überwachung virtueller UmgebungenÜberwachung virtueller Umgebungen
Überwachung virtueller Umgebungen
 
Introducing Apache Geode and Spring Data GemFire
Introducing Apache Geode and Spring Data GemFireIntroducing Apache Geode and Spring Data GemFire
Introducing Apache Geode and Spring Data GemFire
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutes
 
AppResponse Xpert SaaS Edition
AppResponse Xpert SaaS EditionAppResponse Xpert SaaS Edition
AppResponse Xpert SaaS Edition
 
Flume and Flive Introduction
Flume and Flive IntroductionFlume and Flive Introduction
Flume and Flive Introduction
 
Data Virtualization and ETL
Data Virtualization and ETLData Virtualization and ETL
Data Virtualization and ETL
 
Alluxio: Unify Data at Memory Speed
Alluxio: Unify Data at Memory SpeedAlluxio: Unify Data at Memory Speed
Alluxio: Unify Data at Memory Speed
 
Apache MXNet for IoT with Apache NiFi
Apache MXNet for IoT with Apache NiFiApache MXNet for IoT with Apache NiFi
Apache MXNet for IoT with Apache NiFi
 
Geode Meetup Apachecon
Geode Meetup ApacheconGeode Meetup Apachecon
Geode Meetup Apachecon
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
IoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFiIoT with Apache MXNet and Apache NiFi and MiniFi
IoT with Apache MXNet and Apache NiFi and MiniFi
 
Building Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFiBuilding Data Pipelines for Solr with Apache NiFi
Building Data Pipelines for Solr with Apache NiFi
 
Flume and HBase
Flume and HBase Flume and HBase
Flume and HBase
 
ICEflo Implementation Management Solution V1d1
ICEflo Implementation Management Solution V1d1ICEflo Implementation Management Solution V1d1
ICEflo Implementation Management Solution V1d1
 
1 sysadmin vs 250 clusters de stockage
1 sysadmin vs 250 clusters de stockage1 sysadmin vs 250 clusters de stockage
1 sysadmin vs 250 clusters de stockage
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19

Flume intro-100717

  • 1.
  • 2. Flume Reliable Distributed Streaming Log Collection Jonathan Hsieh, Henry Robinson, Patrick Hunt Cloudera, Inc 7/15/2010
  • 3. Scenario • Situation: – You have hundreds of services producing logs in a datacenter – They produce a lot of logs that you want analyzed – You have Hadoop, a system for processing large volumes of data • Problem: – How do I reliably ship all my logs to a place where Hadoop can analyze them? 7/15/2010 3
  • 4. Use cases • Collecting logs from nodes in a Hadoop cluster • Collecting logs from services such as httpd, mail, etc. • Collecting impressions from custom apps for an ad network • But wait, there’s more! – Basic metrics of availability – Basic online in-stream analysis (It’s log, log… Everyone wants a log!) 7/15/2010 4
  • 5. A sample topology (diagram): an agent tier of many Agents feeds a collector tier of Collectors, coordinated by a Master; the collectors write into HDFS under paths such as /logs/web/2010/0715/1200, /logs/web/2010/0715/1300, /logs/web/2010/0715/1400. 7/15/2010 5
  • 6. You need a “Flume” • Flume is a distributed system that gets your logs from their source and aggregates them to where you want to process them. • Open source, Apache v2.0 License • Goals: – Reliability – Scalability – Extensibility – Manageability Columbia Gorge, Broughton Log Flume 7/15/2010 6
  • 7. Key abstractions • Data path and control path • Nodes are in the data path – Nodes have a source and a sink – They can take different roles • A typical topology has agent nodes and collector nodes; optionally it has processor nodes • Masters are in the control path – Centralized point of configuration – Specify sources and sinks – Can control flows of data between nodes – Use one master, or use many with a ZK-backed quorum 7/15/2010 7
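The source and sink abstractions map directly onto Flume's dataflow configuration language (one such spec appears on the output-bucketing slide): each node is given a name, a source, and a sink. A minimal sketch, assuming Flume OG's `tail`, `agentSink`, `collectorSource`, and `collectorSink` primitives; the hostnames and port are hypothetical:

```
# an agent node: tail a local web log and forward events to a collector
web-agent : tail("/var/log/httpd/access.log") | agentSink("collector01", 35853);

# a collector node: receive agent events and aggregate them into HDFS
collector01 : collectorSource(35853) | collectorSink("hdfs://namenode/logs/web", "data");
```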
  • 8. A sample topology (the same agent tier / collector tier / Master diagram, revisited): Agents feed Collectors, which write time-bucketed paths such as /logs/web/2010/0715/1200 into HDFS. 7/15/2010 8
  • 9. Masters (diagram): the same topology annotated with its tiers, the Master in the control path alongside the agent tier, collector tier, and storage tier (HDFS) in the data path. 7/15/2010 9
  • 10. Outline • What is Flume? – Goals and architecture • Reliability – Fault-tolerance and High availability • Scalability – Horizontal scalability of all nodes and masters • Extensibility – Unix principle, all kinds of data, all kinds of sources, all kinds of sinks • Manageability – Centralized management supporting dynamic reconfiguration 7/15/2010 10
  • 11. RELIABILITY The logs will still get there. 7/15/2010 11
  • 12. Failures • Faults can happen at many levels – Software applications can fail – Machines can fail – Networking gear can fail – Networks can suffer excessive congestion, and machines excessive load – Nodes can go down for maintenance • How do we make sure that events make it to a permanent store? 7/15/2010 12
  • 13. Tunable data reliability levels • Best effort – Fire and forget • Store on failure + retry – Local acks; local errors are detectable – Failover when faults are detected • End-to-end reliability – End-to-end acks – Data survives compound failures, and may be retried multiple times 7/15/2010 13
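Each reliability level is selected simply by choosing an agent sink. A sketch assuming the Flume OG sink names `agentBESink` (best effort), `agentDFOSink` (disk failover / store on failure), and `agentE2ESink` (end to end); the collector host is hypothetical:

```
# best effort: fire and forget
agent1 : tail("/var/log/app.log") | agentBESink("collector01");

# store on failure + retry: buffer to local disk on error, ack locally, retry
agent2 : tail("/var/log/app.log") | agentDFOSink("collector01");

# end to end: hold events until the final destination acknowledges them
agent3 : tail("/var/log/app.log") | agentE2ESink("collector01");
```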
  • 14. Dealing with Agent failures • We do not want to lose data • Make events durable at the generation point – If a log generator goes down, it is not generating logs – If the event generation point fails and recovers, data will still reach the end point • Data is durable and survives machine crashes and reboots – Allows for synchronous writes in log-generating applications • A watchdog program restarts the agent if it fails 7/15/2010 14
  • 15. Dealing with Collector Failures • Data is durable at the agent: – Minimize the amount of state and possible data loss – Not necessary to durably keep intermediate state at collector – Retry if collector goes down. • Use hot failover so agents can use alternate paths: – Master predetermines failovers to load balance when collectors go down. 7/15/2010 15
  • 16. Master Service Failures • A master machine should not be a single point of failure! • Masters keep two kinds of information: • Configuration information (node/flow configuration) – Kept in a ZooKeeper ensemble as a persistent, highly available metadata store – Failures are easily recovered from • Ephemeral information (heartbeat info, acks, metrics reports) – Kept in memory – Failures will lose this data – This information can be lazily replicated 7/15/2010 16
  • 17. SCALABILITY 7/15/2010 Logs jamming the Kemi River 17
  • 18. A sample topology (the agent tier / collector tier / Master diagram repeated as the running example for scalability). 7/15/2010 18
  • 19. Data path is horizontally scalable • Add collectors to increase availability and to handle more data – Assumes a single agent will not dominate a collector – Fewer connections to HDFS – Larger, more efficient writes to HDFS • Agents have mechanisms for machine resource tradeoffs – Write logs locally to avoid collector disk IO bottlenecks and catastrophic failures – Compression and batching (trade CPU for network) – Push computation into the event collection pipeline (balance IO, memory, and CPU resource bottlenecks) 7/15/2010 19
  • 20. Load balancing • Agents are logically partitioned and send to different collectors • Use randomization to pre-specify failovers when many collectors exist – Spread load if a collector goes down – Spread load if new collectors are added to the system 7/15/2010 20
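Failover paths can also be spelled out explicitly in the configuration language. A sketch assuming Flume OG's failover-sink syntax `< primary ? backup >`; the hosts are hypothetical:

```
# send to collector01; if it is unreachable, fail over to collector02
agent : tail("/var/log/app.log") | < agentSink("collector01") ? agentSink("collector02") >;
```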
  • 22. Control plane is horizontally scalable • A master controls dynamic configurations of nodes – Uses a consensus protocol to keep state consistent – Scales well for configuration reads – Allows for adaptive repartitioning in the future • Nodes can talk to any master • Masters can talk to any ZK member 7/15/2010 22
  • 25. EXTENSIBILITY Turn raw logs into something useful… 7/15/2010 25
  • 26. Flume is easy to extend • Simple source and sink APIs – Event-granularity streaming design – Provide many simple operations that compose into complex behavior • End-to-end principle – Put smarts and state at the end points; keep the middle simple • Flume deals with reliability – Just add a new source or a new sink, and Flume's primitives handle reliability 7/15/2010 26
  • 27. Variety of data sources • Can deal with push, poll, and embedded sources • Supports many legacy event sources – Tailing a file – Output from a periodically exec'ed program – Syslog, syslog-ng – Experimental: IRC / Twitter / Scribe / AMQP 7/15/2010 27
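Each of these legacy sources is just a different source expression in the node configuration. A sketch assuming Flume OG's `tail` and `syslogTcp` sources; the paths, port, and collector host are hypothetical:

```
# push: follow an existing log file
mail-agent : tail("/var/log/mail.log") | agentSink("collector01");

# listen: accept syslog messages over TCP (e.g. forwarded from syslog-ng)
sys-agent : syslogTcp(5140) | agentSink("collector01");
```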
  • 28. Variety of data outputs • Send data to many sinks – Files, HDFS, console, RPC – Experimental: HBase, Voldemort, S3, etc. • Supports an extensible variety of output formats and destinations – Output to language-neutral, open data formats (JSON, Avro, text) – Compressed output files in development • Uses decorators to process event data in flight – Sampling, attribute extraction, filtering, projection, checksumming, batching, wire compression, etc. 7/15/2010 28
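Decorators wrap a sink and transform events as they stream through it. A sketch assuming Flume OG's `{ decorator => sink }` syntax and its `batch` and `gzip` decorators; the collector host is hypothetical:

```
# batch 100 events, compress each batch, then forward to the collector
agent : tail("/var/log/app.log") | { batch(100) => { gzip => agentSink("collector01") } };
```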
  • 29. MANAGEABILITY 7/15/2010 Wheeeeee! 29
  • 30. Centralized data flow management • One place to specify node sources, sinks and data flows. – Simply specify the role of the node: collector, agent – Or specify a custom configuration for a node • Control Interfaces: – Flume Shell – Basic web – HUE + Flume Manager App (Enterprise users) 7/15/2010 30
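A session with the Flume shell might look like the following. The `connect` and `exec config` commands follow the Flume OG shell; the master host, node names, and flow specs are hypothetical:

```
$ flume shell
connect master01:35873
exec config web-agent 'tail("/var/log/httpd/access.log")' 'agentDFOSink("collector01")'
exec config collector01 'collectorSource' 'collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")'
```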
  • 31. Output bucketing • Automatic output file management – Writes HDFS files into time-based buckets, e.g. /logs/web/2010/0715/1200/data-xxx.txt, /logs/web/2010/0715/1300/data-xxx.txt, /logs/web/2010/0715/1400/data-xxx.txt – Configured as: node : collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data") 7/15/2010 31
  • 32. Simplified configurations • To make configuring Flume nodes higher level, we use logical nodes – The Flume node process is a physical node – Each Flume node process can host multiple logical nodes • This: – Reduces the amount of detail required in configurations – Reduces process-centric management overhead – Allows finer-grained resource control and isolation of flows 7/15/2010 32
  • 33. Flow isolation • Isolate different kinds of data when and where it is generated – Have multiple logical nodes on a machine – Each has its own data source – Each has its own data sink 7/15/2010 33
  • 35. For advanced users • A concise and precise configuration language for specifying arbitrary data paths – Dataflows are essentially DAGs – Control specific event flows • Enable durability and failover mechanisms – Tune the parameters of these mechanisms • Dynamic updates of configurations – Allows for live failover changes – Allows for handling newly provisioned machines – Allows for changing analytics 7/15/2010 35
  • 37. Summary • Flume is a distributed, reliable, scalable system for collecting and delivering high-volume continuous event data such as logs – Tunable data reliability levels – Reliable master backed by ZK – Writes data to HDFS into buckets ready for batch processing – Dynamically configurable nodes – Simplified, automated management for agent+collector topologies • Open source, Apache v2.0 license 7/15/2010 37
  • 38. Contribute! • GitHub source repo – http://github.com/cloudera/flume • Mailing lists – User: https://groups.google.com/a/cloudera.org/group/flume-user – Dev: https://groups.google.com/a/cloudera.org/group/flume-dev • Development trackers – JIRA (bugs/ formal feature requests): • https://issues.cloudera.org/browse/FLUME – Review board (code reviews): • http://review.hbase.org -> http://review.cloudera.org • IRC Channels – #flume @ irc.freenode.net 7/15/2010 38
  • 39. Image credits • http://www.flickr.com/photos/victorvonsalza/3327750057/ • http://www.flickr.com/photos/victorvonsalza/3207639929/ • http://www.flickr.com/photos/victorvonsalza/3327750059/ • http://www.emvergeoning.com/?m=200811 • http://www.flickr.com/photos/juse/188960076/ • http://www.flickr.com/photos/23720661@N08/3186507302/ • http://clarksoutdoorchairs.com/log_adirondack_chairs.html • http://www.flickr.com/photos/dboo/3314299591/ 7/15/2010 40