Flume
Reliable Distributed
Streaming Log Collection

Jonathan Hsieh, Henry Robinson, Patrick Hunt
Cloudera, Inc
Hadoop World 2010, 10/12/2010
Flume
4 months after Hadoop
World 2010

Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer, Bruce Mitchener
Cloudera, Inc
Austin Hadoop Users Group 2/17/2011
Who Am I?
• Cloudera:
  – Software Engineer on the Platform Team
  – Flume Project Lead / Designer / Architect
• U of Washington:
  – “On Leave” from PhD program
  – Research in Systems and Programming Languages
• Previously:
  – Computer Security, Embedded Systems.

Austin Hadoop User Group, 2/17/2011
The basic scenario
• You have a bunch of servers
  generating log files.
• You figured out that your logs are
  valuable and you want to keep them
  and analyze them.
• Because of the volume of data,
  you’ve started using Apache
  Hadoop or Cloudera’s Distribution of
  Apache Hadoop.
• … and you’ve got some ad-hoc,
  hacked-together scripts that copy
  data from servers to HDFS.

It’s log, log … Everyone wants a log!
Ad-hockery gets complicated
• Reliability
   – Will your data still get there … if your scripts fail? … if your hardware fails? … if HDFS goes
     down? … if EC2 flakes out?
• Scale
   – As you add servers, will your scripts keep up with 100s of GB per day? Will you have tons of small
     files? Are you going to have tons of connections? Are you willing to suffer more latency to
     mitigate?
• Manageability
   – How do you know if the script failed on machine 172? What about logs from that other
     system? How do you monitor and configure all the servers? Can you deal with elasticity?
• Extensibility
   – Can you service custom logs? Send data to different places like HBase, Hive, or incremental
     search indexes? Can you do near-realtime?
• Blackbox
   – What happens when the guy who wrote it leaves?
Cloudera Flume
Flume is a framework and conduit for
collecting and quickly shipping data records
from many sources to one centralized
place for storage and processing.

Project Principles:
• Scalability
• Reliability
• Extensibility
• Manageability
• Openness

Flume: The Standard Use Case
[Diagram: each server runs a Flume agent (agent tier); groups of agents send to collectors (collector tier); the collectors write to HDFS. A master is shown alongside the data path.]
Flume’s Key Abstractions
• Data path and control path
• Nodes are in the data path
  – Nodes have a source and a sink
  – They can take different roles
       • A typical topology has agent nodes and collector nodes.
       • Optionally it has processor nodes.
• Masters are in the control path.
  – Centralized point of configuration.
  – Specify sources and sinks
  – Can control flows of data between nodes
  – Use one master or use many with a ZK-backed quorum

[Diagram: an agent node and a collector node, each with a source and a sink; a master.]
Can I has the codez?
node001: tail("/var/log/app/log") | autoE2ESink;
node002: tail("/var/log/app/log") | autoE2ESink;
…
node100: tail("/var/log/app/log") | autoE2ESink;

collector1: autoCollectorSource |
  collectorSink("hdfs://logs/app/", "applogs");
collector2: autoCollectorSource |
  collectorSink("hdfs://logs/app/", "applogs");
collector3: autoCollectorSource |
  collectorSink("hdfs://logs/app/", "applogs");
Outline
• What is Flume?
• Scalability
   – Horizontal scalability of all nodes and masters
• Reliability
   – Fault-tolerance and High availability
• Extensibility
   – Unix principle, all kinds of data, all kinds of sources, all kinds of sinks
• Manageability
   – Centralized management supporting dynamic reconfiguration
• Openness
   – Apache v2.0 License and an active and growing community

SCALABILITY



Flume: The Standard Use Case
[Diagram: servers with Flume agents (agent tier) send to collectors (collector tier), which write to HDFS.]
Data path is horizontally scalable
[Diagram: servers with agents send to a collector, which writes to HDFS.]

• Add collectors to increase availability and to handle more data
  – Assumes a single agent will not dominate a collector
  – Fewer connections to HDFS.
  – Larger, more efficient writes to HDFS.
• Agents have mechanisms for machine resource tradeoffs
  – Write log locally to avoid collector disk IO bottleneck and catastrophic failures
  – Compression and batching (trade CPU for network)
  – Push computation into the event collection pipeline (balance IO, memory, and CPU
    resource bottlenecks)
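The "compression and batching" tradeoff can be illustrated with a minimal Python sketch. This is not Flume code: `batch_events`, `compress_batch`, and `decompress_batch` are hypothetical helpers, the log lines are made up, and gzip stands in for whatever codec a real deployment would use.

```python
import gzip

def batch_events(events, batch_size):
    """Group individual log events into lists of at most batch_size."""
    return [events[i:i + batch_size] for i in range(0, len(events), batch_size)]

def compress_batch(batch):
    """Join a batch into one newline-delimited payload and gzip it (spends CPU)."""
    payload = "\n".join(batch).encode("utf-8")
    return gzip.compress(payload)

def decompress_batch(blob):
    """Inverse of compress_batch, as a receiving collector might do."""
    return gzip.decompress(blob).decode("utf-8").split("\n")

# Made-up, repetitive log lines of the kind web servers tend to produce.
events = ["GET /index.html 200 host=web%03d" % (i % 10) for i in range(1000)]
batches = batch_events(events, 100)

raw_bytes = sum(len(e.encode("utf-8")) for e in events)
wire_bytes = sum(len(compress_batch(b)) for b in batches)
print(len(batches), "batches;", raw_bytes, "raw bytes vs", wire_bytes, "on the wire")
```

Fewer, larger messages also mean fewer round trips, which is the same reasoning the slide applies to HDFS writes.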
RELIABILITY



Tunable failure recovery modes
• Best effort
   – Fire and forget
• Store on failure + retry
   – Local acks, local errors detectable
   – Failover when faults detected.
• End to end reliability
   – End to end acks
   – Data survives compound failures, and may be retried multiple times

[Diagram, repeated for each mode: Agent, Collector, HDFS.]
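A minimal Python sketch of the end-to-end idea, under loud assumptions: `FlakyCollector` and `EndToEndAgent` are hypothetical classes invented for illustration, not Flume's implementation. The agent keeps each batch locally until an acknowledgement comes back, and retries until then.

```python
class FlakyCollector:
    """Hypothetical collector that loses the first `failures` deliveries."""
    def __init__(self, failures):
        self.failures = failures
        self.stored = []

    def deliver(self, batch_id, events):
        if self.failures > 0:
            self.failures -= 1
            return None                 # no ack: the batch was lost downstream
        self.stored.append((batch_id, list(events)))
        return batch_id                 # ack is sent only once the data is stored


class EndToEndAgent:
    """Keeps every batch in a local pending log until it is acknowledged."""
    def __init__(self, collector, max_attempts=5):
        self.collector = collector
        self.max_attempts = max_attempts
        self.pending = {}               # batch_id -> events still awaiting an ack

    def send(self, batch_id, events):
        self.pending[batch_id] = list(events)    # record locally before sending
        for attempt in range(1, self.max_attempts + 1):
            ack = self.collector.deliver(batch_id, events)
            if ack == batch_id:
                del self.pending[batch_id]       # safe to forget after the ack
                return attempt
        return None                              # still pending; retry later


collector = FlakyCollector(failures=2)
agent = EndToEndAgent(collector)
attempts = agent.send("batch-1", ["event a", "event b"])
print("delivered after", attempts, "attempts; pending:", agent.pending)
```

Best effort would correspond to calling `deliver` once and ignoring the result; store on failure would keep the local copy only when the immediate hop reports an error.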
Load balancing and collector failover
[Diagram: six agents sending to three collectors.]

• Agents are logically partitioned and send to different collectors
• Use randomization to pre-specify failovers when many collectors exist
  – Spread load if a collector goes down.
  – Spread load if new collectors are added to the system.
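One way to picture pre-specified, randomized failover is the Python sketch below. It is an illustration only: `failover_chains` and `pick_collector` are hypothetical names, round-robin partitioning is an assumption, and Flume's actual assignment logic may differ.

```python
import random

def failover_chains(agents, collectors, seed=0):
    """Give each agent a primary collector plus a shuffled list of backups."""
    rng = random.Random(seed)           # seeded so the assignment is reproducible
    chains = {}
    for i, agent in enumerate(agents):
        primary = collectors[i % len(collectors)]    # logical partitioning
        backups = [c for c in collectors if c != primary]
        rng.shuffle(backups)                         # randomized failover order
        chains[agent] = [primary] + backups
    return chains

def pick_collector(chain, down):
    """Return the first collector in the chain that is not known to be down."""
    for collector in chain:
        if collector not in down:
            return collector
    return None

agents = ["agent%d" % i for i in range(6)]
collectors = ["collectorA", "collectorB", "collectorC"]
chains = failover_chains(agents, collectors)
for agent in agents:
    print(agent, "->", chains[agent])
```

Because each agent's backup order is shuffled independently, the agents of a failed collector do not all pile onto the same survivor.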
Control plane is horizontally scalable
[Diagram: three nodes, three masters, and a ZooKeeper ensemble (ZK1, ZK2, ZK3).]

• A master controls dynamic configurations of nodes
  – Uses consensus protocol to keep state consistent
  – Scales well for configuration reads
  – Allows for adaptive repartitioning in the future
• Nodes can talk to any master.
• Masters can talk to an existing ZK ensemble
MANAGEABILITY



Wheeeeee!
Centralized Dataflow Management Interfaces
• One place to specify node
  sources, sinks and data
  flows.

• Basic Web interface
• Flume Shell
  – Command line interface
  – Scriptable
• Cloudera Enterprise
  – Flume Monitor App
  – Graphical web interface

Configuring Flume

[Diagram: tail, then filter, then a fan-out to console and to roll, which feeds hdfs.]

Node: tail("file") | filter [ console, roll(1000) {
  dfs("hdfs://namenode/user/flume") } ] ;
• A concise and precise configuration language for specifying dataflows in
  a node.
• Dynamic updates of configurations
  – Allows for live failover changes
  – Allows for handling newly provisioned machines
  – Allows for changing analytics

Output bucketing
[Diagram: two collectors writing to HDFS, producing files such as:]
/logs/web/2010/0715/1200/data-xxx.txt
/logs/web/2010/0715/1200/data-xxy.txt
/logs/web/2010/0715/1300/data-xxx.txt
/logs/web/2010/0715/1300/data-xxy.txt
/logs/web/2010/0715/1400/data-xxx.txt
…

node : collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")

• Automatic output file management
  – Write HDFS files into buckets based on time tags
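The time escapes in the sink path resemble `strftime` codes, so the bucketing can be approximated in Python. This is an analogy, not Flume's own escape handling: `bucket_dir` is a hypothetical helper, and the timestamps are made up to match the listing above.

```python
from datetime import datetime

def bucket_dir(template, event_time):
    """Expand the time escapes in a directory template for one event."""
    return event_time.strftime(template)

template = "hdfs://namenode/logs/web/%Y/%m%d/%H00"

# Two events in the same hour share a directory; the next hour gets a new one.
for stamp in (datetime(2010, 7, 15, 12, 5),
              datetime(2010, 7, 15, 12, 59),
              datetime(2010, 7, 15, 13, 0)):
    print(stamp.isoformat(), "->", bucket_dir(template, stamp))
```

Coarser or finer buckets follow from the template alone, for example dropping `%H00` to get one directory per day.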
EXTENSIBILITY



Flume is easy to extend
• Simple source and sink APIs
  – An event streaming design
  – Many simple operations compose for complex behavior

• Plug-in architecture so you can add your own sources, sinks, and
  decorators

[Diagram: a source feeds a decorator, which fans out to one sink directly and to a second sink through another decorator.]
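How small pieces compose can be sketched in Python. The class names, the `append` method, and the `run` driver below are illustrative assumptions, not Flume's plug-in API (which is Java); the sample events are made up.

```python
class ListSink:
    """Terminal sink that simply remembers every event it receives."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)


class FilterDecorator:
    """Decorator: forwards only events for which keep(event) is true."""
    def __init__(self, keep, sink):
        self.keep = keep
        self.sink = sink

    def append(self, event):
        if self.keep(event):
            self.sink.append(event)


class FanOutSink:
    """Sink that copies each event to every child sink."""
    def __init__(self, *sinks):
        self.sinks = sinks

    def append(self, event):
        for sink in self.sinks:
            sink.append(event)


def run(source, sink):
    """Drive a node: pull events from the source and push them to the sink."""
    for event in source:
        sink.append(event)


everything, errors = ListSink(), ListSink()
errors_only = FilterDecorator(lambda e: "ERROR" in e, errors)
pipeline = FanOutSink(everything, errors_only)

run(iter(["INFO started", "ERROR disk full", "INFO stopped"]), pipeline)
print(len(everything.events), "events archived;", errors.events)
```

Because a decorator exposes the same interface as a sink, it can wrap a plain sink, a fan-out, or another decorator without any of them knowing.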
Variety of Connectors
• Sources produce data
   – Console, Exec, Syslog, Scribe, IRC, Twitter
   – In the works: JMS, AMQP, pubsubhubbub/RSS/Atom
• Sinks consume data
   – Console, Local files, HDFS, S3
   – Contributed: Hive (Mozilla), HBase (Sematext), Cassandra
     (Riptano/DataStax), Voldemort, Elastic Search
   – In the works: JMS, AMQP
• Decorators modify data sent to sinks
   – Wire batching, compression, sampling, projection,
     extraction, throughput throttling
   – Custom near real-time processing (Meebo)
   – JRuby event modifiers (InfoChimps)
   – Cryptographic extensions (Rearden)
Flume: Multi Datacenter
[Diagram: API servers and processor servers each run an agent; the agents send to a collector tier; a relay sits between the collectors and HDFS.]
Flume: Near Realtime Aggregator
[Diagram: ad servers with agents send to a tracker, then a collector, then HDFS. Other labels in the figure: quick reports, DB, Hive job, verify, reports.]
An enterprise story
[Diagram: API servers on Windows and Linux run Flume agents that send to a collector tier; the collectors write to an HDFS cluster labeled Kerberos HDFS (three boxes marked DD); Active Directory / LDAP is shown alongside.]
An emerging community story

[Diagram: servers with Flume agents send to a collector, which fans out to three sinks: hdfs to HDFS (Hive query, Pig query); hbase to HBase (key lookup, range query); index to an incremental search index (search query, faceted query).]
OPENNESS AND
COMMUNITY

Flume is Open Source
• Apache v2.0 Open Source License
  – Independent from Apache Software Foundation
• GitHub source code repository
  – http://github.com/cloudera/flume
  – Regular tarball update versions every 2-3 months.
  – Regular CDH packaging updates every 3-4 months.
• Review Board for code review
• New external committers wanted!
  – Cloudera folks: Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric
    Sammer
  – Independent folks: Bruce Mitchener
Growing user and developer community
• History:
  – Initial Open Source Release, June 2010
• Growth:
  – Pre-Hadoop Summit (Late June 2010):
       • 4 followers, 4 forks (original authors)
  – Pre-Hadoop World (October 2010):
       • 174 followers, 34 forks
  – Pre-CDH3B4 Release (February 2011):
       • 288 followers, 51 forks



Support
• Community-based mailing lists for support
  – “an answer in a few days”
  – User: https://groups.google.com/a/cloudera.org/group/flume-user
  – Dev: https://groups.google.com/a/cloudera.org/group/flume-dev
• Community-based IRC chat room
  – “quick questions, quick answers”
  – #flume in irc.freenode.net
• Commercial support with Cloudera Enterprise subscription
  – Chat with sales@cloudera.com


CONCLUSIONS



Summary
• Flume is a distributed, reliable, scalable, extensible system for
  collecting and delivering high-volume continuous event data such
  as logs.
  – It is centrally managed, which allows for automated and adaptive
    configurations.
  – This design allows for near-real time processing.
  – Apache v2.0 License with active and growing community


• Part of Cloudera’s Distribution for Hadoop, about to be refreshed
  for CDH3b4.

Questions? (and shameless plugs)
• Contact info:
  – jon@cloudera.com
  – Twitter @jmhsieh


• Cloudera Training in Dallas
  –    Hadoop Training for Developers - March 14-16
  –    Hadoop Training for Administrators - March 17-18
  –    Sign up at http://cloudera.eventbrite.com
  –    10% discount code for classes: "hug"

• Cloudera is Hiring!

Flume @ Austin HUG 2/17/11

More Related Content

What's hot

Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Steve Hoffman
 
Apache flume - an Introduction
Apache flume - an IntroductionApache flume - an Introduction
Apache flume - an Introduction
Erik Schmiegelow
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutes
dwmclary
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
GetInData
 
Apache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingApache Flume and its use case in Manufacturing
Apache Flume and its use case in Manufacturing
Rapheephan Thongkham-Uan
 
ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016
Jayesh Thakrar
 
Filesystems, RPC and HDFS
Filesystems, RPC and HDFSFilesystems, RPC and HDFS
Filesystems, RPC and HDFS
Alexander Alten
 
Flume and Flive Introduction
Flume and Flive IntroductionFlume and Flive Introduction
Flume and Flive Introduction
Hanborq Inc.
 
Cloudera's Flume
Cloudera's FlumeCloudera's Flume
Cloudera's Flume
Cloudera, Inc.
 
ApacheCon-HBase-2016
ApacheCon-HBase-2016ApacheCon-HBase-2016
ApacheCon-HBase-2016
Jayesh Thakrar
 
Highlights Of Sqoop2
Highlights Of Sqoop2Highlights Of Sqoop2
Highlights Of Sqoop2
Alexander Alten
 
Session 23 - Kafka and Zookeeper
Session 23 - Kafka and ZookeeperSession 23 - Kafka and Zookeeper
Session 23 - Kafka and Zookeeper
AnandMHadoop
 
High Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureHigh Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and Future
DataWorks Summit
 
Gummadi-47-Shadowbase-Technical-Overview.Final
Gummadi-47-Shadowbase-Technical-Overview.FinalGummadi-47-Shadowbase-Technical-Overview.Final
Gummadi-47-Shadowbase-Technical-Overview.Final
ajaya gummadi
 
HBase Read High Availability Using Timeline Consistent Region Replicas
HBase  Read High Availability Using Timeline Consistent Region ReplicasHBase  Read High Availability Using Timeline Consistent Region Replicas
HBase Read High Availability Using Timeline Consistent Region Replicas
enissoz
 
HBase state of the union
HBase   state of the unionHBase   state of the union
HBase state of the union
enissoz
 
Operating and supporting HBase Clusters
Operating and supporting HBase ClustersOperating and supporting HBase Clusters
Operating and supporting HBase Clusters
enissoz
 
Flume with Twitter Integration
Flume with Twitter IntegrationFlume with Twitter Integration
Flume with Twitter Integration
RockyCIce
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 
Session 09 - Flume
Session 09 - FlumeSession 09 - Flume
Session 09 - Flume
AnandMHadoop
 

What's hot (20)

Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
 
Apache flume - an Introduction
Apache flume - an IntroductionApache flume - an Introduction
Apache flume - an Introduction
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutes
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Apache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingApache Flume and its use case in Manufacturing
Apache Flume and its use case in Manufacturing
 
ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016
 
Filesystems, RPC and HDFS
Filesystems, RPC and HDFSFilesystems, RPC and HDFS
Filesystems, RPC and HDFS
 
Flume and Flive Introduction
Flume and Flive IntroductionFlume and Flive Introduction
Flume and Flive Introduction
 
Cloudera's Flume
Cloudera's FlumeCloudera's Flume
Cloudera's Flume
 
ApacheCon-HBase-2016
ApacheCon-HBase-2016ApacheCon-HBase-2016
ApacheCon-HBase-2016
 
Highlights Of Sqoop2
Highlights Of Sqoop2Highlights Of Sqoop2
Highlights Of Sqoop2
 
Session 23 - Kafka and Zookeeper
Session 23 - Kafka and ZookeeperSession 23 - Kafka and Zookeeper
Session 23 - Kafka and Zookeeper
 
High Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureHigh Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and Future
 
Gummadi-47-Shadowbase-Technical-Overview.Final
Gummadi-47-Shadowbase-Technical-Overview.FinalGummadi-47-Shadowbase-Technical-Overview.Final
Gummadi-47-Shadowbase-Technical-Overview.Final
 
HBase Read High Availability Using Timeline Consistent Region Replicas
HBase  Read High Availability Using Timeline Consistent Region ReplicasHBase  Read High Availability Using Timeline Consistent Region Replicas
HBase Read High Availability Using Timeline Consistent Region Replicas
 
HBase state of the union
HBase   state of the unionHBase   state of the union
HBase state of the union
 
Operating and supporting HBase Clusters
Operating and supporting HBase ClustersOperating and supporting HBase Clusters
Operating and supporting HBase Clusters
 
Flume with Twitter Integration
Flume with Twitter IntegrationFlume with Twitter Integration
Flume with Twitter Integration
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
Session 09 - Flume
Session 09 - FlumeSession 09 - Flume
Session 09 - Flume
 

Similar to Flume @ Austin HUG 2/17/11

Flumetalk
FlumetalkFlumetalk
Flumetalk
Skills Matter
 
Apache flume by Swapnil Dubey
Apache flume by Swapnil DubeyApache flume by Swapnil Dubey
Apache flume by Swapnil Dubey
Swapnil Dubey
 
Flume and HBase
Flume and HBase Flume and HBase
Flume and HBase
Alexander Alten
 
Introduction to HDFS and MapReduce
Introduction to HDFS and MapReduceIntroduction to HDFS and MapReduce
Introduction to HDFS and MapReduce
Derek Chen
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
DataWorks Summit
 
Apache hadoop: POSH Meetup Palo Alto, CA April 2014
Flume @ Austin HUG 2/17/11

  • 2. Flume: Reliable Distributed Streaming Log Collection. Jonathan Hsieh, Henry Robinson, Patrick Hunt. Cloudera, Inc. Hadoop World 2010, 10/12/2010
  • 3. Flume: 4 months after Hadoop World 2010. Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer, Bruce Mitchener. Cloudera, Inc. Austin Hadoop Users Group 2/17/2011
  • 4. Who Am I?
      • Cloudera:
        – Software Engineer on the Platform Team
        – Flume Project Lead / Designer / Architect
      • U of Washington:
        – “On Leave” from PhD program
        – Research in Systems and Programming Languages
      • Previously:
        – Computer Security, Embedded Systems.
  • 5. The basic scenario
      • You have a bunch of servers generating log files.
      • You figured out that your logs are valuable and you want to keep them and analyze them.
      • Because of the volume of data, you’ve started using Apache Hadoop or Cloudera’s Distribution of Apache Hadoop.
      • … and you’ve got some ad-hoc, hacked-together scripts that copy data from servers to HDFS.
      [Image caption: It’s log, log … Everyone wants a log!]
  • 6. Ad-hockery gets complicated
      • Reliability
        – Will your data still get there … if your scripts fail? … if your hardware fails? … if HDFS goes down? … if EC2 flakes out?
      • Scale
        – As you add servers, will your scripts keep up with 100 GB per day? Will you have tons of small files? Are you going to have tons of connections? Are you willing to suffer more latency to mitigate?
      • Manageability
        – How do you know if the script failed on machine 172? What about logs from that other system? How do you monitor and configure all the servers? Can you deal with elasticity?
      • Extensibility
        – Can you service custom logs? Send data to different places like HBase, Hive, or incremental search indexes? Can you do near-realtime?
      • Blackbox
        – What happens when the guy who wrote it leaves?
  • 7. Cloudera Flume
      Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing.
      Project Principles:
      • Scalability
      • Reliability
      • Extensibility
      • Manageability
      • Openness
  • 8-11. The Standard Use Case [Diagram, built up over four slides: twelve servers, each running an agent (agent tier), send to three collectors (collector tier), which write to HDFS. Slides 9-11 add the Flume label; slides 10-11 add the Flume Master.]
  • 12-13. Flume’s Key Abstractions
      [Diagram: an Agent node and a Collector node, each containing a source and a sink, plus a Master.]
      • Data path and control path
      • Nodes are in the data path
        – Nodes have a source and a sink
        – They can take different roles
      • A typical topology has agent nodes and collector nodes.
      • Optionally it has processor nodes.
      • Masters are in the control path.
        – Centralized point of configuration.
        – Specify sources and sinks
        – Can control flows of data between nodes
        – Use one master or use many with a ZK-backed quorum
  • 14. Can I has the codez?
      node001: tail("/var/log/app/log") | autoE2ESink;
      node002: tail("/var/log/app/log") | autoE2ESink;
      …
      node100: tail("/var/log/app/log") | autoE2ESink;
      collector1: autoCollectorSource | collectorSink("hdfs://logs/app/","applogs")
      collector2: autoCollectorSource | collectorSink("hdfs://logs/app/","applogs")
      collector3: autoCollectorSource | collectorSink("hdfs://logs/app/","applogs")
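The configuration on slide 14 repeats one line per node, so it lends itself to generation. The following is a minimal sketch in Python, not part of Flume: the function names are hypothetical, and the zero-padded `nodeNNN` naming is an assumption read off the three agent lines that the slide shows. The generated lines mirror the slide's punctuation exactly, including the missing semicolon on collector lines.

```python
def agent_line(node_name, log_path="/var/log/app/log"):
    """Return one agent line in the form shown on slide 14."""
    return f'{node_name}: tail("{log_path}") | autoE2ESink;'


def collector_line(node_name, hdfs_dir="hdfs://logs/app/", prefix="applogs"):
    """Return one collector line in the form shown on slide 14."""
    return f'{node_name}: autoCollectorSource | collectorSink("{hdfs_dir}","{prefix}")'


def build_config(num_agents, num_collectors):
    """Build the whole spec: agents first, then collectors.

    Assumption: agents are named node001, node002, ... and collectors
    collector1, collector2, ... as in the slide's visible lines.
    """
    lines = [agent_line(f"node{i:03d}") for i in range(1, num_agents + 1)]
    lines += [collector_line(f"collector{i}") for i in range(1, num_collectors + 1)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_config(100, 3))
```

Running the script prints 103 lines that could be pasted into whichever interface is used to submit configurations to the master.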
  • 15. Outline
      • What is Flume?
      • Scalability
        – Horizontal scalability of all nodes and masters
      • Reliability
        – Fault-tolerance and High availability
      • Extensibility
        – Unix principle, all kinds of data, all kinds of sources, all kinds of sinks
      • Manageability
        – Centralized management supporting dynamic reconfiguration
      • Openness
        – Apache v2.0 License and an active and growing community
  • 16. SCALABILITY
  • 17. The Standard Use Case [Diagram repeated from slide 9: agent tier, collector tier, HDFS.]
  • 18. Data path is horizontally scalable
      [Diagram: servers with agents send to a collector, which writes to HDFS.]
      • Add collectors to increase availability and to handle more data
        – Assumes a single agent will not dominate a collector
        – Fewer connections to HDFS.
        – Larger, more efficient writes to HDFS.
      • Agents have mechanisms for machine resource tradeoffs
      • Write log locally to avoid collector disk IO bottleneck and catastrophic failures
      • Compression and batching (trade CPU for network)
      • Push computation into the event collection pipeline (balance IO, memory, and CPU resource bottlenecks)
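Slide 18 lists compression and batching as a way to trade CPU for network. The sketch below illustrates that tradeoff in Python under stated assumptions: JSON plus zlib stands in for the wire format, which is not Flume's actual encoding, and all function names are hypothetical.

```python
import json
import zlib


def batches(events, batch_size):
    """Yield consecutive slices of at most batch_size events."""
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]


def encode_batch(events):
    """Serialize a list of event strings and compress it as one payload.

    Assumption: JSON + zlib is only a stand-in for a real wire format.
    """
    raw = json.dumps(events).encode("utf-8")
    return zlib.compress(raw)


def decode_batch(payload):
    """Inverse of encode_batch: decompress and parse back into a list."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


if __name__ == "__main__":
    events = [f"GET /index.html 200 {i}" for i in range(1000)]
    unbatched_bytes = sum(len(e.encode("utf-8")) for e in events)
    batched_bytes = sum(len(encode_batch(b)) for b in batches(events, 100))
    print("bytes sent one event at a time:", unbatched_bytes)
    print("bytes sent as compressed batches:", batched_bytes)
```

The agent spends CPU cycles on compression and holds events until a batch fills, so the gain in network bytes comes at the cost of some added latency.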
  • 19. RELIABILITY
  • 20. Tunable failure recovery modes
      [Diagram, repeated for each mode: Agent, Collector, HDFS.]
      • Best effort
        – Fire and forget
      • Store on failure + retry
        – Local acks, local errors detectable
        – Failover when faults detected.
      • End to end reliability
        – End to end acks
        – Data survives compound failures, and may be retried multiple times
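The difference between the first two modes on slide 20 can be shown with a toy simulation. This is a sketch only: `FlakyCollector` and both send functions are hypothetical names, the failure model (the first few deliveries fail) is an assumption, and the end-to-end mode, which needs acknowledgements from the final store, is left out.

```python
class FlakyCollector:
    """Hypothetical collector that rejects its first `failures` deliveries."""

    def __init__(self, failures):
        self.failures = failures
        self.received = []

    def deliver(self, event):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("collector unavailable")
        self.received.append(event)


def send_best_effort(events, collector):
    """Fire and forget: a failed delivery is dropped, never retried."""
    for event in events:
        try:
            collector.deliver(event)
        except ConnectionError:
            pass  # the event is lost


def send_store_on_failure(events, collector, max_rounds=10):
    """Keep failed events locally and retry them in later rounds.

    Returns the events still undelivered after max_rounds (empty on success).
    """
    pending = list(events)
    for _ in range(max_rounds):
        stored = []
        for event in pending:
            try:
                collector.deliver(event)
            except ConnectionError:
                stored.append(event)  # local error detected, keep for retry
        if not stored:
            return []
        pending = stored
    return pending


if __name__ == "__main__":
    lossy = FlakyCollector(failures=2)
    send_best_effort(["a", "b", "c"], lossy)
    print("best effort received:", lossy.received)

    durable = FlakyCollector(failures=2)
    leftover = send_store_on_failure(["a", "b", "c"], durable)
    print("store on failure received:", durable.received, "leftover:", leftover)
```

Note that retried events arrive after events that succeeded on the first attempt, so this mode trades ordering for delivery.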
  • 21. Load balancing
      [Diagram: six agents sending to three collectors.]
      • Agents are logically partitioned and send to different collectors
      • Use randomization to pre-specify failovers when many collectors exist
      • Spread load if a collector goes down.
      • Spread load if new collectors are added to the system.
  • 22. Load balancing and collector failover [Diagram and bullets repeated from slide 21.]
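Slides 21 and 22 mention using randomization to pre-specify failovers. One way to picture this, as a sketch rather than Flume's actual algorithm, is to give each agent its own shuffled ordering of collectors: the first entry is its primary and the rest are failovers. The function names and the seeding scheme below are assumptions.

```python
import random


def failover_chain(agent_name, collectors, seed=0):
    """Return a per-agent ordering of collectors: primary first, then failovers.

    Assumption: seeding the generator with the agent name makes the chain
    reproducible, so it can be computed once and handed to the agent.
    """
    rng = random.Random(f"{seed}:{agent_name}")
    chain = list(collectors)
    rng.shuffle(chain)
    return chain


def pick_collector(chain, is_up):
    """Return the first collector in the chain that is_up() reports as live."""
    for collector in chain:
        if is_up(collector):
            return collector
    return None


if __name__ == "__main__":
    collectors = ["collector1", "collector2", "collector3"]
    for agent in ["node001", "node002", "node003", "node004"]:
        print(agent, failover_chain(agent, collectors))
```

Because each agent's chain is shuffled independently, the agents of a failed collector tend to scatter across the remaining collectors instead of all moving to the same one.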
  • 23-25. Control plane is horizontally scalable
      [Diagram: three nodes, three masters, and three ZooKeeper servers (ZK1, ZK2, ZK3).]
      • A master controls dynamic configurations of nodes
        – Uses consensus protocol to keep state consistent
        – Scales well for configuration reads
        – Allows for adaptive repartitioning in the future
      • Nodes can talk to any master.
      • Masters can talk to an existing ZK ensemble
  • 26. MANAGEABILITY [Image caption: Wheeeeee!]
  • 27. Centralized Dataflow Management Interfaces
      • One place to specify node sources, sinks and data flows.
      • Basic Web interface
      • Flume Shell
        – Command line interface
        – Scriptable
      • Cloudera Enterprise
        – Flume Monitor App
        – Graphical web interface
  • 28. Configuring Flume
      [Diagram: tail, then filter, then fan out to console and to roll, which feeds hdfs.]
      Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;
      • A concise and precise configuration language for specifying dataflows in a node.
      • Dynamic updates of configurations
        – Allows for live failover changes
        – Allows for handling newly provisioned machines
        – Allows for changing analytics
  • 29. Output bucketing
      [Diagram: collectors writing to HDFS under these paths:]
      /logs/web/2010/0715/1200/data-xxx.txt
      /logs/web/2010/0715/1200/data-xxy.txt
      /logs/web/2010/0715/1300/data-xxx.txt
      /logs/web/2010/0715/1300/data-xxy.txt
      /logs/web/2010/0715/1400/data-xxx.txt
      …
      Collector node: collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
      • Automatic output file management
        – Write HDFS files into paths based on time tags
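The bucketing on slide 29 maps an event's timestamp onto a directory path through the `%Y`, `%m`, `%d`, and `%H` escapes. A minimal sketch of that expansion in Python follows. It relies on Python's `strftime` happening to use the same four codes, which is an assumption about equivalence and not a description of how Flume expands them; the function names are hypothetical, and the `xxx` suffix is the slide's own placeholder.

```python
from datetime import datetime


def bucket_path(template, event_time):
    """Expand the time escapes in a path template for one event timestamp.

    Assumption: only %Y, %m, %d and %H appear, as on slide 29.
    """
    return event_time.strftime(template)


def output_file(template, prefix, event_time, suffix):
    """Build a full file path in the shape shown on the slide."""
    return f"{bucket_path(template, event_time)}/{prefix}-{suffix}.txt"


if __name__ == "__main__":
    template = "/logs/web/%Y/%m%d/%H00"
    for hour in (12, 13, 14):
        when = datetime(2010, 7, 15, hour, 34, 56)
        print(output_file(template, "data", when, "xxx"))
```

Events whose timestamps fall in the same hour expand to the same directory, which is what groups the files into hourly buckets.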
  • 30. EXTENSIBILITY
  • 31. Flume is easy to extend
      [Diagram: a source, decorators (deco), a fan out, and sinks.]
      • Simple source and sink APIs
        – An event streaming design
        – Many simple operations compose for complex behavior
      • Plug-in architecture so you can add your own sources, sinks, and decorators
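The composition idea on slide 31 (sinks, decorators that wrap sinks, and fan-out) can be sketched with a few small classes. This is an illustration in Python, not Flume's Java API: every class name and the single `append` method are assumptions chosen for the example.

```python
class ListSink:
    """Terminal sink that simply remembers every event it is given."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)


class EveryNthDecorator:
    """Sampling decorator: forwards every n-th event to the wrapped sink."""

    def __init__(self, n, sink):
        self.n = n
        self.sink = sink
        self.count = 0

    def append(self, event):
        self.count += 1
        if self.count % self.n == 0:
            self.sink.append(event)


class FanOutSink:
    """Sends each event to all of its child sinks."""

    def __init__(self, *sinks):
        self.sinks = sinks

    def append(self, event):
        for sink in self.sinks:
            sink.append(event)


def run(source, sink):
    """Drain an iterable source into a sink, one event at a time."""
    for event in source:
        sink.append(event)


if __name__ == "__main__":
    everything, sampled = ListSink(), ListSink()
    pipeline = FanOutSink(everything, EveryNthDecorator(2, sampled))
    run(["e1", "e2", "e3", "e4"], pipeline)
    print(everything.events, sampled.events)
```

Because a decorator exposes the same `append` method as the sink it wraps, decorators and fan-outs can be nested freely, which is the sense in which simple operations compose.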
  • 32. Variety of Connectors • Sources produce data – Console, Exec, Syslog, Scribe, IRC, Twitter – In the works: JMS, AMQP, pubsubhubbub/RSS/Atom • Sinks consume data – Console, Local files, HDFS, S3 – Contributed: Hive (Mozilla), HBase (Sematext), Cassandra (Riptano/DataStax), Voldemort, Elastic Search – In the works: JMS, AMQP • Decorators modify data sent to sinks – Wire batching, compression, sampling, projection, extraction, throughput throttling – Custom near real-time processing (Meebo) – JRuby event modifiers (InfoChimps) – Cryptographic extensions (Rearden)
  • 33. Multi Datacenter [Diagram: agents on API servers (api) and processor servers (proc), a Collector tier, and HDFS]
  • 34. Multi Datacenter [Diagram: agents on API servers (api) and processor servers (proc), a Collector tier, a Relay, and HDFS]
  • 35. Near Realtime Aggregator [Diagram: Flume agents on ad servers (Ad svr), a Tracker, a Collector, HDFS, a DB for quick reports, and a Hive job to verify reports]
  • 36. An enterprise story [Diagram: Flume agents on API servers (Windows and Linux), a Collector tier, Kerberos, HDFS, DD nodes, and Active Directory / LDAP]
  • 37. An emerging community story [Diagram: Flume agents, a Collector, and a Fanout to HDFS (Hive query, Pig query), HBase (key lookup, range query), and an incremental search index (search query, faceted query)]
  • 38. OPENNESS AND COMMUNITY
  • 39. Flume is Open Source • Apache v2.0 Open Source License – Independent from Apache Software Foundation • GitHub source code repository – http://github.com/cloudera/flume – Regular tarball update versions every 2-3 months. – Regular CDH packaging updates every 3-4 months. • Review Board for code review • New external committers wanted! – Cloudera folks: Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer – Independent folks: Bruce Mitchener
  • 40. Growing user and developer community • History: – Initial Open Source Release, June 2010 • Growth: – Pre-Hadoop Summit (Late June 2010): • 4 followers, 4 forks (original authors) – Pre-Hadoop World (October 2010): • 174 followers, 34 forks – Pre-CDH3B4 Release (February 2011): • 288 followers, 51 forks
  • 41. Support • Community-based mailing lists for support – “an answer in a few days” – User: https://groups.google.com/a/cloudera.org/group/flume-user – Dev: https://groups.google.com/a/cloudera.org/group/flume-dev • Community-based IRC chat room – “quick questions, quick answers” – #flume in irc.freenode.net • Commercial support with Cloudera Enterprise subscription – Chat with sales@cloudera.com
  • 42. CONCLUSIONS
  • 43. Summary • Flume is a distributed, reliable, scalable, extensible system for collecting and delivering high-volume continuous event data such as logs. – It is centrally managed, which allows for automated and adaptive configurations. – This design allows for near-real-time processing. – Apache v2.0 License with an active and growing community • Part of Cloudera’s Distribution for Hadoop, about to be refreshed for CDH3b4.
  • 44. Questions? (and shameless plugs) • Contact info: – jon@cloudera.com – Twitter @jmhsieh • Cloudera Training in Dallas – Hadoop Training for Developers - March 14-16 – Hadoop Training for Administrators - March 17-18 – Sign up at http://cloudera.eventbrite.com – 10% discount code for classes: "hug" • Cloudera is Hiring!