Introduction to Flume and Flive

              July 11, 2012
               Willis Gong
       Big Data Engineering Team
              Hanborq Inc.
Topic
• Flume
  – Definition of the solution
  – Characteristics
  – Core concepts
• Flive
  – Concepts
  – Improvements



The real world problem
• Changing requirements → Extensibility & Manageability
   – In the source
   – In the path
   – In the sink
• Growing scale → Scalability
   – Volume/nodes keep increasing
• Error prone → Reliability
   – Network failure
   – Service breakdown
Flume: the solution to these problems

• Flume is:
   – A distributed data collection system
   – A streamlined event processing pipeline
   – An extensible distributed computation
     framework
• Flume answers the previous challenges
   – Easily extends to new data formats
   – Easily adapts to new collection strategies
   – Scales linearly as new nodes are added
   – Multiple levels of reliability
   – Configurable from shell / web
   –   Etc.
Core Concepts: Flow and Event
•   Everything is an event: body + metadata table
•   A flow is an event pipeline from a particular data source
•   Flows are comprised of nodes chained together
•   Many flows may overlap a physical cluster
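The event and flow concepts above can be sketched in a few lines of Python. This is an illustration only, not Flume's actual Java API: the names `Event`, `Stage`, `run_flow` and `tag_host`, and the host value `web01`, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Event:
    """An event: an opaque body plus a metadata table (hypothetical field names)."""
    body: bytes
    meta: Dict[str, str] = field(default_factory=dict)


# A stage in the flow is modelled here as a function from event to event.
Stage = Callable[[Event], Event]


def run_flow(source: Iterable[Event], stages: List[Stage]) -> List[Event]:
    """Push every event from one data source through the chained stages."""
    out = []
    for event in source:
        for stage in stages:
            event = stage(event)
        out.append(event)
    return out


def tag_host(event: Event) -> Event:
    # Annotate the metadata table without touching the body.
    return Event(event.body, {**event.meta, "host": "web01"})


events = run_flow([Event(b"GET /index.html")], [tag_host])
```

Keeping the body opaque and putting routing hints in the metadata table is what lets later stages act on an event without parsing it again.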
Core Concepts: Nodes and Plane
• Data plane:
   – Path of data flow
   – Composed of one or more nodes in a tiered
     architecture
      • Two-tier: Agent → Collector
      • Multi-tier: Agent → Processor → Collector
• Nodes:
   – Nodes have a source and a sink
   – Their roles depend on their position in data path
• Masters are in the control plane
   – Central control point
   – Lightweight, since no data plane processing is involved
Core Concepts: Agent and Collector

• Data plane nodes
  – Agent
     • Receives data from an application
  – Processor (optional)
     • Intermediate processing
  – Collector
     • Writes data to permanent storage
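A minimal sketch of the two-tier data plane, assuming each node is simply a source paired with a sink. The `Node` class and the in-memory lists standing in for the network link and for permanent storage are hypothetical and only illustrate how the role follows from the position in the path.

```python
from typing import Callable, Iterable, List


class Node:
    """A logical node: pulls events from its source and pushes them to its sink."""

    def __init__(self, name: str,
                 source: Callable[[], Iterable[str]],
                 sink: Callable[[str], None]) -> None:
        self.name = name
        self.source = source
        self.sink = sink

    def run(self) -> None:
        for event in self.source():
            self.sink(event)


link: List[str] = []      # stands in for the network between the tiers
storage: List[str] = []   # stands in for permanent storage such as HDFS

# The same class acts as agent or collector depending on what it is wired to.
agent = Node("agent", source=lambda: ["line 1", "line 2"], sink=link.append)
collector = Node("collector", source=lambda: list(link), sink=storage.append)

agent.run()       # tier 1: application data -> link
collector.run()   # tier 2: link -> permanent storage
```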
Deploy Topology
• Deploy considerations
  – Agents: depend on the application data source
  – Collectors: depend on target storage, network topology,
    load balancing, etc.
Considerations on Data Source
• Three integration modes:
  – Push: the agent acts as a data collection service
    for the data source application
  – Pull: the agent polls the data source periodically
  – Embedded: the data source application is the
    agent itself
Data Plane Reliability
• Best effort
   – Fire and forget
• Store on failure + retry
   – Local acks, local errors detectable
   – Failover when faults detected
• End-to-end reliability
   – End to end acks
   – Data survives compound failures
   – At least once
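The difference between the first two levels can be sketched as follows. This is a toy model, not Flume code: `FlakySink`, `best_effort` and `StoreOnFailure` are hypothetical names, an in-memory list stands in for the local log, and end-to-end acknowledgements are not modelled.

```python
from typing import List


class FlakySink:
    """Hypothetical downstream sink that fails for its first `failures` calls."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.received: List[str] = []

    def append(self, event: str) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("downstream unavailable")
        self.received.append(event)


def best_effort(sink: FlakySink, event: str) -> None:
    """Fire and forget: errors are swallowed, so the event may be lost."""
    try:
        sink.append(event)
    except ConnectionError:
        pass


class StoreOnFailure:
    """Keep events locally when a send fails, then retry them later."""

    def __init__(self, sink: FlakySink) -> None:
        self.sink = sink
        self.pending: List[str] = []   # stands in for a local on-disk log

    def append(self, event: str) -> None:
        try:
            self.sink.append(event)
        except ConnectionError:
            self.pending.append(event)

    def retry(self) -> None:
        still_pending = []
        for event in self.pending:
            try:
                self.sink.append(event)
            except ConnectionError:
                still_pending.append(event)
        self.pending = still_pending


lossy = FlakySink(failures=1)
best_effort(lossy, "e1")   # lost
best_effort(lossy, "e2")   # delivered

durable = FlakySink(failures=1)
agent = StoreOnFailure(durable)
agent.append("e1")         # fails, stored locally
agent.append("e2")         # delivered
agent.retry()              # e1 delivered late
```

In this sketch the retried event arrives after later events, so a retry-based level should not be assumed to preserve ordering.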
Control Plane Reliability
• Master design
  – Lightweight process
     • Isolated from data plane processing
  – Lazy design
     • Simply answers a few node requests
• Service availability
  – Watchdog
  – Multiple masters for backup
  – Service availability across reboots
     • Persist configuration data to ZooKeeper
Data Plane Scalability
• Data plane is horizontally scalable
   – Add collectors to increase availability and to handle more data
      • Assumes a single agent will not dominate a collector
      • Fewer connections to HDFS.
      • Larger, more efficient writes to HDFS.
• Agents have mechanisms for machine resource tradeoffs
   – Write log locally to avoid collector disk IO bottleneck and catastrophic
     failures
   – Compression and batching (trade CPU for network)
   – Push computation into the event collection pipeline (balance I/O, memory,
     and CPU resource bottlenecks)
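The "compression and batching" trade-off can be illustrated with the standard `zlib` module. The helper names and the newline-delimited framing are assumptions made for this sketch, not Flume's wire format.

```python
import zlib
from typing import Iterable, Iterator, List


def batches(events: Iterable[bytes], size: int) -> Iterator[List[bytes]]:
    """Group events into lists of at most `size` items."""
    batch: List[bytes] = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def encode_batch(batch: List[bytes]) -> bytes:
    # Assumed framing: newline-delimited bodies, so bodies must not contain b"\n".
    return zlib.compress(b"\n".join(batch))


def decode_batch(payload: bytes) -> List[bytes]:
    return zlib.decompress(payload).split(b"\n")


events = [b"GET /index.html 200"] * 100
payloads = [encode_batch(b) for b in batches(events, size=40)]
```

The agent spends CPU on `zlib.compress` so that fewer, smaller payloads cross the network. Larger batches usually compress better but delay delivery.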
Data Plane Scalability
• Agents are logically partitioned and send to different
  collectors
• Use randomization to pre-specify failovers when many
  collectors exist
  – Spreads load if a collector goes down.
  – Spreads load if new collectors are added to the system.
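One way to picture randomized, pre-specified failover is to give every agent its own shuffled list of collectors. The functions below are a hypothetical sketch of that idea and do not reproduce Flume's actual assignment logic.

```python
import random
from typing import Dict, List, Set


def failover_chains(agents: List[str], collectors: List[str],
                    depth: int, seed: int = 0) -> Dict[str, List[str]]:
    """Assign each agent a random, duplicate-free chain of collectors."""
    rng = random.Random(seed)
    depth = min(depth, len(collectors))
    return {agent: rng.sample(collectors, k=depth) for agent in agents}


def pick_collector(chain: List[str], down: Set[str]) -> str:
    """Return the first collector in the chain that is not known to be down."""
    for collector in chain:
        if collector not in down:
            return collector
    raise RuntimeError("no collector available")


chains = failover_chains(["a1", "a2", "a3"], ["c1", "c2", "c3", "c4"], depth=2)
primary = chains["a1"][0]
fallback = pick_collector(chains["a1"], down={primary})
```

Because each agent has a different chain, the agents of a failed collector scatter across the remaining collectors instead of all moving to the same one.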
Control Plane Scalability
• A master controls dynamic configurations of nodes
  – Uses a gossip protocol to keep state consistent
  – Scales well for configuration reads
  – Allows for adaptive repartitioning in the future
  – Nodes can talk to any master.
Extensibility
• Extensibility answers changing use cases
   – Invent new connectors
      • Simple source/sink/decorator APIs
      • Plug-in architecture
   – Dynamically wired pipeline processing logic
      • Many simple operations compose into complex behavior
• Connector
   – Sources produce data: plain text files, directory, Log4j, FTP, SQL, …
   – Sinks consume data: console, HDFS, local file system
   – Decorators modify data sent to sinks
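A decorator can be pictured as a sink that wraps another sink. The sketch below is not Flume's Java connector API: `Event`, `Sink` and `regex_decorator` are hypothetical names, and a plain list plays the role of the final sink.

```python
import re
from typing import Callable, Dict, List, Tuple

Event = Tuple[bytes, Dict[str, str]]   # (body, metadata table)
Sink = Callable[[Event], None]


def regex_decorator(pattern: str, attr: str, inner: Sink) -> Sink:
    """Wrap `inner` so matching events carry the matched text under `attr`."""
    compiled = re.compile(pattern.encode())

    def sink(event: Event) -> None:
        body, meta = event
        match = compiled.search(body)
        if match:
            meta = {**meta, attr: match.group(0).decode()}
        inner((body, meta))   # always forward, tagged or not

    return sink


collected: List[Event] = []
pipeline = regex_decorator(r"Firefox|Internet Explorer", "browser", collected.append)
pipeline((b"Mozilla/5.0 Firefox/12.0", {}))
pipeline((b"curl/7.24.0", {}))
```

Because a decorated sink is still a sink, decorators can be stacked, which is how many simple operations compose into complex behavior.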
Extensibility
• Example
Manageability
• Near-natural language for node configuration
   – web-log-agent : tail("/var/log/httpd.log") | agentBESink
   – web-log-collector : autoCollectorSource
     | { regex("(Firefox|Internet Explorer)", "browser") =>
     collectorSink("hdfs://namenode/flume-logs/%{browser}") }
• One place to specify node sources, sinks and data flows
   – Basic web interface
   – Flume Shell: command-line interface
   – Extended custom management through the master RPC API
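The shape of a node specification, name : source | sink, can be made explicit with a small splitter. This is a deliberately naive sketch that assumes one source and one sink expression per line. `NodeSpec` and `parse_node_spec` are hypothetical, and this is not Flume's real configuration parser.

```python
from typing import NamedTuple


class NodeSpec(NamedTuple):
    name: str
    source: str
    sink: str


def parse_node_spec(line: str) -> NodeSpec:
    """Split 'name : source | sink' at the first ':' and the first '|'."""
    name, _, flow = line.partition(":")
    source, _, sink = flow.partition("|")
    return NodeSpec(name.strip(), source.strip(), sink.strip())


agent_spec = parse_node_spec(
    'web-log-agent : tail("/var/log/httpd.log") | agentBESink'
)
collector_spec = parse_node_spec(
    'web-log-collector : autoCollectorSource | '
    '{ regex("(Firefox|Internet Explorer)", "browser") => '
    'collectorSink("hdfs://namenode/flume-logs/%{browser}") }'
)
```

Splitting only at the first separators keeps the `|` inside the regex and the `:` inside the HDFS URL intact.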
Flive – HANBORQ Enhanced Flume
• Based on Flume but with HANBORQ product ecosystem
  orientation
• The new HTLoad
• Enhancements:
  – Performance
  – Functionality
  – Manageability
  – Hugetable integration
• Compatible with original Flume usage

Flive – More Than Flume
• Efficiency improvement
  – Driving the pipeline
     • The native driver is a single thread doing both source-pulling and sink-pushing
        – Short-term rate mismatches between source and sink affect each other
     • Flive uses two threads, one source-pulling and one sink-pushing,
       coupled by an internal event queue
        – Short-term rate variations in source and sink are absorbed by the queue
        – Contributes a 10% to 30% throughput improvement
  – Introduced node concurrency to maximize target storage
    bandwidth
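The two-thread driver can be sketched with Python's standard `queue` and `threading` modules. The function `drive`, the sentinel `_DONE` and the default capacity are hypothetical. The real Flive driver is part of a Java code base and is not shown in these slides.

```python
import queue
import threading
from typing import Callable, Iterable, List

_DONE = object()   # sentinel telling the pusher that the source is exhausted


def drive(source: Iterable[str], sink: Callable[[str], None],
          capacity: int = 1000) -> None:
    """Run source-pulling and sink-pushing on separate threads."""
    events: queue.Queue = queue.Queue(maxsize=capacity)

    def pull() -> None:
        for event in source:
            events.put(event)   # blocks only while the queue is full
        events.put(_DONE)

    def push() -> None:
        while True:
            item = events.get()   # blocks only while the queue is empty
            if item is _DONE:
                break
            sink(item)

    threads = [threading.Thread(target=pull), threading.Thread(target=push)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


received: List[str] = []
drive((f"event-{i}" for i in range(5)), received.append, capacity=2)
```

The bounded queue lets a briefly slow sink or a briefly bursty source proceed without stalling the other side until the queue is full or empty.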
Flive – More Than Flume
• Functionality enhancement
  – The native Flume connector configuration syntax is flat
     • But connectors are essentially hierarchical
     • The flat syntax limits connectors to flat assembly
     • A connector hierarchy must be assembled through hard-coding or ad hoc syntax
  – Flive introduces a hierarchical syntax
     • A hierarchical connector architecture can be wired dynamically
     • For backward compatibility, only Flive connectors support the enhanced
       syntax
Flive – More Than Flume
• Ease of use
  – Zero-configure plug-in architecture
     • Native Flume requires manual configuration of plugins
     • Flive requires no configuration beyond minimal conventions
  – Simpler yet powerful Flive shell
  – Introduced the translator framework
     • Node configuration specs may be too complicated to edit manually
     • A translator converts a user-domain spec into a Flive/Flume configuration
       spec
     • Extensible
         – Hugetable translator for Hugetable
         – Basic translator for native Flume (full Flume compatibility)
  – Ease of deployment and management
Flive – More Than Flume
• As a Hugetable ETL
   – Sourcing structured data from various sources
      • FS, FTP, SQL, LOG4J, …
   – Targeting all Hugetable storage engines
      • Text File, Sequence File, RCFile, HFile, HBase, …
   – Filtering unwanted/malformed records
   – Column transfer over the air
      • IUD-like single-stream column operations, based on function expressions
      • Multi-stream operations: pre-join on the fly
   – Multi table loading
      • Like fan-out but less overhead
   – Real time aggregation
      • Accurate computation: sum(x), count(*)
      • Probabilistic computation: count(distinct x), top(k), etc.
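The split between accurate and probabilistic aggregation can be illustrated as follows. The slides do not say which estimator Flive uses. Linear counting is swapped in here purely as an example, and `RunningStats` and `LinearCounter` are hypothetical names.

```python
import hashlib
import math


class RunningStats:
    """Accurate streaming aggregates: count(*) and sum(x)."""

    def __init__(self) -> None:
        self.count = 0
        self.total = 0

    def add(self, x: int) -> None:
        self.count += 1
        self.total += x


class LinearCounter:
    """Approximate count(distinct x) using a fixed-size bitmap (linear counting)."""

    def __init__(self, bits: int = 1024) -> None:
        self.bits = bits
        self.bitmap = bytearray(bits)   # one byte per bit, for clarity

    def add(self, value: str) -> None:
        digest = hashlib.sha1(value.encode()).digest()
        index = int.from_bytes(digest[:8], "big") % self.bits
        self.bitmap[index] = 1

    def estimate(self) -> float:
        zeros = self.bitmap.count(0)
        if zeros == 0:
            return float("inf")   # bitmap saturated, estimate undefined
        return -self.bits * math.log(zeros / self.bits)


stats = RunningStats()
distinct_users = LinearCounter()
for i in range(1000):
    stats.add(i)
    distinct_users.add(f"user-{i % 100}")   # only 100 distinct users
```

The exact aggregates need constant memory, while the distinct-count estimate trades a small error for a bitmap whose size does not grow with the stream.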
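Runtime Flive
• [Figure: Flive runtime data path]
   – Agent: DataSource, Tailer, Flume Driver (A-puller, A-pusher), queues Q1 and Q2
   – Network link between Agent and Collector
   – Collector: T-server, Flume Driver (C-puller, C-pusher), Decoder (multi-threaded decoding), Driver, Appender (multi-threaded append), queues Q3 to Q7
   – Targets: HBase, HDFS, others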
Thank you!

More Related Content

What's hot

Serval: Software Defined Service-­Centric Networking
Serval: Software Defined Service-­Centric NetworkingServal: Software Defined Service-­Centric Networking
Serval: Software Defined Service-­Centric Networking
Open Networking Summits
 
Quality of service
Quality of serviceQuality of service
Quality of service
Ismail Mukiibi
 
Brkrst 3123 previdi-final
Brkrst 3123 previdi-finalBrkrst 3123 previdi-final
Brkrst 3123 previdi-final
Stefano Previdi
 
BGP Advanced topics
BGP Advanced topicsBGP Advanced topics
BGP Advanced topics
Olivier Bonaventure
 
10 fn s40
10 fn s4010 fn s40
10 fn s40
Scott Foster
 
1 bonica tutorial_segment_routing
1 bonica tutorial_segment_routing1 bonica tutorial_segment_routing
1 bonica tutorial_segment_routing
hptoga
 
Flume
FlumeFlume
IPv6 Segment Routing : an end-to-end solution ?
IPv6 Segment Routing : an end-to-end solution ?IPv6 Segment Routing : an end-to-end solution ?
IPv6 Segment Routing : an end-to-end solution ?
Olivier Bonaventure
 
Apache flume - an Introduction
Apache flume - an IntroductionApache flume - an Introduction
Apache flume - an Introduction
Erik Schmiegelow
 
Apache flume - Twitter Streaming
Apache flume - Twitter Streaming Apache flume - Twitter Streaming
Apache flume - Twitter Streaming
Kowndinya Mannepalli
 
IPv6 Entreprise Multihoming
IPv6 Entreprise MultihomingIPv6 Entreprise Multihoming
IPv6 Entreprise Multihoming
Olivier Bonaventure
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
Arinto Murdopo
 
Apache flume
Apache flumeApache flume
Apache flume
Ramakrishna kapa
 
Enabling IPv6 Services Transparently
Enabling IPv6 Services TransparentlyEnabling IPv6 Services Transparently
Enabling IPv6 Services Transparently
Carlos Martinez Cagnazzo
 
Ingest oct-9-update
Ingest oct-9-updateIngest oct-9-update
Ingest oct-9-update
Rufael Mekuria
 
Mpls Qos Jayk
Mpls Qos JaykMpls Qos Jayk
Mpls Qos Jayk
Suraj Kumar
 
CMAF Live Ingest Uplink Protocol
CMAF Live Ingest Uplink ProtocolCMAF Live Ingest Uplink Protocol
CMAF Live Ingest Uplink Protocol
Rufael Mekuria
 
RPL - Routing Protocol for Low Power and Lossy Networks
RPL - Routing Protocol for Low Power and Lossy NetworksRPL - Routing Protocol for Low Power and Lossy Networks
RPL - Routing Protocol for Low Power and Lossy Networks
Pradeep Kumar TS
 
Fpga implemented ahb protocol
Fpga implemented ahb protocolFpga implemented ahb protocol
Fpga implemented ahb protocol
iaemedu
 
ETE405-lec9.pdf
ETE405-lec9.pdfETE405-lec9.pdf
ETE405-lec9.pdf
mashiur
 

What's hot (20)

Serval: Software Defined Service-­Centric Networking
Serval: Software Defined Service-­Centric NetworkingServal: Software Defined Service-­Centric Networking
Serval: Software Defined Service-­Centric Networking
 
Quality of service
Quality of serviceQuality of service
Quality of service
 
Brkrst 3123 previdi-final
Brkrst 3123 previdi-finalBrkrst 3123 previdi-final
Brkrst 3123 previdi-final
 
BGP Advanced topics
BGP Advanced topicsBGP Advanced topics
BGP Advanced topics
 
10 fn s40
10 fn s4010 fn s40
10 fn s40
 
1 bonica tutorial_segment_routing
1 bonica tutorial_segment_routing1 bonica tutorial_segment_routing
1 bonica tutorial_segment_routing
 
Flume
FlumeFlume
Flume
 
IPv6 Segment Routing : an end-to-end solution ?
IPv6 Segment Routing : an end-to-end solution ?IPv6 Segment Routing : an end-to-end solution ?
IPv6 Segment Routing : an end-to-end solution ?
 
Apache flume - an Introduction
Apache flume - an IntroductionApache flume - an Introduction
Apache flume - an Introduction
 
Apache flume - Twitter Streaming
Apache flume - Twitter Streaming Apache flume - Twitter Streaming
Apache flume - Twitter Streaming
 
IPv6 Entreprise Multihoming
IPv6 Entreprise MultihomingIPv6 Entreprise Multihoming
IPv6 Entreprise Multihoming
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Apache flume
Apache flumeApache flume
Apache flume
 
Enabling IPv6 Services Transparently
Enabling IPv6 Services TransparentlyEnabling IPv6 Services Transparently
Enabling IPv6 Services Transparently
 
Ingest oct-9-update
Ingest oct-9-updateIngest oct-9-update
Ingest oct-9-update
 
Mpls Qos Jayk
Mpls Qos JaykMpls Qos Jayk
Mpls Qos Jayk
 
CMAF Live Ingest Uplink Protocol
CMAF Live Ingest Uplink ProtocolCMAF Live Ingest Uplink Protocol
CMAF Live Ingest Uplink Protocol
 
RPL - Routing Protocol for Low Power and Lossy Networks
RPL - Routing Protocol for Low Power and Lossy NetworksRPL - Routing Protocol for Low Power and Lossy Networks
RPL - Routing Protocol for Low Power and Lossy Networks
 
Fpga implemented ahb protocol
Fpga implemented ahb protocolFpga implemented ahb protocol
Fpga implemented ahb protocol
 
ETE405-lec9.pdf
ETE405-lec9.pdfETE405-lec9.pdf
ETE405-lec9.pdf
 

Similar to Flume and Flive Introduction

Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and Sqoop
Jeyamariappan Guru
 
Apache flume by Swapnil Dubey
Apache flume by Swapnil DubeyApache flume by Swapnil Dubey
Apache flume by Swapnil Dubey
Swapnil Dubey
 
Inside Flume
Inside FlumeInside Flume
Inside Flume
Cloudera, Inc.
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
DataWorks Summit
 
Apache Thrift, a brief introduction
Apache Thrift, a brief introductionApache Thrift, a brief introduction
Apache Thrift, a brief introduction
Randy Abernethy
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
Rajesh Balamohan
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
Jianfeng Zhang
 
OpenPOWER Webinar
OpenPOWER Webinar OpenPOWER Webinar
OpenPOWER Webinar
Ganesan Narayanasamy
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
DataWorks Summit
 
Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
Hari Shreedharan
 
Go with the Flow
Go with the Flow Go with the Flow
Go with the Flow-v2
Go with the Flow-v2Go with the Flow-v2
Go with the Flow-v2
Zobair Khan
 
Music city data Hail Hydrate! from stream to lake
Music city data Hail Hydrate! from stream to lakeMusic city data Hail Hydrate! from stream to lake
Music city data Hail Hydrate! from stream to lake
Timothy Spann
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast Data
Naveen Korakoppa
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
Data Con LA
 
Hic 2011 realtime_analytics_at_facebook
Hic 2011 realtime_analytics_at_facebookHic 2011 realtime_analytics_at_facebook
Hic 2011 realtime_analytics_at_facebook
baggioss
 
FlowER Erlang Openflow Controller
FlowER Erlang Openflow ControllerFlowER Erlang Openflow Controller
FlowER Erlang Openflow Controller
Holger Winkelmann
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
Shravan (Sean) Pabba
 
LAB - Perforce Large Scale & Multi-Site Implementations
LAB - Perforce Large Scale & Multi-Site ImplementationsLAB - Perforce Large Scale & Multi-Site Implementations
LAB - Perforce Large Scale & Multi-Site Implementations
Perforce
 
PLB
PLBPLB

Similar to Flume and Flive Introduction (20)

Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and Sqoop
 
Apache flume by Swapnil Dubey
Apache flume by Swapnil DubeyApache flume by Swapnil Dubey
Apache flume by Swapnil Dubey
 
Inside Flume
Inside FlumeInside Flume
Inside Flume
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
Apache Thrift, a brief introduction
Apache Thrift, a brief introductionApache Thrift, a brief introduction
Apache Thrift, a brief introduction
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
OpenPOWER Webinar
OpenPOWER Webinar OpenPOWER Webinar
OpenPOWER Webinar
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
 
Go with the Flow
Go with the Flow Go with the Flow
Go with the Flow
 
Go with the Flow-v2
Go with the Flow-v2Go with the Flow-v2
Go with the Flow-v2
 
Music city data Hail Hydrate! from stream to lake
Music city data Hail Hydrate! from stream to lakeMusic city data Hail Hydrate! from stream to lake
Music city data Hail Hydrate! from stream to lake
 
Apache frameworks for Big and Fast Data
Apache frameworks for Big and Fast DataApache frameworks for Big and Fast Data
Apache frameworks for Big and Fast Data
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
Hic 2011 realtime_analytics_at_facebook
Hic 2011 realtime_analytics_at_facebookHic 2011 realtime_analytics_at_facebook
Hic 2011 realtime_analytics_at_facebook
 
FlowER Erlang Openflow Controller
FlowER Erlang Openflow ControllerFlowER Erlang Openflow Controller
FlowER Erlang Openflow Controller
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
LAB - Perforce Large Scale & Multi-Site Implementations
LAB - Perforce Large Scale & Multi-Site ImplementationsLAB - Perforce Large Scale & Multi-Site Implementations
LAB - Perforce Large Scale & Multi-Site Implementations
 
PLB
PLBPLB
PLB
 

More from Hanborq Inc.

Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
Hanborq Inc.
 
Hadoop HDFS NameNode HA
Hadoop HDFS NameNode HAHadoop HDFS NameNode HA
Hadoop HDFS NameNode HA
Hanborq Inc.
 
Hadoop大数据实践经验
Hadoop大数据实践经验Hadoop大数据实践经验
Hadoop大数据实践经验
Hanborq Inc.
 
FlumeBase Study
FlumeBase StudyFlumeBase Study
FlumeBase Study
Hanborq Inc.
 
Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and Pipes
Hanborq Inc.
 
HBase Introduction
HBase IntroductionHBase Introduction
HBase Introduction
Hanborq Inc.
 
Hadoop Versioning
Hadoop VersioningHadoop Versioning
Hadoop Versioning
Hanborq Inc.
 
Hadoop MapReduce Task Scheduler Introduction
Hadoop MapReduce Task Scheduler IntroductionHadoop MapReduce Task Scheduler Introduction
Hadoop MapReduce Task Scheduler Introduction
Hanborq Inc.
 
Hadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep InsightHadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep Insight
Hanborq Inc.
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
Hanborq Inc.
 
How to Build Cloud Storage Service Systems
How to Build Cloud Storage Service SystemsHow to Build Cloud Storage Service Systems
How to Build Cloud Storage Service Systems
Hanborq Inc.
 
Hanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduceHanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduce
Hanborq Inc.
 

More from Hanborq Inc. (12)

Introduction to Cassandra
Introduction to CassandraIntroduction to Cassandra
Introduction to Cassandra
 
Hadoop HDFS NameNode HA
Hadoop HDFS NameNode HAHadoop HDFS NameNode HA
Hadoop HDFS NameNode HA
 
Hadoop大数据实践经验
Hadoop大数据实践经验Hadoop大数据实践经验
Hadoop大数据实践经验
 
FlumeBase Study
FlumeBase StudyFlumeBase Study
FlumeBase Study
 
Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and Pipes
 
HBase Introduction
HBase IntroductionHBase Introduction
HBase Introduction
 
Hadoop Versioning
Hadoop VersioningHadoop Versioning
Hadoop Versioning
 
Hadoop MapReduce Task Scheduler Introduction
Hadoop MapReduce Task Scheduler IntroductionHadoop MapReduce Task Scheduler Introduction
Hadoop MapReduce Task Scheduler Introduction
 
Hadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep InsightHadoop MapReduce Introduction and Deep Insight
Hadoop MapReduce Introduction and Deep Insight
 
Hadoop HDFS Detailed Introduction
Hadoop HDFS Detailed IntroductionHadoop HDFS Detailed Introduction
Hadoop HDFS Detailed Introduction
 
How to Build Cloud Storage Service Systems
How to Build Cloud Storage Service SystemsHow to Build Cloud Storage Service Systems
How to Build Cloud Storage Service Systems
 
Hanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduceHanborq Optimizations on Hadoop MapReduce
Hanborq Optimizations on Hadoop MapReduce
 

Recently uploaded

Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 

Recently uploaded (20)

Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...

Flume and Flive Introduction

  • 1. Introduction to Flume and Flive July 11, 2012 Willis Gong Big Data Engineering Team Hanborq Inc.
  • 2. Topics • Flume – Definition of the solution – Characteristics – Core concepts • Flive – Concepts – Improvements
  • 3. The real world problem • Changing requirements → Extensibility & Manageability – In the source – In the path – In the sink • Growing scale → Scalability – Volume/nodes keep increasing • Error prone → Reliability – Network failure – Service breakdown
  • 4. Flume: the solution to these problems • Flume is: – A distributed data collection system – A streamlined event processing pipeline – An extensible distributed computation framework • Flume answers the previous challenges – Easily extends to new data formats – Easily adapts to new collecting strategies – Scales linearly as new nodes are added – Multiple levels of reliability – Configurable from shell / web – Etc.
  • 5. Core Concepts: Flow and Event • Everything is an event – body + meta table • A flow is an event pipeline from a particular data source • Flows are composed of nodes chained together • Many flows may overlap on a physical cluster
  • 6. Core Concepts: Nodes and Planes • Data plane: – Path of data flow – Composed of one or more nodes in a tiered architecture • Two-tier: Agent → Collector • Multi-tier: Agent → Processor → Collector • Nodes: – Nodes have a source and a sink – Their roles depend on their position in the data path • Masters are in the control plane – Central control point – Lightweight since no data plane processing is involved
  • 7. Core Concepts: Agent and Collector • Data plane nodes – Agent • Receives data from an application – Processor (optional) • Intermediate processing – Collector • Writes data to permanent storage
  • 8. Deploy Topology • Deploy considerations – Agents: depend on the application data source – Collectors: depend on target storage, network topology, load balance, etc.
  • 9. Considerations on Data Source • Three integration modes: – Push: the agent acts as a data collecting service for the data source application – Pull: the agent polls the data source periodically – Embedded: the data source application is the agent itself
  • 10. Data Plane Reliability • Best effort – Fire and forget • Store on failure + retry – Local acks, local errors detectable – Failover when faults detected • End-to-end reliability – End to end acks – Data survives compound failures – At least once
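The "store on failure + retry" level on slide 10 can be pictured with a small sketch. This is not Flume's implementation; the class names `FlakySink` and `StoreOnFailureSink` are hypothetical, and an in-memory deque stands in for whatever local storage a real agent would use.

```python
from collections import deque


class FlakySink:
    """Hypothetical downstream sink that fails for the first `failures` sends."""

    def __init__(self, failures):
        self.failures = failures
        self.received = []

    def send(self, event):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("collector unreachable")
        self.received.append(event)


class StoreOnFailureSink:
    """Keeps undelivered events locally and retries them on later appends."""

    def __init__(self, downstream):
        self.downstream = downstream
        self.pending = deque()  # assumption: stand-in for a local on-disk log

    def append(self, event):
        self.pending.append(event)
        self.flush()

    def flush(self):
        # Deliver in order; stop at the first failure and keep the rest stored.
        while self.pending:
            try:
                self.downstream.send(self.pending[0])
            except ConnectionError:
                return False
            self.pending.popleft()
        return True
```

An event is removed from the local store only after the downstream call returns without error, which is the "local ack" idea. End-to-end reliability would additionally need an acknowledgement from the final storage tier, which this sketch does not model.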
  • 11. Control Plane Reliability • Master design – Lightweight process • Isolated from data plane processing – Lazy design • Simply answers a few node requests • Service availability – Watchdog – Multiple masters for backup – Service availability across reboots • Persist configuration data to ZooKeeper
  • 12. Data Plane Scalability • Data plane is horizontally scalable – Add collectors to increase availability and to handle more data • Assumes a single agent will not dominate a collector • Fewer connections to HDFS. • Larger more efficient writes to HDFS. • Agents have mechanisms for machine resource tradeoffs – Write log locally to avoid collector disk IO bottleneck and catastrophic failures – Compression and batching (trade cpu for network) – Push computation into the event collection pipeline (balance IO, Mem, and CPU resource bottlenecks)
  • 13. Data Plane Scalability • Agents are logically partitioned and send to different collectors • Use randomization to pre-specify failovers when many collectors exist – Spreads load if a collector goes down. – Spreads load if new collectors are added to the system.
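One way to read "use randomization to pre-specify failovers" on slide 13 is that each agent gets its own shuffled list of collectors, decided ahead of time. The sketch below assumes that reading; `failover_chain` and `pick_collector` are hypothetical helper names, not Flume functions.

```python
import random


def failover_chain(agent, collectors, seed=0):
    """Return a per-agent ordering of collectors, fixed in advance.

    Seeding with the agent name makes the ordering reproducible, so the
    chain can be computed once and handed to the agent as configuration.
    """
    rng = random.Random(f"{seed}:{agent}")
    chain = list(collectors)
    rng.shuffle(chain)
    return chain


def pick_collector(chain, alive):
    """Return the first collector in the chain that is currently alive."""
    for collector in chain:
        if collector in alive:
            return collector
    return None
```

Because different agents get different orderings, the agents that shared a failed collector fall back to different second choices, so the extra load is spread instead of landing on one neighbour.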
  • 14. Control Plane Scalability • A master controls dynamic configurations of nodes – Uses gossip protocol to keep state consistent – Scales well for configuration reads – Allows for adaptive repartitioning in the future – Nodes can talk to any master.
  • 15. Extensibility • Extensibility answers changing use cases – Invent new connectors • Simple source/sink/decorator APIs • Plug-in architecture – Dynamically wired pipeline processing logic • Many simple operations compose into complex behavior • Connectors – Sources produce data: plain text files, directory, Log4j, FTP, SQL, … – Sinks consume data: console, HDFS, local file system – Decorators modify data sent to sinks
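The source/sink/decorator split on slide 15 can be sketched in a few lines. This is an illustration only: Flume's real APIs are Java interfaces, and every class and function name below (`MemorySink`, `Decorator`, `AddAttribute`, `DropEmpty`, `text_source`) is hypothetical. Events are modelled as a body plus a meta table, as on slide 5.

```python
class MemorySink:
    """Collects events in memory (stand-in for a console or HDFS sink)."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)


class Decorator:
    """Wraps a sink; subclasses may modify or drop events before passing them on."""

    def __init__(self, sink):
        self.sink = sink

    def append(self, event):
        self.sink.append(event)


class AddAttribute(Decorator):
    """Adds one key/value pair to each event's meta table."""

    def __init__(self, sink, key, value):
        super().__init__(sink)
        self.key, self.value = key, value

    def append(self, event):
        meta = dict(event["meta"])
        meta[self.key] = self.value
        self.sink.append({"body": event["body"], "meta": meta})


class DropEmpty(Decorator):
    """Forwards only events whose body is not blank."""

    def append(self, event):
        if event["body"].strip():
            self.sink.append(event)


def text_source(lines):
    """Produces one event per line of text."""
    for line in lines:
        yield {"body": line, "meta": {}}
```

Wiring is then plain composition, for example `DropEmpty(AddAttribute(MemorySink(), "host", "web01"))`, which is the sense in which many simple operations compose into complex behavior.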
  • 17. Manageability • Near-natural-language syntax for node configuration – web-log-agent : tail("/var/log/httpd.log") | agentBESink – web-log-collector : autoCollectorSource | { regex("(Firefox|Internet Explorer)", "browser") => collectorSink("hdfs://namenode/flume-logs/%{browser}") } • One place to specify node sources, sinks and data flows – Basic web interface – Flume Shell: command-line interface – Extended custom management through the master RPC API
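The collector configuration on slide 17 does two things: a regex decorator copies a capture group into an event attribute, and the sink path is expanded from that attribute. A rough sketch of both steps follows. The helpers `extract` and `expand` are hypothetical, and substituting an empty string for a missing attribute is an assumption of this sketch, not documented Flume behavior.

```python
import re

# Matches escapes of the form %{name} in a path template.
ESCAPE = re.compile(r"%\{(\w+)\}")


def extract(event, pattern, attr):
    """Store the first capture group of `pattern` in the event's meta table."""
    match = re.search(pattern, event["body"])
    if match:
        event["meta"][attr] = match.group(1)
    return event


def expand(template, event):
    """Replace each %{name} with the event attribute of that name."""
    # Assumption: a missing attribute expands to an empty string.
    return ESCAPE.sub(lambda m: event["meta"].get(m.group(1), ""), template)
```

With the pattern and template from the slide, a log line that mentions Firefox is routed to a path ending in `flume-logs/Firefox`, so each browser gets its own output location.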
  • 18. Flive – HANBORQ Enhanced Flume • Based on Flume but oriented toward the HANBORQ product ecosystem • The new HTLoad • Enhancements: – Performance – Functionality – Manageability – Hugetable integration • Compatible with original Flume usage
  • 19. Flive – More Than Flume • Efficiency improvement – Driving the pipeline • The native driver is a single thread doing both source-pulling and sink-pushing – A temporary rate mismatch between source and sink slows both • Flive uses two threads, one source-pulling and one sink-pushing, coupled by an internal event queue – Temporary rate variances in source and sink are absorbed by the queue – Yields a 10% to 30% throughput improvement – Introduced node concurrency to maximize target storage bandwidth
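The two-thread driver described on slide 19 is a producer/consumer pair joined by a bounded queue. The sketch below shows that pattern with Python's standard library; it is not Flive's code, and `run_pipeline` and its `capacity` parameter are hypothetical names.

```python
import queue
import threading

_DONE = object()  # sentinel that tells the pusher the source is exhausted


def run_pipeline(source, sink, capacity=1024):
    """Pull events from `source` and push them to `sink` on separate threads."""
    events = queue.Queue(maxsize=capacity)

    def pull():
        for event in source:
            events.put(event)  # blocks only when the queue is full
        events.put(_DONE)

    def push():
        while True:
            event = events.get()  # blocks only when the queue is empty
            if event is _DONE:
                break
            sink(event)

    threads = [threading.Thread(target=pull), threading.Thread(target=push)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A short stall on either side is absorbed as long as the queue is neither full nor empty, which is the decoupling the slide credits for the throughput gain. A single pusher thread preserves event order.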
  • 20. Flive – More Than Flume • Functionality enhancement – The native Flume connector configuration syntax is flat • But connectors are essentially hierarchical • The flat syntax limits connectors to flat assembly • Connector hierarchies must be assembled through hard-coding or ad hoc syntax – Flive introduces a hierarchical syntax • Hierarchical connector architectures can be dynamically wired • For backward compatibility, only Flive connectors support the enhanced syntax
  • 21. Flive – More Than Flume • Ease of use – Zero-configuration plug-in architecture • Native Flume requires manual configuration of plugins • Flive requires no configuration, only minimal conventions – Simpler yet powerful Flive shell – Introduced the translator framework • Node configuration specs may be too complicated to edit manually • A translator helps translate a user domain spec into a Flive/Flume configuration spec • Extensible – Hugetable translator for Hugetable – Basic translator for native Flume: full Flume compatibility – Ease of deployment and management
  • 22. Flive – More Than Flume • As a Hugetable ETL – Sourcing structured data from various sources • FS, FTP, SQL, LOG4J, … – Targeting all Hugetable storage engines • Text File, Sequence File, RCFile, HFile, HBase, … – Filtering unwanted/malformed records – Column transfer over the air • IUD-like single-stream column operations: based on function expressions • Multi-stream operations: pre-join on the fly – Multi-table loading • Like fan-out but with less overhead – Real-time aggregation • Accurate computation: sum(x), count(*) • Probabilistic computation: count(distinct x), top(k), etc.
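Slide 22 separates accurate aggregates from probabilistic ones. The slides do not say which algorithms Flive uses, so the sketch below swaps in linear counting, one well-known estimator for count(distinct x), next to exact running sum and count. The class name `RunningAggregate` and the `bits` parameter are hypothetical.

```python
import hashlib
import math


class RunningAggregate:
    """Exact sum/count plus a linear-counting estimate of distinct values."""

    def __init__(self, bits=8192):
        self.total = 0
        self.count = 0
        self.bits = bits
        self.bitmap = bytearray(bits)  # one byte per slot, for clarity

    def update(self, x):
        self.total += x
        self.count += 1
        # Hash the value to a slot; repeated values always hit the same slot.
        digest = hashlib.sha256(str(x).encode()).digest()
        self.bitmap[int.from_bytes(digest[:8], "big") % self.bits] = 1

    def distinct_estimate(self):
        zeros = self.bits - sum(self.bitmap)
        if zeros == 0:
            return float("inf")  # bitmap saturated; a larger `bits` is needed
        # Linear counting: n is approximately -m * ln(fraction of empty slots).
        return -self.bits * math.log(zeros / self.bits)
```

The memory used is fixed by `bits` no matter how many events pass through, which is what makes this kind of estimate practical inside a streaming pipeline. The price is a small error in the distinct count.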
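  • 23. Runtime Flive • (Architecture diagram; labels recovered from the image: Flume Driver, DataSource, C-puller, Q3, Q4, Tailer, C-pusher, Flume Driver, T-server, A-puller, A-pusher, Q5, multi-threaded decoding, Q1, Q2, network, Decoder, Q6, Driver, Collector, Agent, Q7, multi-threaded append, Appender, HBase, HDFS, Others)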