Flume and Flive Introduction


An introduction to Flume and to Flive, the Hanborq-enhanced Flume, for training.

Published in: Technology

  1. Introduction to Flume and Flive
     July 11, 2012
     Willis Gong
     Big Data Engineering Team, Hanborq Inc.
  2. Topic
     • Flume
       – Definition of the solution
       – Characteristics
       – Core concepts
     • Flive
       – Concepts
       – Improvements
  3. The real-world problem
     • Changing requirements → Extensibility & Manageability
       – In the source
       – In the path
       – In the sink
     • Growing scale → Scalability
       – Volume and node count keep increasing
     • Error-prone environment → Reliability
       – Network failure
       – Service breakdown
  4. Flume: the solution to these problems
     • Flume is:
       – A distributed data collection system
       – A streamlined event processing pipeline
       – An extensible distributed computation framework
     • Flume answers the preceding challenges:
       – Extends easily to new data formats
       – Adapts easily to new collection strategies
       – Scales linearly as new nodes are added
       – Offers multiple levels of reliability
       – Is configurable from a shell or the web
       – Etc.
  5. Core Concepts: Flow and Event
     • Everything is an event
       – body + metadata table
     • A flow is an event pipeline from a particular data source
     • Flows consist of nodes chained together
     • Many flows may overlap on one physical cluster
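The event and flow concepts on this slide can be illustrated with a minimal Python sketch. The names `Event`, `tag_host`, and `run_flow` are hypothetical and are not part of the Flume API; the sketch only assumes that an event is a body plus a metadata table and that a flow is a chain of nodes applied in order.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Event:
    """Hypothetical event: an opaque body plus a metadata table."""
    body: bytes
    meta: Dict[str, str] = field(default_factory=dict)


# A node is modeled as a function from an event stream to an event stream.
Node = Callable[[Iterator[Event]], Iterator[Event]]


def tag_host(host: str) -> Node:
    """Build a node that records the originating host in each event's metadata."""
    def node(events: Iterator[Event]) -> Iterator[Event]:
        for event in events:
            event.meta["host"] = host
            yield event
    return node


def run_flow(source: Iterable[Event], nodes: List[Node]) -> List[Event]:
    """Chain the nodes in order and drain the resulting stream."""
    stream: Iterator[Event] = iter(source)
    for node in nodes:
        stream = node(stream)
    return list(stream)
```

For example, `run_flow([Event(b"GET /")], [tag_host("web01")])` returns the same event with `host` added to its metadata table.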
  6. Core Concepts: Nodes and Plane
     • Data plane:
       – The path of the data flow
       – Composed of one or more nodes in a tiered architecture
         • Two-tier: Agent → Collector
         • Multi-tier: Agent → Processor → Collector
     • Nodes:
       – Each node has a source and a sink
       – A node's role depends on its position in the data path
     • Masters are in the control plane
       – Central control point
       – Lightweight, since no data plane processing is involved
  7. Core Concepts: Agent and Collector
     • Data plane nodes
       – Agent
         • Receives data from an application
       – Processor (optional)
         • Intermediate processing
       – Collector
         • Writes data to permanent storage
  8. Deployment Topology
     • Deployment considerations
       – Agents: depend on the application data source
       – Collectors: depend on the target storage, network topology, load balancing, etc.
  9. Considerations on Data Source
     • Three integration modes:
       – Push: the agent acts as a data collection service for the data source application
       – Pull: the agent polls the data source periodically
       – Embedded: the data source application is itself the agent
  10. Data Plane Reliability
      • Best effort
        – Fire and forget
      • Store on failure + retry
        – Local acks; local errors are detectable
        – Fail over when faults are detected
      • End-to-end reliability
        – End-to-end acks
        – Data survives compound failures
        – At-least-once delivery
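The middle reliability level, store on failure plus retry, can be sketched as follows. This is an illustrative Python model only, not Flume code: `StoreOnFailureSink` and its `send` callback are hypothetical names, the buffer is held in memory rather than on local disk, and a failed send is assumed to surface as `ConnectionError`.

```python
from collections import deque
from typing import Callable, Deque


class StoreOnFailureSink:
    """Buffers events locally while the downstream is unreachable, then retries."""

    def __init__(self, send: Callable[[bytes], None]) -> None:
        self._send = send                      # hypothetical downstream call
        self._pending: Deque[bytes] = deque()  # stand-in for a local on-disk log

    def append(self, event: bytes) -> None:
        """Store the event first, then attempt delivery."""
        self._pending.append(event)
        self.flush()

    def flush(self) -> int:
        """Deliver buffered events in order; stop at the first failure.

        Returns the number of events still pending.
        """
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # a locally detectable error: keep the event and retry later
            self._pending.popleft()  # treat a successful send as a local ack
        return len(self._pending)
```

An event leaves the buffer only after its send succeeds, so a collector outage delays delivery instead of dropping data. End-to-end acks and failover to another collector are outside the scope of this sketch.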
  11. Control Plane Reliability
      • Master design
        – Lightweight process
          • Isolated from data plane processing
        – Lazy design
          • Simply answers a few node requests
      • Service availability
        – Watchdog
        – Multiple masters as backup
        – Service availability across reboots
          • Configuration data is persisted to ZooKeeper
  12. Data Plane Scalability
      • The data plane is horizontally scalable
        – Add collectors to increase availability and to handle more data
          • Assumes a single agent will not dominate a collector
          • Fewer connections to HDFS
          • Larger, more efficient writes to HDFS
      • Agents have mechanisms for machine resource tradeoffs
        – Write logs locally to avoid collector disk IO bottlenecks and catastrophic failures
        – Compression and batching (trade CPU for network)
        – Push computation into the event collection pipeline (balance IO, memory, and CPU bottlenecks)
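The "compression and batching" tradeoff can be illustrated with a short Python sketch. The function names are hypothetical, gzip is chosen only for illustration, and event bodies are assumed to contain no newline characters so that a newline can serve as the separator.

```python
import gzip
from typing import Iterable, Iterator, List


def batches(events: Iterable[bytes], size: int) -> Iterator[List[bytes]]:
    """Group events into lists of at most `size` items."""
    batch: List[bytes] = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly short, batch


def pack(batch: List[bytes]) -> bytes:
    """Spend CPU on the agent: join the batch and compress it into one payload."""
    return gzip.compress(b"\n".join(batch))


def unpack(payload: bytes) -> List[bytes]:
    """Reverse of pack, as a collector would apply it."""
    return gzip.decompress(payload).split(b"\n")
```

Sending one compressed payload per batch reduces both the number of network round trips and the bytes on the wire, at the cost of CPU time and some added latency on the agent.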
  13. Data Plane Scalability
      • Agents are logically partitioned and send to different collectors
      • Use randomization to pre-specify failovers when many collectors exist
        – Spreads load if a collector goes down
        – Spreads load if new collectors are added to the system
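One way to picture randomized, pre-specified failover is the Python sketch below. It is an assumption-laden illustration, not Flume's actual algorithm: `failover_chains` is a hypothetical helper, primaries are assigned round-robin to model the logical partitioning, and backups are drawn in random order from the remaining collectors.

```python
import random
from typing import Dict, List


def failover_chains(agents: List[str], collectors: List[str],
                    depth: int = 2, seed: int = 0) -> Dict[str, List[str]]:
    """Return, per agent, a primary collector followed by randomized backups."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    chains: Dict[str, List[str]] = {}
    for i, agent in enumerate(agents):
        primary = collectors[i % len(collectors)]  # partition agents across collectors
        backups = [c for c in collectors if c != primary]
        rng.shuffle(backups)  # each agent gets its own backup order
        chains[agent] = [primary] + backups[:depth]
    return chains
```

Because each agent shuffles its backups independently, the agents of a failed collector fall back to different collectors rather than all moving to the same one.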
  14. Control Plane Scalability
      • A master controls the dynamic configuration of nodes
        – Uses a gossip protocol to keep state consistent
        – Scales well for configuration reads
        – Allows for adaptive repartitioning in the future
        – Nodes can talk to any master
  15. Extensibility
      • Extensibility answers changing use cases
        – Invent new connectors
          • Simple source/sink/decorator APIs
          • Plug-in architecture
        – Dynamically wired pipeline processing logic
          • Many simple operations compose into complex behavior
      • Connectors
        – Sources produce data: plain text files, directories, Log4j, FTP, SQL, …
        – Sinks consume data: console, HDFS, local file system
        – Decorators modify data sent to sinks
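The sink and decorator roles, and the idea that simple operations compose into complex behavior, can be sketched in Python. The classes here are hypothetical and far simpler than Flume's Java connector APIs; events are modeled as plain dictionaries with a `body` key.

```python
from typing import Callable, List, Optional

Event = dict  # hypothetical event shape: {"body": str, ...attributes}


class Sink:
    """Anything that consumes events."""
    def append(self, event: Event) -> None:
        raise NotImplementedError


class ListSink(Sink):
    """Terminal sink that collects events (stands in for console or HDFS)."""
    def __init__(self) -> None:
        self.events: List[Event] = []

    def append(self, event: Event) -> None:
        self.events.append(event)


class Decorator(Sink):
    """Applies fn to each event before passing it to the wrapped sink.

    fn may return None to drop the event.
    """
    def __init__(self, fn: Callable[[Event], Optional[Event]], inner: Sink) -> None:
        self._fn = fn
        self._inner = inner

    def append(self, event: Event) -> None:
        out = self._fn(event)
        if out is not None:
            self._inner.append(out)


def drop_empty(event: Event) -> Optional[Event]:
    """Filter: discard events whose body is blank."""
    return event if event["body"].strip() else None


def add_length(event: Event) -> Event:
    """Enrichment: attach the body length as an attribute."""
    return {**event, "length": len(event["body"])}


# Two simple decorators wired in front of one sink.
hdfs_stub = ListSink()
pipeline = Decorator(drop_empty, Decorator(add_length, hdfs_stub))
```

Appending `{"body": "GET /index.html"}` to `pipeline` stores the event with a `length` attribute in `hdfs_stub.events`, while an event with a blank body is dropped before it reaches the sink.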
  16. Extensibility
      • Example
  17. Manageability
      • Near-natural-language syntax for node configuration
        – web-log-agent : tail("/var/log/httpd.log") | agentBESink
        – web-log-collector : autoCollectorSource | { regex("(Firefox|Internet Explorer)", "browser") => collectorSink("hdfs://namenode/flume-logs/%{browser}") }
      • One place to specify node sources, sinks, and data flows
        – Basic web interface
        – Flume Shell: command-line interface
        – Extended custom management through the master RPC API
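The collector configuration on this slide extracts a `browser` attribute with a regular expression and then uses it in the output path. The Python sketch below mimics that behavior under stated assumptions: `extract_attr` and `expand_path` are hypothetical helpers, not Flume functions, and expanding a missing attribute to an empty string is a choice made for this sketch rather than documented Flume behavior.

```python
import re
from typing import Dict


def extract_attr(body: str, pattern: str, attr: str, meta: Dict[str, str]) -> None:
    """Store the first capture group of `pattern` in `meta[attr]` when it matches."""
    match = re.search(pattern, body)
    if match:
        meta[attr] = match.group(1)


def expand_path(template: str, meta: Dict[str, str]) -> str:
    """Replace each %{name} escape in the template with the event's attribute."""
    return re.sub(r"%\{(\w+)\}", lambda m: meta.get(m.group(1), ""), template)
```

With a log line containing `Firefox`, `extract_attr(line, "(Firefox|Internet Explorer)", "browser", meta)` followed by `expand_path("hdfs://namenode/flume-logs/%{browser}", meta)` yields `hdfs://namenode/flume-logs/Firefox`, so events are bucketed by browser.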
  18. Flive – HANBORQ Enhanced Flume
      • Based on Flume, but oriented toward the HANBORQ product ecosystem
      • The new HTLoad
      • Enhancements:
        – Performance
        – Functionality
        – Manageability
        – Hugetable integration
      • Compatible with original Flume usage
  19. Flive – More Than Flume
      • Efficiency improvement
        – Driving the pipeline
          • The native driver is a single thread doing both source-pulling and sink-pushing
            – Transient rate mismatches between source and sink affect each other
          • Flive uses two threads, one source-pulling and one sink-pushing, coupled by an internal event queue
            – Transient rate variances in source and sink are absorbed by the queue
            – Contributes a 10%~30% throughput improvement
        – Introduced node concurrency to maximize target storage bandwidth
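The two-thread driver described on this slide follows the classic producer/consumer pattern. The Python sketch below illustrates that pattern only; it is not Flive's driver. `drive`, the sentinel object, and the default queue capacity are all assumptions made for the example.

```python
import queue
import threading
from typing import Callable, Iterable

_STOP = object()  # sentinel telling the pusher that the source is exhausted


def drive(source: Iterable[bytes], sink: Callable[[bytes], None],
          capacity: int = 1024) -> None:
    """Run source-pulling and sink-pushing on separate threads, coupled by a queue."""
    events: queue.Queue = queue.Queue(maxsize=capacity)

    def pull() -> None:
        for event in source:
            events.put(event)  # blocks only when the queue is full
        events.put(_STOP)

    def push() -> None:
        while True:
            event = events.get()  # blocks only when the queue is empty
            if event is _STOP:
                break
            sink(event)

    threads = [threading.Thread(target=pull), threading.Thread(target=push)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

A short stall on one side is absorbed by the bounded queue, so the other side keeps working until the queue fills or drains. In a single-threaded driver, every stall in the source or the sink blocks the other directly.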
  20. Flive – More Than Flume
      • Functionality enhancement
        – The native Flume connector configuration spec syntax is flat
          • But connectors are essentially hierarchical
          • The flat syntax limits connectors to flat assembly
          • A connector hierarchy must be assembled through hard-coding or ad hoc syntax
        – Flive introduces a hierarchical syntax
          • A hierarchical connector architecture can be wired dynamically
          • For backward compatibility, only Flive connectors support the enhanced syntax
  21. Flive – More Than Flume
      • Ease of use
        – Zero-configuration plug-in architecture
          • Native Flume requires manual configuration of plug-ins
          • Flive requires no configuration, only minimal conventions
        – Simpler yet powerful Flive shell
        – Introduced the translator framework
          • Node configuration specs may be too complicated to edit manually
          • A translator helps translate a user domain spec into a Flive/Flume configuration spec
          • Extensible
            – Hugetable translator for Hugetable
            – Basic translator for native Flume: full Flume compatibility
        – Ease of deployment and management
  22. Flive – More Than Flume
      • As a Hugetable ETL
        – Sourcing structured data from various sources
          • FS, FTP, SQL, LOG4J, …
        – Targeting all Hugetable storage engines
          • Text File, Sequence File, RCFile, HFile, HBase, …
        – Filtering unwanted/malformed records
        – Column transformation on the fly
          • IUD-like single-stream column ops: based on function expressions
          • Multi-stream ops: pre-join on the fly
        – Multi-table loading
          • Like fan-out, but with less overhead
        – Real-time aggregation
          • Accurate computation: sum(x), count(*)
          • Probabilistic computation: count(distinct x), top(k), etc.
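The difference between accurate and probabilistic real-time aggregation can be sketched in Python. The slides do not say which estimator Flive uses; this example swaps in linear counting as one well-known way to approximate count(distinct x) in fixed memory. `StreamAggregate` and its parameters are hypothetical.

```python
import hashlib
import math


class StreamAggregate:
    """Exact sum and count, plus an approximate distinct count via linear counting."""

    def __init__(self, bits: int = 4096) -> None:
        self.count = 0    # exact count(*)
        self.total = 0.0  # exact sum(x)
        self._bits = bits
        self._bitmap = bytearray(bits)  # one byte per slot, for simplicity

    def add(self, x: float, key: str) -> None:
        """Fold one record into the running aggregates."""
        self.count += 1
        self.total += x
        digest = hashlib.md5(key.encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % self._bits
        self._bitmap[slot] = 1  # repeated keys hit the same slot

    def distinct_estimate(self) -> float:
        """Linear counting estimate: -m * ln(fraction of empty slots)."""
        zeros = self._bits - sum(self._bitmap)
        if zeros == 0:
            return float("inf")  # bitmap saturated; a larger `bits` is needed
        return -self._bits * math.log(zeros / self._bits)
```

The exact aggregates need only two numbers of state, while the distinct count needs a fixed-size bitmap regardless of how many records arrive. The estimate degrades as the number of distinct keys approaches the bitmap size.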
  23. Runtime Flive
      [Diagram of the Flive runtime. Labels: Flume Driver, DataSource, Tailer, A-puller, A-pusher, network, T-server, C-puller, C-pusher, queues Q1 to Q7, Decoder (multi-threaded decoding), Appender (multi-threaded append), Agent, Collector, HBase, HDFS, Others]
  24. Thank you!