SlideShare a Scribd company logo
1 of 31
Download to read offline
Apache Flume 
Data Aggregation At Scale 
Arvind Prabhakar 
© 2014 StreamSets Inc., All rights reserved 
© 2014 StreamSets, Inc.
Who am I? 
© 2014 StreamSets, Inc. 
❏ Founder/CTO 
Apache Software Foundation 
❏ Flume - PMC Chair 
❏ Sqoop - PMC Chair 
❏ Storm - PMC, Committer 
❏ MetaModel - Mentor 
❏ Sentry - Mentor 
❏ ASF Member 
Previously... 
❏ Cloudera 
❏ Informatica
What is Flume? 
© 2014 StreamSets, Inc. 
Logs 
Files 
Click 
Streams 
Sensors 
Devices 
Database 
Logs 
Social Data 
Streams 
Feeds 
Other 
Raw Storage 
(HDFS, S3) 
EDW, NoSQL 
(Hive, Impala, 
HBase, 
Cassandra) 
Search 
(Solr, 
ElasticSearch) 
Enterprise Data 
Infrastructure 
Apache Flume is a 
continuous data 
ingestion system that 
is... 
● open-source, 
● reliable, 
● scalable, 
● manageable, 
● customizable, 
...and designed for Big 
Data ecosystem.
...for Big Data ecosystem? 
“Big data is an all-encompassing term for any collection 
of data sets so large and complex that it becomes difficult 
to process using traditional data processing applications.” 
Big Data from a Data Ingestion Perspective 
● Logical Data Sources are physically distributed 
● Data production is continuous / never ending 
● Data structure and semantics change without notice 
© 2014 StreamSets, Inc.
Physically Distributed Data Sources 
© 2014 StreamSets, Inc. 
● Many physical sources 
that produce data 
● Number of physical 
sources changes 
constantly 
● Sources may exist in 
different governance 
zones, data centers, 
continents...
Continuous Data Production 
“Every two days now we create as much information 
as we did from the dawn of civilization up until 2003” 
© 2014 StreamSets, Inc. 
- Eric Schmidt, 2010 
● Weather 
● Traffic 
● Automobiles 
● Trains 
● Airplanes 
● Geological/Seismic 
● Oceanographic 
● Smart Phones 
● Health Accessories 
● Medical Devices 
● Home Automation 
● Digital Cameras 
● Social Media 
● Geolocation 
● Shop Floor Sensors 
● Network Activity 
● Industry Appliances 
● Security/Surveillance 
● Server Workloads 
● Digital Telephony 
● Bio-simulations...
Ever Changing Structure of Data 
● One of your data centers 
upgrade to IPv6 192.168.0.4 
© 2014 StreamSets, Inc. 
fe80::21b:21ff:fe83:90fa 
M0137: User {jonsmith} granted access to {accounts} 
M0137: [jonsmith] granted access to [sys.accounts] 
{ 
“first”:”jon”, 
“last”:”smith”, 
“add1”:”123 Main St.”, 
“add2”:”Ste - 4”, 
“city”:”Little Town”, 
“state”:”AZ”, 
“zip”: “12121” 
} 
{ 
“first”:”jon”, 
“last”:”smith”, 
“add1”:”123 Main St.”, 
“add2”:”Ste - 4”, 
“city”:”Little Town”, 
“state”:”AZ”, 
“zip”: “12121”, 
“phone”: “(408) 555-1212” 
} 
● Application developer 
changes logs (again) 
● JSON data may contain 
more attributes than 
expected
So, from Data Ingestion Perspective: 
Massive collection of ever changing physical sources... 
Never ending data production... 
Data structure and semantics evolve continuously... 
© 2014 StreamSets, Inc.
© 2014 StreamSets, Inc. 
Flume to the Rescue!
Apache Flume 
● Originally designed to be a log 
aggregation system by 
Cloudera Engineers 
● Evolved to handle any type of 
streaming event data 
● Low-cost of installation, 
operation and maintenance 
● Highly customizable and 
extendable 
© 2014 StreamSets, Inc.
A Closer Look at Flume 
Input Agent Agent Agent Agent Destination 
● Distributed Pipeline Architecture 
● Optimized for commonly used data sources and destinations 
● Built in support for contextual routing 
● Fully customizable and extendable 
© 2014 StreamSets, Inc.
Anatomy of a Flume Agent 
© 2014 StreamSets, Inc. 
Flume Agent 
Source 
Sink 
Channel 
Incoming 
Data 
Outgoing 
Data 
Source 
● Accepts incoming 
Data 
● Scales as required 
● Writes data to 
Channel 
Sink 
● Removes data from 
Channel 
● Sends data to 
downstream Agent or 
Destination 
Channel 
● Stores data in the 
order received
Transactional Data Exchange 
Upstream Sink TX 
© 2014 StreamSets, Inc. 
Flume Agent 
Source 
Sink 
Channel 
Incoming 
Data 
Outgoing 
Data 
Source TX 
Sink TX 
● Source uses transactions to write to the channel 
● Sink uses transactions to remove data from the channel 
● Sink transaction commits only after successful transfer of data 
● This ensures no data loss in Flume pipeline
Routing and Replicating 
© 2014 StreamSets, Inc. 
Flume Agent 
Source Sink 1 
Channel 1 Incoming 
Data 
Outgoing Data 
Channel 2 
Sink 2 Outgoing Data 
● Source can replicate or multiplex data across many channels 
● Metadata headers can be used to do contextual selection of 
channels 
● Channels can be drained by different sinks to different 
destinations or pipelines
Why Channels? 
● Buffers data and insulates downstream from load spikes 
● Provides persistent store for data in case the process restarts 
● Provides flow ordering* and transactional guarantees 
© 2014 StreamSets, Inc.
© 2014 StreamSets, Inc. 
Use-Case: Log Aggregation
Starting Out Simple 
● You would like to move your web-server 
© 2014 StreamSets, Inc. 
logs to HDFS 
● Let’s assume there are only 3 web 
servers at the time of launch 
● Ad-hoc solution will likely suffice! 
Challenges 
● How do you manage your output paths on HDFS? 
● How do you maintain your client code in face of changing 
environment as well as requirements?
Adding a Single Flume Agent 
Advantages 
● Insulation from HDFS downtime 
● Quick offload of logs from Web 
Server machines 
● Better Network utilization 
Challenges 
● What if the Flume node goes down? 
● Can one Flume node accommodate all load from Web Servers? 
© 2014 StreamSets, Inc.
Adding Two Flume Agents 
Advantages 
● Redundancy and Availability 
● Better handling of downstream 
failures 
● Automatic load balancing and 
failover 
Challenges 
● What happens when new Web Servers are added? 
● Can two Flume Agents keep up with all the load from more Web 
Servers? 
© 2014 StreamSets, Inc.
Handling a Server Farm 
© 2014 StreamSets, Inc. 
A Converging Flow 
● Traffic is aggregated by Tier-2 and 
Tier-3 before being put into 
destination system 
● Closer a tier is to the destination, 
larger the batch size it delivers 
downstream 
● Optimized handling of destination 
systems
Data Volume Per Agent 
© 2014 StreamSets, Inc. 
Batch Size Variation per Agent 
● Event volume is least in the 
outermost tier 
● Event volume increases as the 
flow converges 
● Event volume is highest in the 
innermost tier
Data Volume Per Tier 
© 2014 StreamSets, Inc. 
Batch Size Variation per Tier 
● In steady state, all tiers carry 
same event volume 
● Transient variations in flow are 
absorbed and ironed out by 
channels 
● Load spikes are handled smoothly 
without overwhelming the 
infrastructure
Planning and Sizing Flume Topology 
for Log-Aggregation Use-Case 
© 2014 StreamSets, Inc.
Planning and Sizing Your Topology 
What we need to know: 
● Number of Web Servers 
● Log volume per Web 
Server per unit time 
● Destination System and 
layout (Routing 
Requirements) 
● Worst case downtime for 
destination system 
© 2014 StreamSets, Inc. 
What we will calculate: 
● Number of tiers 
● Exit Batch Sizes 
● Channel capacity
Calculating Number of Tiers 
Rule of Thumb 
One Aggregating Agent (A) can be used with 
anywhere from 4 to 16 client Agents 
Considerations 
● Must handle projected ingest volume 
● Resulting number of tiers should provide for 
routing, load-balancing and failover 
requirements 
Gotchas 
Load test to ensure that steady state and peak 
load are addressed with adequate failover capacity 
© 2014 StreamSets, Inc.
Calculating Exit Batch Size 
Rule of Thumb 
Exit batch size is same as total exit data volume 
divided by number of Agents in a tier 
Considerations 
● Having some extra room is good 
● Keep contextual routing in mind 
● Consider duplication impact when batch 
sizes are large 
Gotchas 
Load test fail-over scenario to ensure near 
steady-state drain 
© 2014 StreamSets, Inc.
Calculating Channel Capacity 
Gotchas 
© 2014 StreamSets, Inc. 
Source 
Sink X 
Source 
Sink X 
X 
Rule of Thumb 
Equal to worst case data ingest rate sustained 
over the worst case downstream outage 
interval 
Considerations 
● Multiple disks will yield better performance 
● Channel size impacts the back-pressure 
buildup in the pipeline 
You may need more disk space than the 
physical footprint of the data size
To Recap 
Number of Tiers 
Calculated with upstream to downstream Agent ration ranging from 4:1 to 
16:1. Factor in routing, failover, load-balancing requirements... 
Exit Batch Size 
Calculated for steady state data volume exiting the tier, divided by 
number of Agents in that tier. Factor in contextual routing and duplication 
due to transient failure impact... 
Channel Capacity 
Calculated as worst case ingest rate sustained over the worst case 
downstream downtime. Factor in number of disks used etc... 
© 2014 StreamSets, Inc. 
...and that’s all there is to it!
Some Highlights of Flume 
● Flume is suitable for large volume data collection, especially when 
data is being produced in multiple locations 
● Once planned and sized appropriately, Flume will practically run 
itself without any operational intervention 
● Flume provides weak ordering guarantee, i.e., in the absence of 
failures the data will arrive in the order it was received in the Flume 
pipeline 
● Transactional exchange ensures that Flume never loses any data in 
transit between Agents. Sinks use transactions to ensure data is not 
lost at point of ingest or terminal destinations. 
● Flume has rich out-of-the box features such as contextual routing, 
and support for popular data sources and destination systems 
© 2014 StreamSets, Inc.
Things that could be better... 
● Handling of poison events 
● Ability to tail files 
● Ability to handle preset data formats such as JSON, CSV, XML 
● Centralized configuration 
● Once-only delivery semantics 
● ...and more 
Remember: patches are welcome! 
© 2014 StreamSets, Inc.
Thank You! 
Contact: 
● Email: arvind at streamsets dot com 
● Twitter: @aprabhakar 
More on Flume: 
● http://flume.apache.org/ 
● User Mailing List: user-subscribe@flume.apache.org 
● Developer Mailing List: dev-subscribe@flume.apache.org 
● JIRA: https://issues.apache.org/jira/browse/FLUME 
© 2014 StreamSets, Inc.

More Related Content

What's hot

Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...DataWorks Summit
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Chris Nauroth
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop DataWorks Summit/Hadoop Summit
 
Big data: Loading your data with flume and sqoop
Big data:  Loading your data with flume and sqoopBig data:  Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoopChristophe Marchal
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementDataWorks Summit/Hadoop Summit
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with FlumeRatnakar Pawar
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseDataWorks Summit/Hadoop Summit
 
Apache Flume
Apache FlumeApache Flume
Apache FlumeGetInData
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduDataWorks Summit
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...DataWorks Summit/Hadoop Summit
 
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...DataWorks Summit
 

What's hot (20)

Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
 
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop
 
Big data: Loading your data with flume and sqoop
Big data:  Loading your data with flume and sqoopBig data:  Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoop
 
Curb your insecurity with HDP
Curb your insecurity with HDPCurb your insecurity with HDP
Curb your insecurity with HDP
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop Management
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with Flume
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBaseApache Phoenix and HBase: Past, Present and Future of SQL over HBase
Apache Phoenix and HBase: Past, Present and Future of SQL over HBase
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Low latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache KuduLow latency high throughput streaming using Apache Apex and Apache Kudu
Low latency high throughput streaming using Apache Apex and Apache Kudu
 
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Twitter with hadoop for oow
Twitter with hadoop for oowTwitter with hadoop for oow
Twitter with hadoop for oow
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
 
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
 
YARN Federation
YARN Federation YARN Federation
YARN Federation
 

Viewers also liked

大型电商的数据服务的要点和难点
大型电商的数据服务的要点和难点 大型电商的数据服务的要点和难点
大型电商的数据服务的要点和难点 Chao Zhu
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVROairisData
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis Omid Vahdaty
 
Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Adam Muise
 
Implementing and running a secure datalake from the trenches
Implementing and running a secure datalake from the trenches Implementing and running a secure datalake from the trenches
Implementing and running a secure datalake from the trenches DataWorks Summit
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascienceAdam Muise
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
 

Viewers also liked (10)

大型电商的数据服务的要点和难点
大型电商的数据服务的要点和难点 大型电商的数据服务的要点和难点
大型电商的数据服务的要点和难点
 
Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
Introduction to streaming and messaging flume,kafka,SQS,kinesis
Introduction to streaming and messaging  flume,kafka,SQS,kinesis Introduction to streaming and messaging  flume,kafka,SQS,kinesis
Introduction to streaming and messaging flume,kafka,SQS,kinesis
 
Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015
 
Implementing and running a secure datalake from the trenches
Implementing and running a secure datalake from the trenches Implementing and running a secure datalake from the trenches
Implementing and running a secure datalake from the trenches
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascience
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Flume vs. kafka
Flume vs. kafkaFlume vs. kafka
Flume vs. kafka
 

Similar to Data Aggregation At Scale Using Apache Flume

Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Empowering Real-Time Decision Making with Data Streaming
Empowering Real-Time Decision Making with Data StreamingEmpowering Real-Time Decision Making with Data Streaming
Empowering Real-Time Decision Making with Data StreamingSafe Software
 
Full Stream Ahead: Authoring Workflows for Scalable Stream Processing
Full Stream Ahead: Authoring Workflows for Scalable Stream ProcessingFull Stream Ahead: Authoring Workflows for Scalable Stream Processing
Full Stream Ahead: Authoring Workflows for Scalable Stream ProcessingSafe Software
 
Network characteristics of the cloud
Network characteristics of the cloudNetwork characteristics of the cloud
Network characteristics of the cloudCloud Genius
 
Network visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetryNetwork visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetrypphaal
 
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)Apache Apex
 
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...Data Con LA
 
Apache Apex - Hadoop Users Group
Apache Apex - Hadoop Users GroupApache Apex - Hadoop Users Group
Apache Apex - Hadoop Users GroupPramod Immaneni
 
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...Yahoo Developer Network
 
Global Trading Infrastructure Services
Global Trading Infrastructure ServicesGlobal Trading Infrastructure Services
Global Trading Infrastructure ServicesQuanthouse
 
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunk
 
Anatomy behind Fast Data Applications.pptx
Anatomy behind Fast Data Applications.pptxAnatomy behind Fast Data Applications.pptx
Anatomy behind Fast Data Applications.pptxdusavamsikrisna
 
HDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical WorkshopHDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical WorkshopHortonworks
 
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Impetus Technologies
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesAlexander Penev
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceDatabricks
 
Lag. Crackle. Pause. Keeping Your Unified Communications in Check.
Lag. Crackle. Pause. Keeping Your Unified Communications in Check.Lag. Crackle. Pause. Keeping Your Unified Communications in Check.
Lag. Crackle. Pause. Keeping Your Unified Communications in Check.Zenoss
 
SplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunk
 

Similar to Data Aggregation At Scale Using Apache Flume (20)

Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Empowering Real-Time Decision Making with Data Streaming
Empowering Real-Time Decision Making with Data StreamingEmpowering Real-Time Decision Making with Data Streaming
Empowering Real-Time Decision Making with Data Streaming
 
Full Stream Ahead: Authoring Workflows for Scalable Stream Processing
Full Stream Ahead: Authoring Workflows for Scalable Stream ProcessingFull Stream Ahead: Authoring Workflows for Scalable Stream Processing
Full Stream Ahead: Authoring Workflows for Scalable Stream Processing
 
Network characteristics of the cloud
Network characteristics of the cloudNetwork characteristics of the cloud
Network characteristics of the cloud
 
Network visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetryNetwork visibility and control using industry standard sFlow telemetry
Network visibility and control using industry standard sFlow telemetry
 
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)
 
Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
 
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
Big Data Day LA 2015 - Always-on Ingestion for Data at Scale by Arvind Prabha...
 
Apache Apex - Hadoop Users Group
Apache Apex - Hadoop Users GroupApache Apex - Hadoop Users Group
Apache Apex - Hadoop Users Group
 
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...
 
Global Trading Infrastructure Services
Global Trading Infrastructure ServicesGlobal Trading Infrastructure Services
Global Trading Infrastructure Services
 
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
 
Anatomy behind Fast Data Applications.pptx
Anatomy behind Fast Data Applications.pptxAnatomy behind Fast Data Applications.pptx
Anatomy behind Fast Data Applications.pptx
 
HDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical WorkshopHDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical Workshop
 
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
Real-time Streaming Analytics for Enterprises based on Apache Storm - Impetus...
 
Zero Downtime JEE Architectures
Zero Downtime JEE ArchitecturesZero Downtime JEE Architectures
Zero Downtime JEE Architectures
 
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a ServiceZeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
 
Lag. Crackle. Pause. Keeping Your Unified Communications in Check.
Lag. Crackle. Pause. Keeping Your Unified Communications in Check.Lag. Crackle. Pause. Keeping Your Unified Communications in Check.
Lag. Crackle. Pause. Keeping Your Unified Communications in Check.
 
SplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding Overview
 

Recently uploaded

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfryanfarris8
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024Mind IT Systems
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdfPearlKirahMaeRagusta1
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...kalichargn70th171
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 

Recently uploaded (20)

TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdfAzure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
Azure_Native_Qumulo_High_Performance_Compute_Benchmarks.pdf
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 

Data Aggregation At Scale Using Apache Flume

  • 1. Apache Flume Data Aggregation At Scale Arvind Prabhakar © 2014 StreamSets Inc., All rights reserved © 2014 StreamSets, Inc.
  • 2. Who am I? © 2014 StreamSets, Inc. ❏ Founder/CTO Apache Software Foundation ❏ Flume - PMC Chair ❏ Sqoop - PMC Chair ❏ Storm - PMC, Committer ❏ MetaModel - Mentor ❏ Sentry - Mentor ❏ ASF Member Previously... ❏ Cloudera ❏ Informatica
  • 3. What is Flume? © 2014 StreamSets, Inc. Logs Files Click Streams Sensors Devices Database Logs Social Data Streams Feeds Other Raw Storage (HDFS, S3) EDW, NoSQL (Hive, Impala, HBase, Cassandra) Search (Solr, ElasticSearch) Enterprise Data Infrastructure Apache Flume is a continuous data ingestion system that is... ● open-source, ● reliable, ● scalable, ● manageable, ● customizable, ...and designed for Big Data ecosystem.
  • 4. ...for Big Data ecosystem? “Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.” Big Data from a Data Ingestion Perspective ● Logical Data Sources are physically distributed ● Data production is continuous / never ending ● Data structure and semantics change without notice © 2014 StreamSets, Inc.
  • 5. Physically Distributed Data Sources © 2014 StreamSets, Inc. ● Many physical sources that produce data ● Number of physical sources changes constantly ● Sources may exist in different governance zones, data centers, continents...
  • 6. Continuous Data Production “Every two days now we create as much information as we did from the dawn of civilization up until 2003” © 2014 StreamSets, Inc. - Eric Schmidt, 2010 ● Weather ● Traffic ● Automobiles ● Trains ● Airplanes ● Geological/Seismic ● Oceanographic ● Smart Phones ● Health Accessories ● Medical Devices ● Home Automation ● Digital Cameras ● Social Media ● Geolocation ● Shop Floor Sensors ● Network Activity ● Industry Appliances ● Security/Surveillance ● Server Workloads ● Digital Telephony ● Bio-simulations...
  • 7. Ever Changing Structure of Data ● One of your data centers upgrade to IPv6 192.168.0.4 © 2014 StreamSets, Inc. fe80::21b:21ff:fe83:90fa M0137: User {jonsmith} granted access to {accounts} M0137: [jonsmith] granted access to [sys.accounts] { “first”:”jon”, “last”:”smith”, “add1”:”123 Main St.”, “add2”:”Ste - 4”, “city”:”Little Town”, “state”:”AZ”, “zip”: “12121” } { “first”:”jon”, “last”:”smith”, “add1”:”123 Main St.”, “add2”:”Ste - 4”, “city”:”Little Town”, “state”:”AZ”, “zip”: “12121”, “phone”: “(408) 555-1212” } ● Application developer changes logs (again) ● JSON data may contain more attributes than expected
  • 8. So, from Data Ingestion Perspective: Massive collection of ever changing physical sources... Never ending data production... Data structure and semantics evolve continuously... © 2014 StreamSets, Inc.
  • 9. © 2014 StreamSets, Inc. Flume to the Rescue!
  • 10. Apache Flume ● Originally designed to be a log aggregation system by Cloudera Engineers ● Evolved to handle any type of streaming event data ● Low-cost of installation, operation and maintenance ● Highly customizable and extendable © 2014 StreamSets, Inc.
  • 11. A Closer Look at Flume Input Agent Agent Agent Agent Destination ● Distributed Pipeline Architecture ● Optimized for commonly used data sources and destinations ● Built in support for contextual routing ● Fully customizable and extendable © 2014 StreamSets, Inc.
  • 12. Anatomy of a Flume Agent © 2014 StreamSets, Inc. Flume Agent Source Sink Channel Incoming Data Outgoing Data Source ● Accepts incoming Data ● Scales as required ● Writes data to Channel Sink ● Removes data from Channel ● Sends data to downstream Agent or Destination Channel ● Stores data in the order received
  • 13. Transactional Data Exchange Upstream Sink TX © 2014 StreamSets, Inc. Flume Agent Source Sink Channel Incoming Data Outgoing Data Source TX Sink TX ● Source uses transactions to write to the channel ● Sink uses transactions to remove data from the channel ● Sink transaction commits only after successful transfer of data ● This ensures no data loss in Flume pipeline
  • 14. Routing and Replicating © 2014 StreamSets, Inc. Flume Agent Source Sink 1 Channel 1 Incoming Data Outgoing Data Channel 2 Sink 2 Outgoing Data ● Source can replicate or multiplex data across many channels ● Metadata headers can be used to do contextual selection of channels ● Channels can be drained by different sinks to different destinations or pipelines
  • 15. Why Channels? ● Buffers data and insulates downstream from load spikes ● Provides persistent store for data in case the process restarts ● Provides flow ordering* and transactional guarantees © 2014 StreamSets, Inc.
  • 16. © 2014 StreamSets, Inc. Use-Case: Log Aggregation
  • 17. Starting Out Simple ● You would like to move your web-server © 2014 StreamSets, Inc. logs to HDFS ● Let’s assume there are only 3 web servers at the time of launch ● Ad-hoc solution will likely suffice! Challenges ● How do you manage your output paths on HDFS? ● How do you maintain your client code in face of changing environment as well as requirements?
  • 18. Adding a Single Flume Agent Advantages ● Insulation from HDFS downtime ● Quick offload of logs from Web Server machines ● Better Network utilization Challenges ● What if the Flume node goes down? ● Can one Flume node accommodate all load from Web Servers? © 2014 StreamSets, Inc.
  • 19. Adding Two Flume Agents Advantages ● Redundancy and Availability ● Better handling of downstream failures ● Automatic load balancing and failover Challenges ● What happens when new Web Servers are added? ● Can two Flume Agents keep up with all the load from more Web Servers? © 2014 StreamSets, Inc.
  • 20. Handling a Server Farm © 2014 StreamSets, Inc. A Converging Flow ● Traffic is aggregated by Tier-2 and Tier-3 before being put into destination system ● Closer a tier is to the destination, larger the batch size it delivers downstream ● Optimized handling of destination systems
  • 21. Data Volume Per Agent © 2014 StreamSets, Inc. Batch Size Variation per Agent ● Event volume is least in the outermost tier ● Event volume increases as the flow converges ● Event volume is highest in the innermost tier
  • 22. Data Volume Per Tier © 2014 StreamSets, Inc. Batch Size Variation per Tier ● In steady state, all tiers carry same event volume ● Transient variations in flow are absorbed and ironed out by channels ● Load spikes are handled smoothly without overwhelming the infrastructure
  • 23. Planning and Sizing Flume Topology for Log-Aggregation Use-Case © 2014 StreamSets, Inc.
  • 24. Planning and Sizing Your Topology What we need to know: ● Number of Web Servers ● Log volume per Web Server per unit time ● Destination System and layout (Routing Requirements) ● Worst case downtime for destination system © 2014 StreamSets, Inc. What we will calculate: ● Number of tiers ● Exit Batch Sizes ● Channel capacity
  • 25. Calculating Number of Tiers Rule of Thumb One Aggregating Agent (A) can be used with anywhere from 4 to 16 client Agents Considerations ● Must handle projected ingest volume ● Resulting number of tiers should provide for routing, load-balancing and failover requirements Gotchas Load test to ensure that steady state and peak load are addressed with adequate failover capacity © 2014 StreamSets, Inc.
  • 26. Calculating Exit Batch Size Rule of Thumb Exit batch size is same as total exit data volume divided by number of Agents in a tier Considerations ● Having some extra room is good ● Keep contextual routing in mind ● Consider duplication impact when batch sizes are large Gotchas Load test fail-over scenario to ensure near steady-state drain © 2014 StreamSets, Inc.
  • 27. Calculating Channel Capacity Gotchas © 2014 StreamSets, Inc. Source Sink X Source Sink X X Rule of Thumb Equal to worst case data ingest rate sustained over the worst case downstream outage interval Considerations ● Multiple disks will yield better performance ● Channel size impacts the back-pressure buildup in the pipeline You may need more disk space than the physical footprint of the data size
  • 28. To Recap Number of Tiers Calculated with upstream to downstream Agent ration ranging from 4:1 to 16:1. Factor in routing, failover, load-balancing requirements... Exit Batch Size Calculated for steady state data volume exiting the tier, divided by number of Agents in that tier. Factor in contextual routing and duplication due to transient failure impact... Channel Capacity Calculated as worst case ingest rate sustained over the worst case downstream downtime. Factor in number of disks used etc... © 2014 StreamSets, Inc. ...and that’s all there is to it!
  • 29. Some Highlights of Flume ● Flume is suitable for large volume data collection, especially when data is being produced in multiple locations ● Once planned and sized appropriately, Flume will practically run itself without any operational intervention ● Flume provides weak ordering guarantee, i.e., in the absence of failures the data will arrive in the order it was received in the Flume pipeline ● Transactional exchange ensures that Flume never loses any data in transit between Agents. Sinks use transactions to ensure data is not lost at point of ingest or terminal destinations. ● Flume has rich out-of-the box features such as contextual routing, and support for popular data sources and destination systems © 2014 StreamSets, Inc.
  • 30. Things that could be better... ● Handling of poison events ● Ability to tail files ● Ability to handle preset data formats such as JSON, CSV, XML ● Centralized configuration ● Once-only delivery semantics ● ...and more Remember: patches are welcome! © 2014 StreamSets, Inc.
  • 31. Thank You! Contact: ● Email: arvind at streamsets dot com ● Twitter: @aprabhakar More on Flume: ● http://flume.apache.org/ ● User Mailing List: user-subscribe@flume.apache.org ● Developer Mailing List: dev-subscribe@flume.apache.org ● JIRA: https://issues.apache.org/jira/browse/FLUME © 2014 StreamSets, Inc.