Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
© 2015 StreamSets Inc., All rights reserved
Apache Flume
Data Aggregation At Scale
Arvind Prabhakar
© 2015 StreamSets, Inc.
Who am I?
❏  Founder/CTO
Apache Software Foundation
❏  Flume - PMC Chair
❏  Sqoop - PMC Chair
❏  S...
© 2015 StreamSets, Inc.
What is Flume?
Logs
Files
Click
Streams
Sensors
Devices
Database
Logs
Social Data
Streams
Feeds
Ot...
© 2015 StreamSets, Inc.
...for Big Data ecosystem?
“Big data is an all-encompassing term for any collection
of data sets s...
© 2015 StreamSets, Inc.
Physically Distributed Data Sources
●  Many physical sources
that produce data
●  Number of physic...
© 2015 StreamSets, Inc.
“Every two days now we create as much information
as we did from the dawn of civilization up until...
© 2015 StreamSets, Inc.
A Notable Prediction From 2014...
“Breakthrough in
compression technology
will succeed in convert
...
© 2015 StreamSets, Inc.
Ever Changing Structure of Data
●  One of your data centers
upgrade to IPv6
192.168.0.4 fe80::21b:...
© 2015 StreamSets, Inc.
So, from Data Ingestion Perspective:
Massive collection of ever changing physical sources...
Never...
© 2015 StreamSets, Inc.
© 2015 StreamSets, Inc.
Flume to the Rescue!
© 2015 StreamSets, Inc.
Apache Flume
●  Originally designed to be a log
aggregation system by
Cloudera Engineers
●  Evolve...
© 2015 StreamSets, Inc.
A Closer Look at Flume
Agent Agent AgentAgentInput Destination
●  Distributed Pipeline Architectur...
© 2015 StreamSets, Inc.
Anatomy of a Flume Agent
Flume Agent
Source
Sink
Channel
Incoming
Data
Outgoing
Data
Source
●  Acc...
© 2015 StreamSets, Inc.
Transactional Data Exchange
Flume Agent
Source
Sink
Channel
Incoming
Data
Outgoing
Data
Source TX
...
© 2015 StreamSets, Inc.
Routing and Replicating
Flume Agent
Source
Sink 1
Channel 1
Incoming
Data
Outgoing Data
Channel 2
...
© 2015 StreamSets, Inc.
Why Channels?
●  Buffers data and insulates downstream from load spikes
●  Provides persistent sto...
© 2015 StreamSets, Inc.
Use-Case: Log Aggregation
© 2015 StreamSets, Inc.
Starting Out Simple
●  You would like to move your web-
server logs to HDFS
●  Let’s assume there ...
© 2015 StreamSets, Inc.
Adding a Single Flume Agent
Advantages
●  Insulation from HDFS downtime
●  Quick offload of logs f...
© 2015 StreamSets, Inc.
Adding Two Flume Agents
Advantages
●  Redundancy and Availability
●  Better handling of downstream...
© 2015 StreamSets, Inc.
Handling a Server Farm
A Converging Flow
●  Traffic is aggregated by Tier-2 and
Tier-3 before bein...
© 2015 StreamSets, Inc.
Data Volume Per Agent
Batch Size Variation per Agent
●  Event volume is least in the
outermost tie...
© 2015 StreamSets, Inc.
Data Volume Per Tier
Batch Size Variation per Tier
●  In steady state, all tiers carry
same event ...
© 2015 StreamSets, Inc.
Planning and Sizing Flume Topology
for Log-Aggregation Use-Case
© 2015 StreamSets, Inc.
Planning and Sizing Your Topology
What we know:
●  Number of Web Servers
●  Log volume per Web
Ser...
© 2015 StreamSets, Inc.
Calculating Number of Tiers
Rule of Thumb
One Aggregating Agent (A) can be used with
anywhere from...
© 2015 StreamSets, Inc.
Calculating Exit Batch Size
Rule of Thumb
Exit batch size is same as total exit data volume
divide...
© 2015 StreamSets, Inc.
Calculating Channel Capacity
Source
Sink
X
Source
Sink
X
X
Rule of Thumb
Equal to worst case data ...
© 2015 StreamSets, Inc.
To Recap
Calculated with upstream to downstream Agent ratio ranging from 4:1 to
16:1. Factor in ro...
© 2015 StreamSets, Inc.
© 2015 StreamSets, Inc.
Some Highlights of Flume
●  Flume is suitable for large volume data collection, especially when
da...
© 2015 StreamSets, Inc.
Things that could be better...
●  Handling of poison events
●  Ability to tail files
●  Ability to...
© 2015 StreamSets, Inc.
Thank You!
Contact:
●  Email: arvind at streamsets dot com
●  Twitter: @aprabhakar
More on Flume:
...
Upcoming SlideShare
Loading in …5
×

Apache Flume - DataDayTexas

5,517 views

Published on

Learn how you can easily use Apache Flume to solve your complex ingestion problems.

Published in: Software
  • Hi there! Get Your Professional Job-Winning Resume Here - Check our website! http://bit.ly/resumpro
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • hi all the slides are very nice and hands-on part is also required for better understanding of apache flume follow the link http://beyondcorner.com/learn-apache-flume/
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Apache Flume - DataDayTexas

  1. 1. © 2015 StreamSets Inc., All rights reserved Apache Flume Data Aggregation At Scale Arvind Prabhakar
  2. 2. © 2015 StreamSets, Inc. Who am I? ❏  Founder/CTO Apache Software Foundation ❏  Flume - PMC Chair ❏  Sqoop - PMC Chair ❏  Storm - PMC, Committer ❏  MetaModel - Mentor ❏  Sentry - Mentor ❏  NiFi - Mentor ❏  ASF Member Previously... ❏  Cloudera ❏  Informatica Arvind Prabhakar
  3. 3. © 2015 StreamSets, Inc. What is Flume? Logs Files Click Streams Sensors Devices Database Logs Social Data Streams Feeds Other Raw Storage (HDFS, S3) EDW, NoSQL (Hive, Impala, HBase, Cassandra) Search (Solr, ElasticSearch) Enterprise Data Infrastructure Apache Flume is a continuous data ingestion system that is... ●  open-source, ●  reliable, ●  scalable, ●  manageable, ●  customizable, ...and designed for Big Data ecosystem.
  4. 4. © 2015 StreamSets, Inc. ...for Big Data ecosystem? “Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.” Big Data from a Data Ingestion Perspective ●  Logical Data Sources are physically distributed ●  Data production is continuous / never ending ●  Data structure and semantics change without notice
  5. 5. © 2015 StreamSets, Inc. Physically Distributed Data Sources ●  Many physical sources that produce data ●  Number of physical sources changes constantly ●  Sources may exist in different governance zones, data centers, continents...
  6. 6. © 2015 StreamSets, Inc. “Every two days now we create as much information as we did from the dawn of civilization up until 2003” -  Eric Schmidt, 2010 Continuous Data Production ●  Weather ●  Traffic ●  Automobiles ●  Trains ●  Airplanes ●  Geological/Seismic ●  Oceanographic ●  Smart Phones ●  Health Accessories ●  Medical Devices ●  Home Automation ●  Digital Cameras ●  Social Media ●  Geolocation ●  Shop Floor Sensors ●  Network Activity ●  Industry Appliances ●  Security/Surveillance ●  Server Workloads ●  Digital Telephony ●  Bio-simulations...
  7. 7. © 2015 StreamSets, Inc. A Notable Prediction From 2014... “Breakthrough in compression technology will succeed in convert #bigdata to regular data #2014BigDataPredictions” http://bit.ly/1x6cz9M 1:41 PM - 20 Dec 2013
  8. 8. © 2015 StreamSets, Inc. Ever Changing Structure of Data ●  One of your data centers upgrade to IPv6 192.168.0.4 fe80::21b:21ff:fe83:90fa M0137: User {jonsmith} granted access to {accounts} M0137: [jonsmith] granted access to [sys.accounts] { “first”:”jon”, “last”:”smith”, “add1”:”123 Main St.”, “add2”:”Ste - 4”, “city”:”Little Town”, “state”:”AZ”, “zip”: “12121” } { “first”:”jon”, “last”:”smith”, “add1”:”123 Main St.”, “add2”:”Ste - 4”, “city”:”Little Town”, “state”:”AZ”, “zip”: “12121”, “phone”: “(408) 555-1212” } ●  Application developer changes logs (again) ●  JSON data may contain more attributes than expected
  9. 9. © 2015 StreamSets, Inc. So, from Data Ingestion Perspective: Massive collection of ever changing physical sources... Never ending data production... Continuously evolving data structures and semantics...
  10. 10. © 2015 StreamSets, Inc.
  11. 11. © 2015 StreamSets, Inc. Flume to the Rescue!
  12. 12. © 2015 StreamSets, Inc. Apache Flume ●  Originally designed to be a log aggregation system by Cloudera Engineers ●  Evolved to handle any type of streaming event data ●  Low-cost of installation, operation and maintenance ●  Highly customizable and extendable
  13. 13. © 2015 StreamSets, Inc. A Closer Look at Flume Agent Agent AgentAgentInput Destination ●  Distributed Pipeline Architecture ●  Optimized for commonly used data sources and destinations ●  Built in support for contextual routing ●  Fully customizable and extendable eg Web Logs eg HDFS
  14. 14. © 2015 StreamSets, Inc. Anatomy of a Flume Agent Flume Agent Source Sink Channel Incoming Data Outgoing Data Source ●  Accepts incoming Data ●  Scales as required ●  Writes data to Channel Sink ●  Removes data from Channel ●  Sends data to downstream Agent or Destination Channel ●  Stores data in the order received
  15. 15. © 2015 StreamSets, Inc. Transactional Data Exchange Flume Agent Source Sink Channel Incoming Data Outgoing Data Source TX Sink TX ●  Source uses transactions to write to the channel Upstream Sink TX ●  Sink uses transactions to remove data from the channel ●  Sink transaction commits only after successful transfer of data ●  This ensures no data loss in Flume pipeline
  16. 16. © 2015 StreamSets, Inc. Routing and Replicating Flume Agent Source Sink 1 Channel 1 Incoming Data Outgoing Data Channel 2 Sink 2 Outgoing Data ●  Source can replicate or multiplex data across many channels ●  Metadata headers can be used to do contextual selection of channels ●  Channels can be drained by different sinks to different destinations or pipelines
  17. 17. © 2015 StreamSets, Inc. Why Channels? ●  Buffers data and insulates downstream from load spikes ●  Provides persistent store for data in case the process restarts ●  Provides flow ordering* and transactional guarantees
  18. 18. © 2015 StreamSets, Inc. Use-Case: Log Aggregation
  19. 19. © 2015 StreamSets, Inc. Starting Out Simple ●  You would like to move your web- server logs to HDFS ●  Let’s assume there are only 3 web servers at the time of launch ●  Ad-hoc solution will likely suffice! Challenges ●  How do you manage your output paths on HDFS? ●  How do you maintain your client code in face of changing environment as well as requirements?
  20. 20. © 2015 StreamSets, Inc. Adding a Single Flume Agent Advantages ●  Insulation from HDFS downtime ●  Quick offload of logs from Web Server machines ●  Better Network utilization Challenges ●  What if the Flume node goes down? ●  Can one Flume node accommodate all load from Web Servers?
  21. 21. © 2015 StreamSets, Inc. Adding Two Flume Agents Advantages ●  Redundancy and Availability ●  Better handling of downstream failures ●  Automatic load balancing and failover Challenges ●  What happens when new Web Servers are added? ●  Can two Flume Agents keep up with all the load from more Web Servers?
  22. 22. © 2015 StreamSets, Inc. Handling a Server Farm A Converging Flow ●  Traffic is aggregated by Tier-2 and Tier-3 before being put into destination system ●  Closer a tier is to the destination, larger the batch size it delivers downstream ●  Optimized handling of destination systems
  23. 23. © 2015 StreamSets, Inc. Data Volume Per Agent Batch Size Variation per Agent ●  Event volume is least in the outermost tier ●  Event volume increases as the flow converges ●  Event volume is highest in the innermost tier
  24. 24. © 2015 StreamSets, Inc. Data Volume Per Tier Batch Size Variation per Tier ●  In steady state, all tiers carry same event volume ●  Transient variations in flow are absorbed and ironed out by channels ●  Load spikes are handled smoothly without overwhelming the infrastructure
  25. 25. © 2015 StreamSets, Inc. Planning and Sizing Flume Topology for Log-Aggregation Use-Case
  26. 26. © 2015 StreamSets, Inc. Planning and Sizing Your Topology What we know: ●  Number of Web Servers ●  Log volume per Web Server per unit time ●  Destination System and layout (Routing Requirements) ●  Worst case downtime for destination system What we will calculate: ●  Number of tiers ●  Exit Batch Sizes ●  Channel capacity
  27. 27. © 2015 StreamSets, Inc. Calculating Number of Tiers Rule of Thumb One Aggregating Agent (A) can be used with anywhere from 4 to 16 client Agents Considerations ●  Must handle projected ingest volume ●  Resulting number of tiers should provide for routing, load-balancing and failover requirements Gotchas Load test to ensure that steady state and peak load are addressed with adequate failover capacity
  28. 28. © 2015 StreamSets, Inc. Calculating Exit Batch Size Rule of Thumb Exit batch size is same as total exit data volume divided by number of Agents in a tier Considerations Gotchas Load test fail-over scenario to ensure near steady-state drain ●  Having some extra room is good ●  Keep contextual routing in mind ●  Consider duplication impact when batch sizes are large
  29. 29. © 2015 StreamSets, Inc. Calculating Channel Capacity Source Sink X Source Sink X X Rule of Thumb Equal to worst case data ingest rate sustained over the worst case downstream outage interval Considerations ●  Multiple disks will yield better performance ●  Channel size impacts the back-pressure buildup in the pipeline You may need more disk space than the physical footprint of the data size Gotchas
  30. 30. © 2015 StreamSets, Inc. To Recap Calculated with upstream to downstream Agent ratio ranging from 4:1 to 16:1. Factor in routing, failover, load-balancing requirements... Number of Tiers Calculated for steady state data volume exiting the tier, divided by number of Agents in that tier. Factor in contextual routing and duplication due to transient failure impact... Exit Batch Size Calculated as worst case ingest rate sustained over the worst case downstream downtime. Factor in number of disks used etc... Channel Capacity ...and that’s all there is to it!
  31. 31. © 2015 StreamSets, Inc.
  32. 32. © 2015 StreamSets, Inc. Some Highlights of Flume ●  Flume is suitable for large volume data collection, especially when data is being produced in multiple locations ●  Once planned and sized appropriately, Flume will practically run itself without any operational intervention ●  Flume provides weak ordering guarantee, i.e., in the absence of failures the data will arrive in the order it was received in the Flume pipeline ●  Transactional exchange ensures that Flume never loses any data in transit between Agents. Sinks use transactions to ensure data is not lost at point of ingest or terminal destinations. ●  Flume has rich out-of-the box features such as contextual routing, and support for popular data sources and destination systems
  33. 33. © 2015 StreamSets, Inc. Things that could be better... ●  Handling of poison events ●  Ability to tail files ●  Ability to handle preset data formats such as JSON, CSV, XML ●  Centralized configuration ●  Once-only delivery semantics ●  ...and more Remember: patches are welcome!
  34. 34. © 2015 StreamSets, Inc. Thank You! Contact: ●  Email: arvind at streamsets dot com ●  Twitter: @aprabhakar More on Flume: ●  http://flume.apache.org/ ●  User Mailing List: user-subscribe@flume.apache.org ●  Developer Mailing List: dev-subscribe@flume.apache.org ●  JIRA: https://issues.apache.org/jira/browse/FLUME

×