Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Data Aggregation At Scale Using Apache Flume

These are slides from my presentation at BigData TechCon SF on 10/29/2014. Hope you find these useful!

  • Be the first to comment

Data Aggregation At Scale Using Apache Flume

  1. 1. Apache Flume Data Aggregation At Scale Arvind Prabhakar © 2014 StreamSets Inc., All rights reserved © 2014 StreamSets, Inc.
  2. 2. Who am I? © 2014 StreamSets, Inc. ❏ Founder/CTO Apache Software Foundation ❏ Flume - PMC Chair ❏ Sqoop - PMC Chair ❏ Storm - PMC, Committer ❏ MetaModel - Mentor ❏ Sentry - Mentor ❏ ASF Member Previously... ❏ Cloudera ❏ Informatica
  3. 3. What is Flume? © 2014 StreamSets, Inc. Logs Files Click Streams Sensors Devices Database Logs Social Data Streams Feeds Other Raw Storage (HDFS, S3) EDW, NoSQL (Hive, Impala, HBase, Cassandra) Search (Solr, ElasticSearch) Enterprise Data Infrastructure Apache Flume is a continuous data ingestion system that is... ● open-source, ● reliable, ● scalable, ● manageable, ● customizable, ...and designed for Big Data ecosystem.
  4. 4. ...for Big Data ecosystem? “Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.” Big Data from a Data Ingestion Perspective ● Logical Data Sources are physically distributed ● Data production is continuous / never ending ● Data structure and semantics change without notice © 2014 StreamSets, Inc.
  5. 5. Physically Distributed Data Sources © 2014 StreamSets, Inc. ● Many physical sources that produce data ● Number of physical sources changes constantly ● Sources may exist in different governance zones, data centers, continents...
  6. 6. Continuous Data Production “Every two days now we create as much information as we did from the dawn of civilization up until 2003” © 2014 StreamSets, Inc. - Eric Schmidt, 2010 ● Weather ● Traffic ● Automobiles ● Trains ● Airplanes ● Geological/Seismic ● Oceanographic ● Smart Phones ● Health Accessories ● Medical Devices ● Home Automation ● Digital Cameras ● Social Media ● Geolocation ● Shop Floor Sensors ● Network Activity ● Industry Appliances ● Security/Surveillance ● Server Workloads ● Digital Telephony ● Bio-simulations...
  7. 7. Ever Changing Structure of Data ● One of your data centers upgrade to IPv6 © 2014 StreamSets, Inc. fe80::21b:21ff:fe83:90fa M0137: User {jonsmith} granted access to {accounts} M0137: [jonsmith] granted access to [sys.accounts] { “first”:”jon”, “last”:”smith”, “add1”:”123 Main St.”, “add2”:”Ste - 4”, “city”:”Little Town”, “state”:”AZ”, “zip”: “12121” } { “first”:”jon”, “last”:”smith”, “add1”:”123 Main St.”, “add2”:”Ste - 4”, “city”:”Little Town”, “state”:”AZ”, “zip”: “12121”, “phone”: “(408) 555-1212” } ● Application developer changes logs (again) ● JSON data may contain more attributes than expected
  8. 8. So, from Data Ingestion Perspective: Massive collection of ever changing physical sources... Never ending data production... Data structure and semantics evolve continuously... © 2014 StreamSets, Inc.
  9. 9. © 2014 StreamSets, Inc. Flume to the Rescue!
  10. 10. Apache Flume ● Originally designed to be a log aggregation system by Cloudera Engineers ● Evolved to handle any type of streaming event data ● Low-cost of installation, operation and maintenance ● Highly customizable and extendable © 2014 StreamSets, Inc.
  11. 11. A Closer Look at Flume Input Agent Agent Agent Agent Destination ● Distributed Pipeline Architecture ● Optimized for commonly used data sources and destinations ● Built in support for contextual routing ● Fully customizable and extendable © 2014 StreamSets, Inc.
  12. 12. Anatomy of a Flume Agent © 2014 StreamSets, Inc. Flume Agent Source Sink Channel Incoming Data Outgoing Data Source ● Accepts incoming Data ● Scales as required ● Writes data to Channel Sink ● Removes data from Channel ● Sends data to downstream Agent or Destination Channel ● Stores data in the order received
  13. 13. Transactional Data Exchange Upstream Sink TX © 2014 StreamSets, Inc. Flume Agent Source Sink Channel Incoming Data Outgoing Data Source TX Sink TX ● Source uses transactions to write to the channel ● Sink uses transactions to remove data from the channel ● Sink transaction commits only after successful transfer of data ● This ensures no data loss in Flume pipeline
  14. 14. Routing and Replicating © 2014 StreamSets, Inc. Flume Agent Source Sink 1 Channel 1 Incoming Data Outgoing Data Channel 2 Sink 2 Outgoing Data ● Source can replicate or multiplex data across many channels ● Metadata headers can be used to do contextual selection of channels ● Channels can be drained by different sinks to different destinations or pipelines
  15. 15. Why Channels? ● Buffers data and insulates downstream from load spikes ● Provides persistent store for data in case the process restarts ● Provides flow ordering* and transactional guarantees © 2014 StreamSets, Inc.
  16. 16. © 2014 StreamSets, Inc. Use-Case: Log Aggregation
  17. 17. Starting Out Simple ● You would like to move your web-server © 2014 StreamSets, Inc. logs to HDFS ● Let’s assume there are only 3 web servers at the time of launch ● Ad-hoc solution will likely suffice! Challenges ● How do you manage your output paths on HDFS? ● How do you maintain your client code in face of changing environment as well as requirements?
  18. 18. Adding a Single Flume Agent Advantages ● Insulation from HDFS downtime ● Quick offload of logs from Web Server machines ● Better Network utilization Challenges ● What if the Flume node goes down? ● Can one Flume node accommodate all load from Web Servers? © 2014 StreamSets, Inc.
  19. 19. Adding Two Flume Agents Advantages ● Redundancy and Availability ● Better handling of downstream failures ● Automatic load balancing and failover Challenges ● What happens when new Web Servers are added? ● Can two Flume Agents keep up with all the load from more Web Servers? © 2014 StreamSets, Inc.
  20. 20. Handling a Server Farm © 2014 StreamSets, Inc. A Converging Flow ● Traffic is aggregated by Tier-2 and Tier-3 before being put into destination system ● Closer a tier is to the destination, larger the batch size it delivers downstream ● Optimized handling of destination systems
  21. 21. Data Volume Per Agent © 2014 StreamSets, Inc. Batch Size Variation per Agent ● Event volume is least in the outermost tier ● Event volume increases as the flow converges ● Event volume is highest in the innermost tier
  22. 22. Data Volume Per Tier © 2014 StreamSets, Inc. Batch Size Variation per Tier ● In steady state, all tiers carry same event volume ● Transient variations in flow are absorbed and ironed out by channels ● Load spikes are handled smoothly without overwhelming the infrastructure
  23. 23. Planning and Sizing Flume Topology for Log-Aggregation Use-Case © 2014 StreamSets, Inc.
  24. 24. Planning and Sizing Your Topology What we need to know: ● Number of Web Servers ● Log volume per Web Server per unit time ● Destination System and layout (Routing Requirements) ● Worst case downtime for destination system © 2014 StreamSets, Inc. What we will calculate: ● Number of tiers ● Exit Batch Sizes ● Channel capacity
  25. 25. Calculating Number of Tiers Rule of Thumb One Aggregating Agent (A) can be used with anywhere from 4 to 16 client Agents Considerations ● Must handle projected ingest volume ● Resulting number of tiers should provide for routing, load-balancing and failover requirements Gotchas Load test to ensure that steady state and peak load are addressed with adequate failover capacity © 2014 StreamSets, Inc.
  26. 26. Calculating Exit Batch Size Rule of Thumb Exit batch size is same as total exit data volume divided by number of Agents in a tier Considerations ● Having some extra room is good ● Keep contextual routing in mind ● Consider duplication impact when batch sizes are large Gotchas Load test fail-over scenario to ensure near steady-state drain © 2014 StreamSets, Inc.
  27. 27. Calculating Channel Capacity Gotchas © 2014 StreamSets, Inc. Source Sink X Source Sink X X Rule of Thumb Equal to worst case data ingest rate sustained over the worst case downstream outage interval Considerations ● Multiple disks will yield better performance ● Channel size impacts the back-pressure buildup in the pipeline You may need more disk space than the physical footprint of the data size
  28. 28. To Recap Number of Tiers Calculated with upstream to downstream Agent ration ranging from 4:1 to 16:1. Factor in routing, failover, load-balancing requirements... Exit Batch Size Calculated for steady state data volume exiting the tier, divided by number of Agents in that tier. Factor in contextual routing and duplication due to transient failure impact... Channel Capacity Calculated as worst case ingest rate sustained over the worst case downstream downtime. Factor in number of disks used etc... © 2014 StreamSets, Inc. ...and that’s all there is to it!
  29. 29. Some Highlights of Flume ● Flume is suitable for large volume data collection, especially when data is being produced in multiple locations ● Once planned and sized appropriately, Flume will practically run itself without any operational intervention ● Flume provides weak ordering guarantee, i.e., in the absence of failures the data will arrive in the order it was received in the Flume pipeline ● Transactional exchange ensures that Flume never loses any data in transit between Agents. Sinks use transactions to ensure data is not lost at point of ingest or terminal destinations. ● Flume has rich out-of-the box features such as contextual routing, and support for popular data sources and destination systems © 2014 StreamSets, Inc.
  30. 30. Things that could be better... ● Handling of poison events ● Ability to tail files ● Ability to handle preset data formats such as JSON, CSV, XML ● Centralized configuration ● Once-only delivery semantics ● ...and more Remember: patches are welcome! © 2014 StreamSets, Inc.
  31. 31. Thank You! Contact: ● Email: arvind at streamsets dot com ● Twitter: @aprabhakar More on Flume: ● ● User Mailing List: ● Developer Mailing List: ● JIRA: © 2014 StreamSets, Inc.

    Be the first to comment

    Login to see the comments

  • ssuser4a734e

    Oct. 31, 2014
  • allenjoe1986

    Nov. 2, 2014
  • GioeleLuchetti

    Nov. 10, 2014
  • ssuserde8176

    Nov. 15, 2014
  • dausi1

    Nov. 24, 2014
  • hemanthabbina

    Dec. 15, 2014
  • bsj45

    Jan. 13, 2015
  • allenkk

    Jan. 24, 2015
  • prayagupd

    Sep. 18, 2015
  • ArunWagle

    Nov. 13, 2015
  • malloc

    Apr. 4, 2016
  • ForestLin2

    Apr. 28, 2016
  • oeegee

    Jul. 13, 2016
  • JimMcLean6

    Jul. 14, 2016
  • VietDo1

    Jan. 20, 2017
  • manasasista

    Feb. 28, 2017
  • SpandanaReddy28

    Mar. 30, 2017
  • MichaelLi100

    Apr. 1, 2017
  • RahulDeshmukh38

    Jul. 28, 2019
  • ksmin23

    Mar. 16, 2020

These are slides from my presentation at BigData TechCon SF on 10/29/2014. Hope you find these useful!


Total views


On Slideshare


From embeds


Number of embeds