
Apache Flume

An introduction to Apache Flume that comes from Hadoop Administrator Training delivered by GetInData.

Apache Flume is a distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data. By reading these slides, you will learn about Apache Flume: its motivation, most important features, architecture, reliability guarantees, agent configuration, integration with the Apache Hadoop ecosystem and more.

Published in: Data & Analytics

Apache Flume

  1. © Copyright 2014 GetInData. All rights reserved. Not to be reproduced without prior written consent. Apache Flume
  2. Motivation
  3. Motivation: You have a lot of servers and systems ■ network devices ■ operating systems ■ web servers ■ applications
  4. Motivation: They generate a lot of logs and other data
  5. Motivation: I have a business idea for how to use this data!
  6. Motivation: You have a Hadoop cluster running
  7. Motivation: You want to move the logs to Hadoop
  8. Traditional Solutions ■ Own scripts ● Probably a combination of several tools (names not preserved in this transcript) ● Run by cron or started/stopped manually ● Hardcoded or missing configuration ● Tightly coupled with the data that is transferred
  9. Complications ■ High delays ■ Limited manageability ● Compression, encryption, various file formats ● Throughput ● Configuration and monitoring ■ Limited scalability ● Data explosion, failover, load balancing
  10. Apache Flume ■ Aims to solve this problem! ■ It can move large amounts of streaming event data from one place to another ● e.g. from web servers to a Hadoop cluster
  11. Overview
  12. Various systems that constantly generate data in the form of events
  13. Flume Agent ■ Installed on each node ■ Collects events
  14. Interceptor ■ Filters useless events ■ Decorates events by adding metadata ● e.g. timestamp, hostname, UUID, static markers
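      A minimal sketch of how such interceptors might be attached to a source in Flume's properties format. The agent and component names (`a1`, `r1`, `i1` to `i4`), the `datacenter` header and the regex are illustrative assumptions, not taken from the slides.

      ```properties
      # Hypothetical agent a1 with source r1; interceptors run in the listed order
      a1.sources.r1.interceptors = i1 i2 i3 i4

      # Adds a "timestamp" header with the processing time in milliseconds
      a1.sources.r1.interceptors.i1.type = timestamp

      # Adds the agent's hostname under the (assumed) header name "hostname"
      a1.sources.r1.interceptors.i2.type = host
      a1.sources.r1.interceptors.i2.hostHeader = hostname
      a1.sources.r1.interceptors.i2.useIP = false

      # Adds a static marker header to every event
      a1.sources.r1.interceptors.i3.type = static
      a1.sources.r1.interceptors.i3.key = datacenter
      a1.sources.r1.interceptors.i3.value = dc1

      # Filters out events whose body matches the regex
      a1.sources.r1.interceptors.i4.type = regex_filter
      a1.sources.r1.interceptors.i4.regex = .*DEBUG.*
      a1.sources.r1.interceptors.i4.excludeEvents = true
      ```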
  15. Encrypt: Encrypts events in a file on disk
  16. Sends events to the next-hop Flume agent
  17. Compress: Compression is supported
  18. Events can be multiplexed to multiple agents (to spread the load) … (diagram: events A and B are routed to different next-hop agents)
  19. … or replicated for redundancy (diagram: events A,B are copied to every next-hop agent)
  20. … to survive a permanent failure of an agent, disk or node.
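      Replication and multiplexing are typically expressed with channel selectors on a source: the selector picks channels, and each channel is then drained by a sink that points at a different next-hop agent. The sketch below is only illustrative; the names `a1`, `r1`, `r2`, `c1`, `c2` and the `stream` header with values `A` and `B` are assumptions.

      ```properties
      a1.sources = r1 r2
      a1.channels = c1 c2

      # r1: the replicating selector (the default) copies every event to both channels
      a1.sources.r1.channels = c1 c2
      a1.sources.r1.selector.type = replicating

      # r2: the multiplexing selector routes each event by the value of a header
      a1.sources.r2.channels = c1 c2
      a1.sources.r2.selector.type = multiplexing
      a1.sources.r2.selector.header = stream
      a1.sources.r2.selector.mapping.A = c1
      a1.sources.r2.selector.mapping.B = c2
      # Events with any other (or no) header value go to c1
      a1.sources.r2.selector.default = c1
      ```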
  21. Events can also be delivered in “failover” mode where …
  22. … in case of a failure of the next-hop agent …
  23. … we try the next Agent(s) on a prioritized list.
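      Failover delivery can be sketched with a sink group that uses the failover sink processor. The names `a1`, `g1`, `k1`, `k2` and the priority values are assumptions; each sink would be, for example, an Avro sink pointing at a different next-hop agent.

      ```properties
      a1.sinks = k1 k2
      a1.sinkgroups = g1
      a1.sinkgroups.g1.sinks = k1 k2
      a1.sinkgroups.g1.processor.type = failover

      # A larger number means a higher priority, so k1 is tried first
      a1.sinkgroups.g1.processor.priority.k1 = 10
      a1.sinkgroups.g1.processor.priority.k2 = 5

      # Upper limit, in milliseconds, of the backoff applied to a failed sink
      a1.sinkgroups.g1.processor.maxpenalty = 10000
      ```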
  24. Events can also be load-balanced (round robin, random and custom) …
  25. … and go to different next-hop Agents to spread the load
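      Load balancing uses the same sink-group mechanism with a different processor type. This is a sketch under the same naming assumptions (`a1`, `g1`, `k1`, `k2`).

      ```properties
      a1.sinks = k1 k2
      a1.sinkgroups = g1
      a1.sinkgroups.g1.sinks = k1 k2
      a1.sinkgroups.g1.processor.type = load_balance

      # round_robin (the default) or random; a custom selector class can also be named here
      a1.sinkgroups.g1.processor.selector = round_robin

      # Temporarily skip sinks that have failed
      a1.sinkgroups.g1.processor.backoff = true
      ```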
  26. Events can be stored ■ in memory (for performance) ■ on disk (for durability)
  27. Events can be finally transferred to HDFS ■ Multiple file formats e.g. Text, JSON, Avro ■ Compression supported ■ Flexible names of HDFS paths
  28. However, ■ Many destinations are supported ■ One can implement custom ones
  29. Flume ■ Distributed ● Agents installed on many machines ■ Scalable ● Add more machines to transfer more events ■ Reliable ● Durable storage, failover and/or replication ■ Manageable ● Easy to install, configure, reconfigure and run
  30. Flume ■ Nicely integrated with the Hadoop Ecosystem ● Various destinations e.g. HDFS, HBase ● Various file formats e.g. Avro, SequenceFile ■ Extensible ● Possibility to add new functionality e.g. a source or destination for events
  31. Architecture
  32. Event ■ Unit of data transported by Flume ■ Consists of Headers and Payload ■ Generally small ■ You can add your own headers e.g. hostname, timestamp
  33. Flume Agent ■ Responsible for transferring events ■ Runs in a JVM ■ Consists of Source(s), Channel(s) and Sink(s) (diagram: Source → Channel → Sink)
  34. Source ■ Collects events and forwards them to channels ● HTTP, JMS, RPC, NetCat ● Exec ● Spooling directory
  35. Exec Source ■ Runs a given Unix command on startup ● The command should run continuously and produce data on standard output ■ If the process exits, the source also exits and will NOT produce any further data
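      A minimal Exec Source sketch. The agent, source and channel names and the log path are assumptions chosen for illustration.

      ```properties
      a1.sources = r1
      a1.channels = c1

      # Runs the command once at startup and turns each output line into an event
      a1.sources.r1.type = exec
      a1.sources.r1.command = tail -F /var/log/app/app.log
      a1.sources.r1.channels = c1
      ```

      Because `tail -F` keeps running, it fits the requirement that the command continuously produces data.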
  36. Spooling Directory Source ■ Watches a specified directory for new files ■ Parses events out of new files as they appear ■ After a file has been fully processed, it is renamed to indicate completion (or optionally deleted)
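      A Spooling Directory Source might be configured roughly like this. The directory path and component names are assumptions.

      ```properties
      a1.sources = r1
      a1.channels = c1

      a1.sources.r1.type = spooldir
      # Directory watched for new, complete files (files must not change after being placed here)
      a1.sources.r1.spoolDir = /var/log/app/spool
      # Suffix appended to a file name once the file has been fully processed
      a1.sources.r1.fileSuffix = .COMPLETED
      # "never" keeps processed files; "immediate" deletes them instead of keeping them renamed
      a1.sources.r1.deletePolicy = never
      a1.sources.r1.channels = c1
      ```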
  37. Channel ■ Buffers incoming events until they are extracted by Sinks ■ Tradeoff between durability and throughput ● Memory ● File ● JDBC
  38. Memory Channel ■ Events stored in an in-memory queue ■ Configurable capacity ● The maximum number of events and/or bytes in memory ■ Nondurable, but faster
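      The capacity limits of a Memory Channel can be sketched as follows. The numbers are arbitrary placeholders and the names `a1` and `c1` are assumptions.

      ```properties
      a1.channels = c1
      a1.channels.c1.type = memory

      # Maximum number of events held in the channel
      a1.channels.c1.capacity = 10000
      # Maximum number of events per put/take transaction
      a1.channels.c1.transactionCapacity = 1000
      # Maximum total bytes of event bodies held in memory
      a1.channels.c1.byteCapacity = 800000
      # Percentage of byteCapacity reserved as a buffer for header data
      a1.channels.c1.byteCapacityBufferPercentage = 20
      ```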
  39. File Channel ■ Events stored in files on disk ● Durable ■ Flushes to disk at the end of each transaction ● Supports encryption ■ Configurable capacity
  40. File Channel ■ The more disks ● the better the performance ● the higher the capacity ■ Can be limited by the amount of memory for the in-memory queue that keeps pointers to all events stored in log files
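      A File Channel spread over several disks might look like this. The mount points, names and capacity are assumptions for illustration.

      ```properties
      a1.channels = c1
      a1.channels.c1.type = file

      # Directory for the checkpoint of the in-memory queue of event pointers
      a1.channels.c1.checkpointDir = /mnt/disk1/flume/checkpoint
      # Comma-separated data directories; placing them on separate disks spreads the I/O
      a1.channels.c1.dataDirs = /mnt/disk2/flume/data,/mnt/disk3/flume/data

      # Maximum number of events the channel can hold
      a1.channels.c1.capacity = 1000000
      ```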
  41. Sink ■ Removes events from a Channel and forwards them to their next destination ● HDFS, HBase, Solr, ElasticSearch ● File, Logger ● Flume Agent
  42. HDFS Sink ■ Writes events to HDFS ● Flexible naming of HDFS paths ■ Multiple file formats are supported e.g. Text, Avro
  43. HDFS Sink ■ Rollover properties for files generated by HDFS Sink ● Number of seconds to wait before rolling a file (0 deactivates this feature) ● File size, in bytes, that triggers a roll of a file (0 deactivates this feature) ● Number of events written to a file before rolling it (0 deactivates this feature) ● Timeout after which inactive files get closed (0 deactivates this feature) ● Number of events written to a file before they are flushed to HDFS
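      The five rollover settings described above appear to correspond to the following HDFS Sink properties in the Flume user guide; the slide's own property names were not preserved, so treat the mapping as an assumption. Values, the path and the names `a1`, `k1`, `c1` are placeholders.

      ```properties
      a1.sinks = k1
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.channel = c1

      # Escape sequences such as %Y-%m-%d need a timestamp header on each event
      a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
      a1.sinks.k1.hdfs.filePrefix = events

      # Seconds to wait before rolling a file (0 = never roll based on time)
      a1.sinks.k1.hdfs.rollInterval = 600
      # File size in bytes that triggers a roll (0 = never roll based on size)
      a1.sinks.k1.hdfs.rollSize = 134217728
      # Number of events written before rolling (0 = never roll based on count)
      a1.sinks.k1.hdfs.rollCount = 0
      # Seconds after which an inactive file is closed (0 = disabled)
      a1.sinks.k1.hdfs.idleTimeout = 60
      # Number of events written to a file before it is flushed to HDFS
      a1.sinks.k1.hdfs.batchSize = 1000
      ```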
  44. HDFS Sink ■ Rolling files will generate many small files ● Need to compact them to avoid an explosion of HDFS metadata ■ Often, you also want to deduplicate, filter and split events
  45. Avro Source And Sink (diagram: Source → Channel → Avro Sink on one agent, Avro Source → Channel → Sink on the next) ■ Avro Sink: sends a batch of Avro events to a configured hostname:port
  46. Avro Source And Sink ■ Avro Source: listens for events on a given port ■ Avro Sink: sends a batch of Avro events to a configured hostname:port
  47. Avro Source And Sink ■ Avro Source: listens for events on a given port ■ Avro Sink: sends a batch of Avro events to a configured hostname:port
  48. Avro Source And Sink ■ Avro Source: listens for events on a given port ■ Avro Sink: sends a batch of Avro events to a configured hostname:port ■ Optionally: Compress, Encrypt
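      A two-agent hop over Avro could be sketched as below, with the sending agent `a1` and the receiving agent `a2` in one file. The agent names, hostname and port are assumptions; compression must be enabled with the same type on both ends.

      ```properties
      # Sending agent a1: Avro Sink forwards batches to the next hop
      a1.sinks = k1
      a1.sinks.k1.type = avro
      a1.sinks.k1.channel = c1
      a1.sinks.k1.hostname = collector.example.com
      a1.sinks.k1.port = 4545
      # Number of events sent per batch
      a1.sinks.k1.batch-size = 100
      a1.sinks.k1.compression-type = deflate

      # Receiving agent a2: Avro Source listens on the matching port
      a2.sources = r1
      a2.sources.r1.type = avro
      a2.sources.r1.channels = c1
      a2.sources.r1.bind = 0.0.0.0
      a2.sources.r1.port = 4545
      a2.sources.r1.compression-type = deflate
      ```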
  49. Reliability ■ Durable channels ● Survive agent failures, machine restarts and other non-disk-related failures ■ Redundant paths in a workflow topology ● Survive the failure of a node ● Achieved via replication or failover
  50. Reliability ■ Sufficient capacity of the channels ● Minimizes the back pressure on earlier points in the flow ● Some sources might not be able to resend the data e.g. ■ Exec Source does not handle failures and might lose data ■ Spooling Directory Source offers reliability guarantees
  51. Reliability (diagram: Source → Channel → Avro Sink on the first agent, Avro Source → Channel → Sink on the second; the first channel holds events D, C, B, A) ■ Step 1: Start the transaction (first agent)
  52. Reliability ■ Step 2: Take a batch of events (B, A; D, C stay in the first channel)
  53. Reliability ■ Step 3: Send a batch of events (B, A)
  54. Reliability ■ Step 4: Start the transaction (second agent)
  55. Reliability ■ Step 5: Put events into a channel (B, A)
  56. Reliability ■ Step 6: Stop the transaction (second agent)
  57. Reliability ■ Step 7: Stop the transaction (first agent; B, A are now in the second channel, D, C remain in the first)
  58. When Flume Is Not A Good Fit ■ Very large events ● An event cannot be larger than the memory or disk on an agent’s machine ■ Infrequent bulk loads ● Other tools might be better e.g. HDFS File Slurper
  59. Configuration
  60. Configuration ■ Simple format ■ A configuration file can contain configuration settings for many Agents ● Only settings needed by the Agent will be loaded ■ Agent automatically reloads configuration if it changes
  61. Configuration Example ■ We configure Flume to run a single agent that 1. listens for data on a given port 2. turns each line of incoming text into an event 3. and sends it to HDFS via the in-memory channel.
  62. Agent
  63. Source
  64. Channel
  65. Sink
  66. Starting Agent
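      The configuration shown on these slides was not preserved in the transcript. The single-agent example described in slide 61 might be written roughly as follows, assuming a NetCat source, a memory channel and an HDFS sink; the names `a1`, `r1`, `c1`, `k1`, the port, the HDFS path and the file name `example.conf` are all illustrative assumptions.

      ```properties
      # Agent: name the components of agent a1
      a1.sources = r1
      a1.channels = c1
      a1.sinks = k1

      # Source: listen on a port and turn each line of incoming text into an event
      a1.sources.r1.type = netcat
      a1.sources.r1.bind = localhost
      a1.sources.r1.port = 44444
      a1.sources.r1.channels = c1

      # Channel: buffer events in memory
      a1.channels.c1.type = memory
      a1.channels.c1.capacity = 1000
      a1.channels.c1.transactionCapacity = 100

      # Sink: write events to HDFS as plain text
      a1.sinks.k1.type = hdfs
      a1.sinks.k1.channel = c1
      a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
      a1.sinks.k1.hdfs.fileType = DataStream
      # NetCat events carry no timestamp header, so use the agent's local time for the path
      a1.sinks.k1.hdfs.useLocalTimeStamp = true

      # Starting the agent (shell command, shown here as a comment):
      #   bin/flume-ng agent --conf conf --conf-file example.conf --name a1
      ```

      The `--name` value must match the agent prefix used in the file, which is how one file can hold settings for many agents.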
  67. There Is More!
  68. GetInData ■ Data consulting company ■ We help you benefit from data ● Look at our portfolio: http://getindata.com/portfolio ● Find our trainings: http://getindata.com/trainings ● Learn more about our team: http://getindata.com/team
