SlideShare a Scribd company logo
1 of 17
APACHE FLUME
Data pipelining and loading with Flume
Agenda
1. What is Flume?
1. Overview
2. Architecture
2. Down the rabbit hole
1. Events
2. Sources
3. Channels
4. Sinks
5. Configuration
6. Data manipulation with interceptors
7. Multiplexing
8. Customer Serializers
3. Use Cases
1. What is Flume?
• Flume is a data collection and aggregation framework,
which operates in distributed topologies.
• It is stream - oriented, fault tolerant and offers linear
scalability.
• It offers low latency with high throughput
• Its configuration is declarative, yet allows easy
extensibility
• It is quite mature – it has been around for quite a while,
enjoys support from multiple vendors with thousands of
large scale deployments.
1. What is Flume?
Flume defines a simple pipeline structure with three roles:
• Source
• Channel
• Sinks
Sources define where data comes from, e.g. a file, a
message Queue (Kafka, JMS)
Channels are pipes connecting Sources with Sinks
Sinks are the destination of the data pipelined from
Sources
1. What is Flume?
These three roles run within a JVM process called Agents.
Data is packed into Events, which are a simple Avro
wrapper around any type of record, typically a line from a
log file.
1. What is Flume?
Agents can be chained into a
tree structure to allow multiple
collection from various sources,
concentrated and piped to one
destination.
• Shown here:
• Many agents (1 per host) at
edge tier
• 4 “collector” agents in that
data center
• 3 centralized agents writing
to HDFS
2. Down the rabbit hole - Sources
• Event-driven or polling-based
• Most sources can accept batches of events
• Stock source implementations:
• Spool Directory
• HTTP
• Kafka
• JMS
• Netcat
• AvroRPC
• AvroRPC is use by Flume to communicate between
Agents
2. Down the rabbit hole - Channels
• Passive “Glue” between Sources and Sinks
• Channel type determines the reliability guarantees
• Stock channel types:
• JDBC – has performance issues
• Memory – lower latency for small writes, but not durable
• File – provides durability; most people use this
• New: Kafka – topics operate as channels, durable and fast
2. Down the rabbit hole - Sinks
• All sinks are polling-based
• Most sinks can process batches of events at a time
• Stock sink implementations:
• Kafka
• HDFS
• HBase (2 variants – Sinc and Async)
• Solr
• ElasticSearch
• Avro-RPC, Thrift-RPC – Flume agent inter-connect
• File Roller – write to local filesystem
• Null, Logger– for testing purposes
2. Down the rabbit hole - Pipeline
• Source: Puts events into the local channel
• Channel: Store events until someone takes them
• Sink: Takes events from the local channel
• On failure, sinks backoff and retry forever until success
2. Down the rabbit hole - Configuration
• Declarative configuration
with property files
• Define your source
• Define the sink
• Declare the channel for
source and sink
####### Spool Config
SpoolAgent.sources = MySpooler
SpoolAgent.channels = MemChannel
SpoolAgent.sinks = HDFS
SpoolAgent.channels.MemChannel.type = memory
SpoolAgent.channels.MemChannel.capacity = 500
SpoolAgent.channels.MemChannel.transactionCapacity = 200
SpoolAgent.sources.MySpooler.channels = MemChannel
SpoolAgent.sources.MySpooler.type = spooldir
SpoolAgent.sources.MySpooler.spoolDir = /var/log/
SpoolAgent.sources.MySpooler.fileHeader = false
SpoolAgent.sinks.HDFS.channel = MemChannel
SpoolAgent.sinks.HDFS.type = hdfs
SpoolAgent.sinks.HDFS.hdfs.path = hdfs://cluster/data/spool/
SpoolAgent.sinks.HDFS.hdfs.fileType = DataStream
SpoolAgent.sinks.HDFS.hdfs.writeFormat = Text
SpoolAgent.sinks.HDFS.hdfs.batchSize = 100
SpoolAgent.sinks.HDFS.hdfs.rollSize = 0
SpoolAgent.sinks.HDFS.hdfs.rollCount = 0
SpoolAgent.sinks.HDFS.hdfs.rollInterval = 300
2. Down the rabbit hole - Interceptors
• Incoming data can be transformed and enriched at the
source
• One or more interceptors can modify your events and set
headers
• These headers can be used for sorting within sinks and
routing to different sinks
• Stock Interceptor Implementations
• Timestamp (adds a timestamp header)
• Hostname (adds the flume host as a header)
• Regex (extracts the applied regex to a header)
• Static (sets a static constant header)
2. Down the rabbit hole - Interceptors
• Custom Interceptors can be configured by implementing
• org.apache.flume.interceptor.Interceptor
• org.apache.flume.interceptor.Interceptor.Builder
• Pass config params through flume’s Context:
public void configure(Context context) {
context.getString(“prop”)
}
Use Cases:
• augment or change Event bodies, i.e. record conversion
• Add headers for routing or sorting
2. Down the rabbit hole - Selectors
• Multiplexing selectors allows header sensitive dispatch to
specific destination sinks.
• Replicating selectors allow simultaneous writes to multiple
sinks.
2. Down the rabbit hole - Serializers
• Serializers customize the format written to sink.
• Built-in serializers are generally fine for text
• More complicated use case such as Avro require
customer serializers
• HDFS, HBase and FileRoller Sinks support customer
Serializers.
• Custom Serializers implement specific interfaces / extend
abstract classes:
• AsyncHbaseEventSerializer (HBase)
• AbstractAvroEventSerializer (HDFS)
• EventSerializer (Generic)
2. Use Cases
Flume is great for:
• Continuous, sorted writing to HDFS from message queues
• Log aggregation from various sources
• On the fly format conversion – e.g. text -> Avro
• Elastic Search loading from upstream sources
• Automated Ingest from external providers
Keep in mind:
• Flume isn’t a stream processing framework – complex processing
is a no go
• Events are pipelined one to one – with some exceptions, you
generally can’t create several new events from one incoming event
Thanks!

More Related Content

What's hot

Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeperSaurav Haloi
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.Taras Matyashovsky
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used forAljoscha Krettek
 
Apache Hadoop Security - Ranger
Apache Hadoop Security - RangerApache Hadoop Security - Ranger
Apache Hadoop Security - RangerIsheeta Sanghi
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrChristos Manios
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
AMQP: interopérabilité et découplage de systèmes hétérogènes avec RabbitMQ
AMQP: interopérabilité et découplage de systèmes hétérogènes avec RabbitMQAMQP: interopérabilité et découplage de systèmes hétérogènes avec RabbitMQ
AMQP: interopérabilité et découplage de systèmes hétérogènes avec RabbitMQMicrosoft
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!Guido Schmutz
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in ProductionRobert Sanders
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyftTao Feng
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Databricks
 
How Apache Kafka® Works
How Apache Kafka® WorksHow Apache Kafka® Works
How Apache Kafka® Worksconfluent
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache RangerDataWorks Summit
 

What's hot (20)

Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Spark
SparkSpark
Spark
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 
Apache Hadoop Security - Ranger
Apache Hadoop Security - RangerApache Hadoop Security - Ranger
Apache Hadoop Security - Ranger
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
AMQP: interopérabilité et découplage de systèmes hétérogènes avec RabbitMQ
AMQP: interopérabilité et découplage de systèmes hétérogènes avec RabbitMQAMQP: interopérabilité et découplage de systèmes hétérogènes avec RabbitMQ
AMQP: interopérabilité et découplage de systèmes hétérogènes avec RabbitMQ
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Airflow at lyft
Airflow at lyftAirflow at lyft
Airflow at lyft
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
How Apache Kafka® Works
How Apache Kafka® WorksHow Apache Kafka® Works
How Apache Kafka® Works
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 

Viewers also liked

ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016Jayesh Thakrar
 
Semicolon 2013 presentation
Semicolon 2013 presentationSemicolon 2013 presentation
Semicolon 2013 presentationPandurang Kamat
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...Cloudera, Inc.
 

Viewers also liked (9)

ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016ApacheCon-Flume-Kafka-2016
ApacheCon-Flume-Kafka-2016
 
Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
 
Semicolon 2013 presentation
Semicolon 2013 presentationSemicolon 2013 presentation
Semicolon 2013 presentation
 
HadoopFileFormats_2016
HadoopFileFormats_2016HadoopFileFormats_2016
HadoopFileFormats_2016
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
 
Flume vs. kafka
Flume vs. kafkaFlume vs. kafka
Flume vs. kafka
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 

Similar to Apache flume - an Introduction

Low Latency Streaming Data Processing in Hadoop
Low Latency Streaming Data Processing in HadoopLow Latency Streaming Data Processing in Hadoop
Low Latency Streaming Data Processing in HadoopInSemble
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with FlumeRatnakar Pawar
 
Hadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming ArchitectureHadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming ArchitectureInSemble
 
Flume lspe-110325145754-phpapp01
Flume lspe-110325145754-phpapp01Flume lspe-110325145754-phpapp01
Flume lspe-110325145754-phpapp01joahp
 
Introduction to Kafka Streams Presentation
Introduction to Kafka Streams PresentationIntroduction to Kafka Streams Presentation
Introduction to Kafka Streams PresentationKnoldus Inc.
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka IntroductionAmita Mirajkar
 
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...xKinAnx
 
Modern Distributed Messaging and RPC
Modern Distributed Messaging and RPCModern Distributed Messaging and RPC
Modern Distributed Messaging and RPCMax Alexejev
 
Understand oracle real application cluster
Understand oracle real application clusterUnderstand oracle real application cluster
Understand oracle real application clusterSatishbabu Gunukula
 
Apache flume by Swapnil Dubey
Apache flume by Swapnil DubeyApache flume by Swapnil Dubey
Apache flume by Swapnil DubeySwapnil Dubey
 
Fundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache KafkaFundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache KafkaAngelo Cesaro
 
UNIT II DIS.pptx
UNIT II DIS.pptxUNIT II DIS.pptx
UNIT II DIS.pptxSamPrem3
 
Unleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptxUnleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptxKnoldus Inc.
 
OpenStack High Availability
OpenStack High AvailabilityOpenStack High Availability
OpenStack High AvailabilityJakub Pavlik
 
OpenStack HA
OpenStack HAOpenStack HA
OpenStack HAtcp cloud
 

Similar to Apache flume - an Introduction (20)

Low Latency Streaming Data Processing in Hadoop
Low Latency Streaming Data Processing in HadoopLow Latency Streaming Data Processing in Hadoop
Low Latency Streaming Data Processing in Hadoop
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with Flume
 
Hadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming ArchitectureHadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming Architecture
 
Cloudera's Flume
Cloudera's FlumeCloudera's Flume
Cloudera's Flume
 
Flume lspe-110325145754-phpapp01
Flume lspe-110325145754-phpapp01Flume lspe-110325145754-phpapp01
Flume lspe-110325145754-phpapp01
 
Introduction to Kafka Streams Presentation
Introduction to Kafka Streams PresentationIntroduction to Kafka Streams Presentation
Introduction to Kafka Streams Presentation
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...
Ibm spectrum scale fundamentals workshop for americas part 4 Replication, Str...
 
Modern Distributed Messaging and RPC
Modern Distributed Messaging and RPCModern Distributed Messaging and RPC
Modern Distributed Messaging and RPC
 
Understand oracle real application cluster
Understand oracle real application clusterUnderstand oracle real application cluster
Understand oracle real application cluster
 
Apache flume by Swapnil Dubey
Apache flume by Swapnil DubeyApache flume by Swapnil Dubey
Apache flume by Swapnil Dubey
 
Apache storm
Apache stormApache storm
Apache storm
 
Fundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache KafkaFundamentals and Architecture of Apache Kafka
Fundamentals and Architecture of Apache Kafka
 
UNIT II DIS.pptx
UNIT II DIS.pptxUNIT II DIS.pptx
UNIT II DIS.pptx
 
Unleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptxUnleashing Real-time Power with Kafka.pptx
Unleashing Real-time Power with Kafka.pptx
 
OpenStack High Availability
OpenStack High AvailabilityOpenStack High Availability
OpenStack High Availability
 
OpenStack HA
OpenStack HAOpenStack HA
OpenStack HA
 
QoS, QoS Baby
QoS, QoS BabyQoS, QoS Baby
QoS, QoS Baby
 
Hadoop
HadoopHadoop
Hadoop
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
 

Recently uploaded

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 

Recently uploaded (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

Apache flume - an Introduction

  • 1. APACHE FLUME Data pipelining and loading with Flume
  • 2. Agenda 1. What is Flume? 1. Overview 2. Architecture 2. Down the rabbit hole 1. Events 2. Sources 3. Channels 4. Sinks 5. Configuration 6. Data manipulation with interceptors 7. Multiplexing 8. Customer Serializers 3. Use Cases
  • 3. 1. What is Flume? • Flume is a data collection and aggregation framework, which operates in distributed topologies. • It is stream - oriented, fault tolerant and offers linear scalability. • It offers low latency with high throughput • Its configuration is declarative, yet allows easy extensibility • It is quite mature – it has been around for quite a while, enjoys support from multiple vendors with thousands of large scale deployments.
  • 4. 1. What is Flume? Flume defines a simple pipeline structure with three roles: • Source • Channel • Sinks Sources define where data comes from, e.g. a file, a message Queue (Kafka, JMS) Channels are pipes connecting Sources with Sinks Sinks are the destination of the data pipelined from Sources
  • 5. 1. What is Flume? These three roles run within a JVM process called Agents. Data is packed into Events, which are a simple Avro wrapper around any type of record, typically a line from a log file.
  • 6. 1. What is Flume? Agents can be chained into a tree structure to allow multiple collection from various sources, concentrated and piped to one destination. • Shown here: • Many agents (1 per host) at edge tier • 4 “collector” agents in that data center • 3 centralized agents writing to HDFS
  • 7. 2. Down the rabbit hole - Sources • Event-driven or polling-based • Most sources can accept batches of events • Stock source implementations: • Spool Directory • HTTP • Kafka • JMS • Netcat • AvroRPC • AvroRPC is use by Flume to communicate between Agents
  • 8. 2. Down the rabbit hole - Channels • Passive “Glue” between Sources and Sinks • Channel type determines the reliability guarantees • Stock channel types: • JDBC – has performance issues • Memory – lower latency for small writes, but not durable • File – provides durability; most people use this • New: Kafka – topics operate as channels, durable and fast
  • 9. 2. Down the rabbit hole - Sinks • All sinks are polling-based • Most sinks can process batches of events at a time • Stock sink implementations: • Kafka • HDFS • HBase (2 variants – Sinc and Async) • Solr • ElasticSearch • Avro-RPC, Thrift-RPC – Flume agent inter-connect • File Roller – write to local filesystem • Null, Logger– for testing purposes
  • 10. 2. Down the rabbit hole - Pipeline • Source: Puts events into the local channel • Channel: Store events until someone takes them • Sink: Takes events from the local channel • On failure, sinks backoff and retry forever until success
  • 11. 2. Down the rabbit hole - Configuration • Declarative configuration with property files • Define your source • Define the sink • Declare the channel for source and sink ####### Spool Config SpoolAgent.sources = MySpooler SpoolAgent.channels = MemChannel SpoolAgent.sinks = HDFS SpoolAgent.channels.MemChannel.type = memory SpoolAgent.channels.MemChannel.capacity = 500 SpoolAgent.channels.MemChannel.transactionCapacity = 200 SpoolAgent.sources.MySpooler.channels = MemChannel SpoolAgent.sources.MySpooler.type = spooldir SpoolAgent.sources.MySpooler.spoolDir = /var/log/ SpoolAgent.sources.MySpooler.fileHeader = false SpoolAgent.sinks.HDFS.channel = MemChannel SpoolAgent.sinks.HDFS.type = hdfs SpoolAgent.sinks.HDFS.hdfs.path = hdfs://cluster/data/spool/ SpoolAgent.sinks.HDFS.hdfs.fileType = DataStream SpoolAgent.sinks.HDFS.hdfs.writeFormat = Text SpoolAgent.sinks.HDFS.hdfs.batchSize = 100 SpoolAgent.sinks.HDFS.hdfs.rollSize = 0 SpoolAgent.sinks.HDFS.hdfs.rollCount = 0 SpoolAgent.sinks.HDFS.hdfs.rollInterval = 300
  • 12. 2. Down the rabbit hole - Interceptors • Incoming data can be transformed and enriched at the source • One or more interceptors can modify your events and set headers • These headers can be used for sorting within sinks and routing to different sinks • Stock Interceptor Implementations • Timestamp (adds a timestamp header) • Hostname (adds the flume host as a header) • Regex (extracts the applied regex to a header) • Static (sets a static constant header)
  • 13. 2. Down the rabbit hole - Interceptors • Custom Interceptors can be configured by implementing • org.apache.flume.interceptor.Interceptor • org.apache.flume.interceptor.Interceptor.Builder • Pass config params through flume’s Context: public void configure(Context context) { context.getString(“prop”) } Use Cases: • augment or change Event bodies, i.e. record conversion • Add headers for routing or sorting
  • 14. 2. Down the rabbit hole - Selectors • Multiplexing selectors allows header sensitive dispatch to specific destination sinks. • Replicating selectors allow simultaneous writes to multiple sinks.
  • 15. 2. Down the rabbit hole - Serializers • Serializers customize the format written to sink. • Built-in serializers are generally fine for text • More complicated use case such as Avro require customer serializers • HDFS, HBase and FileRoller Sinks support customer Serializers. • Custom Serializers implement specific interfaces / extend abstract classes: • AsyncHbaseEventSerializer (HBase) • AbstractAvroEventSerializer (HDFS) • EventSerializer (Generic)
  • 16. 2. Use Cases Flume is great for: • Continuous, sorted writing to HDFS from message queues • Log aggregation from various sources • On the fly format conversion – e.g. text -> Avro • Elastic Search loading from upstream sources • Automated Ingest from external providers Keep in mind: • Flume isn’t a stream processing framework – complex processing is a no go • Events are pipelined one to one – with some exceptions, you generally can’t create several new events from one incoming event