Apache Flume
Loading Big Data into Hadoop Cluster using Flume
Swapnil Dubey
Big Data Hacker
GoDataDriven
Agenda
➢ What is Apache Flume?
➢ Problem statement
➢ Use Case : Collecting web server logs
➢ Overview/Architecture of Flume
➢ Demos
What is Flume?
Collection & Aggregation of Streaming Data
- Typically used for log data.
Advantages over other solutions:
➢ Scalable, Reliable, Customizable
➢ Declarative and Dynamic Configuration
➢ Contextual Routing
➢ Feature Rich and Fully Extensible
➢ Open source
Problem Statement
[Figure: Application Servers (LOGS) and the Hadoop Cluster]
Problem Statement
➢ Data collection is ad hoc
➢ How to get data into Hadoop
➢ Streaming Data
Problem Statement
[Figure: Application Servers (LOGS) -> Flume Agent -> HDFS Write -> Hadoop Cluster]
Collecting web server logs
➢ Collecting web logs using:
- A single Flume agent
- Multiple Flume agents
➢ Typical converging flow
- Converging flow characteristics: load balancing, multiplexing, failover
- Large converging flows
- Event volume
Problem Statement: Single Flume Agent
[Figure: Application Servers (LOGS) -> Flume Agent -> HDFS Write -> Hadoop Cluster]
Problem Statement: Multiple Flume Agents (1)
[Figure: Application Servers (LOGS), three Flume Agents, three HDFS Writes, Hadoop Cluster]
Problem Statement: Multiple Flume Agents (2)
[Figure: Application Servers (LOGS), four Flume Agents, a single HDFS Write, Hadoop Cluster]
Overview/Architecture of Flume
Components of Flume
[Figure: Client, Events]
Core Concepts
➢ Events
➢ Client
➢ Agents
- Source, Channel, Sink
- Interceptor
- Channel Selector
- Sink Processor
Core Concept: Event
An Event is the basic unit of data transported by Flume from
source to destination.
➢ Payload is opaque to Flume.
➢ Events are accompanied by optional headers.
Headers:
- Headers are a collection of unique key-value pairs
- Headers are used for contextual routing
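One place where headers show up in practice is sink configuration. The fragment below is a minimal sketch, not part of the original deck: it assumes every event carries a hypothetical "datacenter" header and reuses the hdfs-write sink name from the simple agent shown later in these slides. The HDFS sink expands %{header} and time escape sequences in its path.

```properties
# Assumption: events carry a "datacenter" header; the header name is illustrative.
agent.sinks.hdfs-write.type = hdfs
# %{datacenter} is replaced by the header value, %Y-%m-%d by the event date.
agent.sinks.hdfs-write.hdfs.path = hdfs://namenode_address:8020/path/to/flume_test/%{datacenter}/%Y-%m-%d
# Time escapes need a "timestamp" header; this setting uses the agent's local clock instead.
agent.sinks.hdfs-write.hdfs.useLocalTimeStamp = true
```

With this layout, events from different data centers land in different HDFS directories without Flume ever inspecting the payload.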
Core Concept: Client
Entity that generates events and sends them to one or more
agents.
➢ Example: Flume log4j Appender
➢ Decouples Flume from the system where event data is generated.
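As an illustration of the log4j Appender client, here is a hedged sketch of a log4j 1.x properties file for an application. The host and port are placeholders, the logger layout is an assumption, and the setup presumes the Flume log4j appender jar is on the application classpath and an agent with an Avro source is listening at that address.

```properties
# Route application log events to a Flume agent (host and port are placeholders).
log4j.rootLogger = INFO, flume
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
# Address of an Avro source on the receiving Flume agent.
log4j.appender.flume.Hostname = localhost
log4j.appender.flume.Port = 41414
```

The application keeps logging through log4j as usual; only the appender knows that Flume is on the other end.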
Core Concepts: Agent
Container for hosting Sources, Channels, Sinks and other
components.
Core Concepts: Source
Component that receives events and places them onto one or more
channels.
➢ Different types of sources:
- Specialized sources for integrating with well-known systems,
for example Syslog, Netcat
- Auto-generating sources: Exec, SEQ (sequence generator)
- IPC sources for Agent to Agent communication: Avro, Thrift
➢ Requires at least one Channel to function.
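For the web server log use case, an Exec source is a common starting point. The fragment below is a sketch under assumptions: the source name tail-access and the log path are hypothetical, and memoryChannel is the channel name used in the simple agent later in this deck.

```properties
# Assumption: the web server writes its access log to this path.
agent.sources = tail-access
agent.sources.tail-access.type = exec
# Each line printed by the command becomes one Flume event.
agent.sources.tail-access.command = tail -F /var/log/httpd/access_log
# A source must be connected to at least one channel.
agent.sources.tail-access.channels = memoryChannel
```

Note that the Exec source cannot tell the command to replay output, so lines can be lost if the agent or the channel fails while the command keeps running.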
Core Concept: Channel
Component that buffers incoming events which are ultimately
consumed by Sinks.
➢ Different channels: Memory, File, Database (JDBC)
➢ Channels are fully transactional.
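The choice of channel is a trade-off between speed and durability: a memory channel loses buffered events if the agent process dies, while a file channel persists them to disk. A minimal file channel sketch follows; the channel name, directories, and sizes are illustrative assumptions.

```properties
agent.channels = fileChannel
agent.channels.fileChannel.type = file
# Assumption: these directories exist and are writable by the Flume user.
agent.channels.fileChannel.checkpointDir = /var/flume/checkpoint
agent.channels.fileChannel.dataDirs = /var/flume/data
# Maximum number of events held in the channel, and per transaction.
agent.channels.fileChannel.capacity = 1000000
agent.channels.fileChannel.transactionCapacity = 1000
```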
Core Concepts: Sink
Component that takes events from a channel and transmits them to
the next hop destination.
Different types of sinks:
- Terminal sinks: HDFS, HBase
- Auto-consuming sinks: Null Sink
- IPC sinks for Agent to Agent communication: Avro, Thrift
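Agent to Agent communication pairs an Avro sink on the sending agent with an Avro source on the receiving agent. The sketch below shows only the connecting properties; the agent names web and collector, the host collector.example.com, and the port are hypothetical, and each agent still needs its own sources, channels, and sinks declared.

```properties
# Sending agent ("web"), running on an application server.
web.sinks.avro-forward.type = avro
web.sinks.avro-forward.hostname = collector.example.com
web.sinks.avro-forward.port = 4545
web.sinks.avro-forward.channel = memoryChannel

# Receiving agent ("collector"), running close to the Hadoop cluster.
collector.sources.avro-in.type = avro
collector.sources.avro-in.bind = 0.0.0.0
collector.sources.avro-in.port = 4545
collector.sources.avro-in.channels = memoryChannel
```

Several sending agents can point at the same collector, which is how a converging flow is built.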
Core Concepts: Interceptor
Interceptors are applied to a source in a predetermined order to
add information to events and to filter them.
➢ Built-in interceptors: add headers such as timestamps, static markers,
etc.
➢ Custom interceptors: create headers by
inspecting the Event.
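A sketch of two built-in interceptors chained on one source is shown below. The source name netcat-collect matches the simple agent later in this deck; the interceptor names and the "datacenter" key and value are illustrative assumptions.

```properties
# Interceptors run in the order listed here.
agent.sources.netcat-collect.interceptors = ts marker
# Adds a "timestamp" header with the time the event was processed.
agent.sources.netcat-collect.interceptors.ts.type = timestamp
# Adds the same fixed header to every event (key and value are assumptions).
agent.sources.netcat-collect.interceptors.marker.type = static
agent.sources.netcat-collect.interceptors.marker.key = datacenter
agent.sources.netcat-collect.interceptors.marker.value = east
```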
Channel Selector
It facilitates selection of one or more Channels, based on preset
criteria.
➢ Built-in Channel Selectors:
- Replicating: for duplicating events
- Multiplexing: for routing based on headers
➢ Custom selectors can be written for dynamic criteria.
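A multiplexing selector can be sketched as follows. The channel names mchannel1 and mchannel2 appear in the fan-out example later in this deck; the "datacenter" header and its values east and west are hypothetical.

```properties
agent.sources.netcat-collect.channels = mchannel1 mchannel2
agent.sources.netcat-collect.selector.type = multiplexing
# Route on the value of this header (header name is an assumption).
agent.sources.netcat-collect.selector.header = datacenter
agent.sources.netcat-collect.selector.mapping.east = mchannel1
agent.sources.netcat-collect.selector.mapping.west = mchannel2
# Events with a missing or unmapped header value go here.
agent.sources.netcat-collect.selector.default = mchannel1
```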
Sink Processor
Sink Processor is responsible for invoking one sink from a
specified group of Sinks.
➢ Built-in Sink Processors:
- Load Balancing Sink Processor
- Failover Sink Processor
- Default Sink Processor
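Sink processors are configured through sink groups. The sketch below sets up failover between two sinks; the group name g1 and the sink names hdfs-write-1 and hdfs-write-2 are hypothetical, and both sinks must be defined elsewhere in the same agent configuration.

```properties
agent.sinkgroups = g1
agent.sinkgroups.g1.sinks = hdfs-write-1 hdfs-write-2
# Failover: always use the highest-priority sink that is currently healthy.
agent.sinkgroups.g1.processor.type = failover
agent.sinkgroups.g1.processor.priority.hdfs-write-1 = 10
agent.sinkgroups.g1.processor.priority.hdfs-write-2 = 5
# Upper limit, in milliseconds, on the backoff for a failed sink.
agent.sinkgroups.g1.processor.maxpenalty = 10000

# Alternative: spread events across the sinks instead of failing over.
# agent.sinkgroups.g1.processor.type = load_balance
# agent.sinkgroups.g1.processor.selector = round_robin
# agent.sinkgroups.g1.processor.backoff = true
```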
Data Ingest
[Figure: Clients send Events to the Source. The Channel Processor passes the unfiltered Events through the Interceptor, and the Channel Selector decides which Channels receive the filtered Events.]
Data Drain
➢ Event Removal from Channel is transactional.
[Figure: the Sink Runner drives the Sink Processor, which selects and invokes a Sink; the Sink takes events from the Channels and sends them to the Next Hop.]
Agent Pipeline
* Credits: http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data-Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf
Assured Delivery
Agents use transactional exchange to guarantee
delivery across hops.
[Figure: Channel -> Sink -> Source -> Channel across two agents. Each side runs Start Transaction, take Events, end transaction, with Send events between the Sink and the Source.]
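How many events move in one transaction is configurable. The fragment below is a sketch with illustrative values, reusing the memoryChannel and hdfs-write names from the simple agent in this deck. A sink's batch size should not exceed the transaction capacity of the channel it reads from.

```properties
# Maximum number of events the channel hands over in a single transaction.
agent.channels.memoryChannel.transactionCapacity = 1000
# Number of events the HDFS sink takes from the channel before flushing to HDFS.
agent.sinks.hdfs-write.hdfs.batchSize = 1000
```

If a batch fails partway, the transaction is rolled back and the events stay in the channel for a later attempt.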
Setting up a simple agent for HDFS
# Name the components of the agent called "agent"
agent.sources = netcat-collect
agent.sinks = hdfs-write
agent.channels = memoryChannel
# Source: listen for newline-terminated text on a local port
agent.sources.netcat-collect.type = netcat
agent.sources.netcat-collect.bind = 127.0.0.1
agent.sources.netcat-collect.port = 11111
# Sink: write events to HDFS as plain text with a 30-second roll interval
agent.sinks.hdfs-write.type = hdfs
agent.sinks.hdfs-write.hdfs.path = hdfs://namenode_address:8020/path/to/flume_test
agent.sinks.hdfs-write.hdfs.rollInterval = 30
agent.sinks.hdfs-write.hdfs.writeFormat = Text
agent.sinks.hdfs-write.hdfs.fileType = DataStream
# Channel: buffer up to 10000 events in memory
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000
# Wire the source and the sink to the channel
agent.sources.netcat-collect.channels = memoryChannel
agent.sinks.hdfs-write.channel = memoryChannel
Advanced Features
Fan-In and Fan-Out
# Fan-out: the replicating selector copies every event to both channels
hdfs-agent.channels = mchannel1 mchannel2
hdfs-agent.sources.netcat-collect.selector.type = replicating
hdfs-agent.sources.netcat-collect.channels = mchannel1 mchannel2
Interceptors
# Drop every event whose body starts with "echo"
hdfs-agent.sources.netcat-collect.interceptors = filt_int
hdfs-agent.sources.netcat-collect.interceptors.filt_int.type = regex_filter
hdfs-agent.sources.netcat-collect.interceptors.filt_int.regex = ^echo.*
hdfs-agent.sources.netcat-collect.interceptors.filt_int.excludeEvents = true
Got BigData & Analytics work? Contact india@GoDataDriven.com
We are hiring!!

Apache flume by Swapnil Dubey

  • 1. Apache Flume Loading Big Data into Hadoop Cluster using Flume Swapnil Dubey Big Data Hacker GoDataDriven
  • 2. Agenda ➢ What is Apache Flume? ➢ Problem statement ➢ Use Case : Collecting web server logs ➢ Overview/Architecture of Flume ➢ Demos
  • 3. What is Flume? Collection & Aggregation of Streaming Data - Typically used for log data. Advantages over other solutions:- ➢ Scalable, Reliable, Customizable ➢ Declarative and Dynamic Configuration ➢ Contextual Routing ➢ Feature Rich and Fully Extensible ➢ Open source
  • 6. Problem Statement ➢ Data collection is Ad hoc ➢ How to get data to Hadoop ➢ Streaming Data
  • 7. Problem Statement [Diagram: logs from the application servers pass through a Flume agent, which writes them to HDFS on the Hadoop cluster]
  • 8. Collecting web server logs ➢ Collecting web logs using: - A single Flume agent - Multiple Flume agents ➢ Typical converging flow - Converging flow characteristics: Load Balancing, Multiplexing, Failover - Large converging flows - Event volume
  • 9. Problem Statement: Single Flume Agent [Diagram: logs from the application servers pass through one Flume agent, which performs the HDFS write into the Hadoop cluster]
  • 10. Problem Statement: Multiple Flume Agents (1) [Diagram: logs from the application servers pass through three Flume agents, each with its own HDFS write into the Hadoop cluster]
  • 11. Problem Statement: Multiple Flume Agents (2) [Diagram: logs from the application servers pass through four Flume agents, with a single HDFS write into the Hadoop cluster]
  • 14. Core Concepts ➢ Events ➢ Client ➢ Agents - Source, Channel, Sink - Interceptor - Channel Selector - Sink Processor
  • 15. Core Concept: Event An Event is the basic unit of data transported by Flume from source to destination. ➢ Payload is opaque to Flume. ➢ Events are accompanied by optional headers. Headers: - Headers are a collection of unique key-value pairs - Headers are used for contextual routing
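The event model on this slide can be pictured with a small sketch. The `Event` class and the header names (`host`, `type`) are hypothetical illustrations, not Flume's own API; the point is only that the body stays opaque bytes while routing reads the headers.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for an event: opaque body plus optional headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


# The body is just bytes; nothing in this sketch ever parses it.
event = Event(body=b"GET /index.html 200",
              headers={"host": "web01", "type": "access"})

# A routing decision looks only at the headers, never at the payload.
route_key = event.headers.get("type", "default")
print(route_key)
```

Because keys in a dictionary are unique, the "unique key-value pairs" property of headers falls out of the data structure itself.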
  • 16. Core Concept: Client Entity that generates events and passes them to one or more agents. ➢ Example: Flume log4j Appender ➢ Decouples Flume from the system where event data is generated.
  • 17. Core Concepts: Agent Container for hosting Sources, Channels, Sinks and other components.
  • 18. Core Concepts: Source Component that receives events and places them onto one or more channels. ➢ Different types of sources: - Specialized sources for integrating with well-known systems, for example Syslog, Netcat - Auto-generating sources: Exec, SEQ - IPC sources for agent-to-agent communication: Avro, Thrift ➢ Requires at least one Channel to function.
  • 19. Core Concept: Channel Component that buffers incoming events until they are consumed by Sinks. ➢ Different channels: Memory, File, Database ➢ Channels are fully transactional.
  • 20. Core Concepts: Sink Component that takes events from a channel and transmits them to the next-hop destination. Different types of sinks: - Terminal sinks: HDFS, HBase - Auto-consuming sinks: Null Sink - IPC sinks for agent-to-agent communication: Avro, Thrift
  • 21. Core Concepts: Interceptor Interceptors are applied to sources in a predetermined order to add information to events and to filter them. ➢ Built-in interceptors: allow adding headers such as timestamps, static markers etc. ➢ Custom interceptors: create headers by inspecting the Event.
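A rough sketch of an interceptor chain, assuming an event is modelled as a `(headers, body)` tuple. The function names and the tuple shape are made up for illustration and are not Flume classes; the chain simply runs in a fixed order, and any interceptor may drop the event by returning `None`.

```python
import re
import time
from typing import Callable, Dict, List, Optional, Tuple

Event = Tuple[Dict[str, str], bytes]          # (headers, body); hypothetical shape
Interceptor = Callable[[Event], Optional[Event]]


def timestamp_interceptor(event: Event) -> Optional[Event]:
    """Add a 'timestamp' header holding the current time in milliseconds."""
    headers, body = event
    return ({**headers, "timestamp": str(int(time.time() * 1000))}, body)


def regex_filter(pattern: str, exclude: bool) -> Interceptor:
    """Build a filter: exclude=True drops matches, exclude=False drops non-matches."""
    compiled = re.compile(pattern)

    def intercept(event: Event) -> Optional[Event]:
        text = event[1].decode("utf-8", errors="replace")
        matched = compiled.search(text) is not None
        return None if matched == exclude else event

    return intercept


def apply_chain(chain: List[Interceptor], event: Event) -> Optional[Event]:
    """Run interceptors in list order; stop as soon as one drops the event."""
    current: Optional[Event] = event
    for interceptor in chain:
        current = interceptor(current)
        if current is None:
            return None
    return current


chain = [timestamp_interceptor, regex_filter(r"^echo.*", exclude=True)]
kept = apply_chain(chain, ({}, b"hello"))
dropped = apply_chain(chain, ({}, b"echo test"))
```

Here `kept` carries a new `timestamp` header, while `dropped` is `None` because its body matches the excluded pattern.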
  • 22. Channel Selector Facilitates selection of one or more Channels, based on preset criteria. ➢ Built-in channel selectors: - Replicating: for duplicating events - Multiplexing: for routing based on headers ➢ Custom selectors can be written for dynamic criteria.
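The difference between the two built-in selectors can be sketched as follows. The function names, the `type` header and its values are assumptions chosen for illustration; only the channel names echo the later configuration slide.

```python
from typing import Callable, Dict, List

Headers = Dict[str, str]
Selector = Callable[[Headers], List[str]]


def replicating_selector(channels: List[str]) -> Selector:
    """Every event goes to every configured channel."""
    return lambda headers: list(channels)


def multiplexing_selector(header: str,
                          mapping: Dict[str, List[str]],
                          default: List[str]) -> Selector:
    """Route on the value of one header; fall back to `default` when unmapped."""
    def select(headers: Headers) -> List[str]:
        return mapping.get(headers.get(header, ""), default)
    return select


replicate = replicating_selector(["mchannel1", "mchannel2"])
by_type = multiplexing_selector(
    "type",
    {"access": ["mchannel1"], "error": ["mchannel2"]},
    default=["mchannel1"],
)

print(replicate({"type": "error"}))
print(by_type({"type": "error"}))
print(by_type({}))
```

A replicating selector ignores the headers entirely, whereas a multiplexing selector is where the "contextual routing" role of headers becomes visible.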
  • 23. Sink Processor Sink Processor is responsible for invoking one sink from a specified group of Sinks. ➢ Built-in sink processors: - Load Balancing Sink Processor - Failover Sink Processor - Default Sink Processor
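As a simplified sketch of how a failover and a load-balancing processor might each pick one sink from a group: sinks are modelled here as plain callables that raise `ConnectionError` when the next hop is unreachable, which is an assumption of this example rather than Flume's interface, and only a round-robin policy is shown.

```python
from itertools import cycle
from typing import Callable, List

Sink = Callable[[bytes], None]  # hypothetical: raises ConnectionError on failure


def failover(sinks_by_priority: List[Sink], body: bytes) -> int:
    """Try sinks in priority order; return the index of the one that accepted."""
    for index, sink in enumerate(sinks_by_priority):
        try:
            sink(body)
            return index
        except ConnectionError:
            continue  # this sink is down, fall through to the next one
    raise ConnectionError("all sinks in the group failed")


class RoundRobin:
    """Load balancing by handing each event to the next sink in turn."""

    def __init__(self, sinks: List[Sink]) -> None:
        self._sinks = sinks
        self._order = cycle(range(len(sinks)))

    def process(self, body: bytes) -> int:
        index = next(self._order)
        self._sinks[index](body)
        return index


delivered: List[bytes] = []


def broken_sink(body: bytes) -> None:
    raise ConnectionError("next hop is down")


def healthy_sink(body: bytes) -> None:
    delivered.append(body)


chosen = failover([broken_sink, healthy_sink], b"event-1")
balancer = RoundRobin([healthy_sink, healthy_sink])
picks = [balancer.process(b"event") for _ in range(4)]
```

In both cases exactly one sink handles a given event; the processors differ only in how that sink is chosen.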
  • 25. Data Drain ➢ Event removal from the Channel is transactional. [Diagram: the Sink Runner drives the Sink Processor, which handles sink selection and invocation; the selected Sink takes events from the Channels and sends them to the next hop]
  • 26. Agent Pipeline * Credits: http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data- Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf
  • 27. Assured Delivery Agents use transactional exchange to guarantee delivery across hops. [Diagram: Source, Channel, Sink; at each hop the sequence is start transaction, take events, send events, end transaction]
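The transactional hand-off can be sketched with an in-memory queue. The `Channel` class and `drain` function below are hypothetical and much simpler than a real channel; they only illustrate that events leave the channel for good once the send to the next hop succeeds, and are put back otherwise.

```python
from collections import deque
from typing import Callable, Deque, List


class Channel:
    """Hypothetical in-memory channel with take/rollback semantics."""

    def __init__(self) -> None:
        self._queue: Deque[bytes] = deque()

    def put(self, event: bytes) -> None:
        self._queue.append(event)

    def take_batch(self, size: int) -> List[bytes]:
        batch: List[bytes] = []
        while self._queue and len(batch) < size:
            batch.append(self._queue.popleft())
        return batch

    def rollback(self, batch: List[bytes]) -> None:
        # Put the events back at the head, preserving their original order.
        self._queue.extendleft(reversed(batch))

    def __len__(self) -> int:
        return len(self._queue)


def drain(channel: Channel, send: Callable[[List[bytes]], None],
          batch_size: int = 2) -> bool:
    """Start transaction, take events, send them, then commit or roll back."""
    batch = channel.take_batch(batch_size)
    try:
        send(batch)
    except ConnectionError:
        channel.rollback(batch)   # delivery failed: events stay in the channel
        return False
    return True                   # commit: events are gone from the channel


def failing_send(batch: List[bytes]) -> None:
    raise ConnectionError("next hop unreachable")


channel = Channel()
for item in (b"e1", b"e2", b"e3"):
    channel.put(item)

received: List[bytes] = []
ok_first = drain(channel, failing_send)       # rolled back, nothing lost
ok_second = drain(channel, received.extend)   # delivered and committed
```

After the failed attempt all three events are still queued; after the successful one, two have moved to `received` and one remains.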
  • 28. Setting up a simple agent for HDFS
      agent.sources = netcat-collect
      agent.sinks = hdfs-write
      agent.channels = memoryChannel
      agent.sources.netcat-collect.type = netcat
      agent.sources.netcat-collect.bind = 127.0.0.1
      agent.sources.netcat-collect.port = 11111
      agent.sinks.hdfs-write.type = hdfs
      agent.sinks.hdfs-write.hdfs.path = hdfs://namenode_address:8020/path/to/flume_test
      agent.sinks.hdfs-write.hdfs.rollInterval = 30
      agent.sinks.hdfs-write.hdfs.writeFormat = Text
      agent.sinks.hdfs-write.hdfs.fileType = DataStream
      agent.channels.memoryChannel.type = memory
      agent.channels.memoryChannel.capacity = 10000
      agent.sources.netcat-collect.channels = memoryChannel
      agent.sinks.hdfs-write.channel = memoryChannel
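One way to sanity-check the wiring in a configuration like this before starting an agent is sketched below. The helper names `parse` and `undeclared_channels` are hypothetical and not part of Flume; the check only verifies that every channel referenced by a source or sink is also declared on the agent's `channels` line.

```python
from typing import Dict, List

CONFIG = """
agent.sources = netcat-collect
agent.sinks = hdfs-write
agent.channels = memoryChannel
agent.sources.netcat-collect.channels = memoryChannel
agent.sinks.hdfs-write.channel = memoryChannel
"""


def parse(text: str) -> Dict[str, str]:
    """Parse simple key = value lines, skipping blanks and # comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def undeclared_channels(props: Dict[str, str], agent: str) -> List[str]:
    """List source/sink references to channels the agent never declares."""
    declared = set(props.get(f"{agent}.channels", "").split())
    problems: List[str] = []
    for name in props.get(f"{agent}.sources", "").split():
        for channel in props.get(f"{agent}.sources.{name}.channels", "").split():
            if channel not in declared:
                problems.append(f"source {name} -> {channel}")
    for name in props.get(f"{agent}.sinks", "").split():
        channel = props.get(f"{agent}.sinks.{name}.channel", "")
        if channel not in declared:
            problems.append(f"sink {name} -> {channel}")
    return problems


problems = undeclared_channels(parse(CONFIG), "agent")
print(problems)
```

Note the asymmetry this check relies on: a source lists `channels` (plural, one or more), while a sink names a single `channel`.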
  • 29. Advanced Features
      Fan-In and Fan-Out
      hdfs-agent.channels = mchannel1 mchannel2
      hdfs-agent.sources.netcat-collect.selector.type = replicating
      hdfs-agent.sources.netcat-collect.channels = mchannel1 mchannel2
      Interceptors
      hdfs-agent.sources.netcat-collect.interceptors = filt_int
      hdfs-agent.sources.netcat-collect.interceptors.filt_int.type = regex_filter
      hdfs-agent.sources.netcat-collect.interceptors.filt_int.regex = ^echo.*
      hdfs-agent.sources.netcat-collect.interceptors.filt_int.excludeEvents = true
  • 30. Got Big Data & Analytics work? Contact india@GoDataDriven.com. We are hiring!