SlideShare a Scribd company logo
1 of 19
Download to read offline
Arinto Murdopo
 Josep Subirats
       Group 4
     EEDC 2012
Outline
● Current problem
● What is Apache Flume?
● The Flume Model
  ○ Flows and Nodes
  ○ Agent, Processor and Collector Nodes
  ○ Data and Control Path
● Flume goals
  ○ Reliability
  ○ Scalability
  ○ Extensibility
  ○ Manageability
● Use case: Near Realtime Aggregator
 
Current Problem
● Situation:
You have hundreds of services running in different servers
that produce lots of large logs which should be analyzed
altogether. You have Hadoop to process them.
 
● Problem:
How do I send all my logs to a place that has Hadoop? I
need a reliable, scalable, extensible and manageable way
to do it!
What is Apache Flume?
● It is a distributed data collection service that gets
    flows of data (like logs) from their source and
    aggregates them to where they have to be processed.
●   Goals: reliability, scalability, extensibility,
    manageability.




                   Exactly what I needed!
The Flume Model: Flows and Nodes

● A flow corresponds to a type of data source (server
    logs, machine monitoring metrics...).
●   Flows are comprised of nodes chained together (see
    slide 7).
The Flume Model: Flows and Nodes
● In a Node, data come in through a source...
   ...are optionally processed by one or more decorators...
   ...and then are transmitted out via a sink.
    
                 Examples: Console, Exec, Syslog, IRC,
                 Twitter, other nodes...
                  
                 Examples: Console, local files, HDFS, S3,
                 other nodes...
                  
                 Examples: wire batching, compression,
                 sampling, projection, extraction...
The Flume Model: Agent, Processor and
Collector Nodes

● Agent:
    receives data from an
    application.
 
● Processor (optional):
    intermediate processing.
 
● Collector:
    write data to permanent
    storage.
The Flume Model: Data and Control
Path (1/2)
Nodes are in the data path.
The Flume Model: Data and Control
Path (2/2)
Masters are in the control path.
● Centralized point of configuration. Multiple: ZK.
● Specify sources, sinks and control data flows.
Flume Goals: Reliability
Tunable Failure Recovery Modes
 
● Best Effort
 
● Store on Failure and Retry
 
● End to End Reliability
Flume Goals: Scalability
Horizontally Scalable Data Path




Load Balancing
Flume Goals: Scalability
Horizontally Scalable Control Path
Flume Goals: Extensibility
● Simple Source and Sink API
  ○ Event streaming and composition of simple
       operation
   
● Plug in Architecture
   ○ Add your own sources, sinks, decorators
    
    
Flume Goals: Manageability
Centralized Data Flow Management Interface
 
Flume Goals: Manageability
Configuring Flume
 
 
   Node: tail(“file”) | filter [ console, roll
   (1000) { dfs(“hdfs://namenode/user/flume”) } ]
   ;
Output Bucketing
                              /logs/web/2010/0715/1200/data-xxx.txt
                              /logs/web/2010/0715/1200/data-xxy.txt
                              /logs/web/2010/0715/1300/data-xxx.txt
                              /logs/web/2010/0715/1300/data-xxy.txt
                              /logs/web/2010/0715/1400/data-xxx.txt
Use Case: Near Realtime Aggregator
Conclusion
Flume is
● Distributed data collection service
 
● Suitable for enterprise setting
 
● Large amount of log data to process
Q&A
Questions to be unveiled?
 
 
References
●   http://www.cloudera.
    com/resource/chicago_data_summit_flume_an_introduction_jonathan_hsie
    h_hadoop_log_processing/
●   http://www.slideshare.net/cloudera/inside-flume
●   http://www.slideshare.net/cloudera/flume-intro100715
●   http://www.slideshare.net/cloudera/flume-austin-hug-21711

More Related Content

What's hot

Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Simplilearn
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Simplilearn
 

What's hot (20)

Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Apache sqoop with an use case
Apache sqoop with an use caseApache sqoop with an use case
Apache sqoop with an use case
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Apache Hadoop and HBase
Apache Hadoop and HBaseApache Hadoop and HBase
Apache Hadoop and HBase
 
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | EdurekaWhat Is Hadoop | Hadoop Tutorial For Beginners | Edureka
What Is Hadoop | Hadoop Tutorial For Beginners | Edureka
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 

Similar to Apache Flume

Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News Aggregator
Mário Almeida
 

Similar to Apache Flume (20)

Big data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and SqoopBig data components - Introduction to Flume, Pig and Sqoop
Big data components - Introduction to Flume, Pig and Sqoop
 
Hadoop project design and a usecase
Hadoop project design and  a usecaseHadoop project design and  a usecase
Hadoop project design and a usecase
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with Flume
 
Near real-time anomaly detection at Lyft
Near real-time anomaly detection at LyftNear real-time anomaly detection at Lyft
Near real-time anomaly detection at Lyft
 
Monitoring.pptx
Monitoring.pptxMonitoring.pptx
Monitoring.pptx
 
Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News Aggregator
 
FluentD for end to end monitoring
FluentD for end to end monitoringFluentD for end to end monitoring
FluentD for end to end monitoring
 
Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)Prometheus and Docker (Docker Galway, November 2015)
Prometheus and Docker (Docker Galway, November 2015)
 
Do you know what your Drupal is doing_ Observe it!
Do you know what your Drupal is doing_ Observe it!Do you know what your Drupal is doing_ Observe it!
Do you know what your Drupal is doing_ Observe it!
 
OpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo ProjectOpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo Project
 
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
Building a data pipeline to ingest data into Hadoop in minutes using Streamse...
 
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
Greenplum for Internet Scale Analytics and Mining - Greenplum Summit 2018
 
Data Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache FlumeData Aggregation At Scale Using Apache Flume
Data Aggregation At Scale Using Apache Flume
 
NodeJS
NodeJSNodeJS
NodeJS
 
Airflow Intro-1.pdf
Airflow Intro-1.pdfAirflow Intro-1.pdf
Airflow Intro-1.pdf
 
Graphs, parallelism and business cases
 Graphs, parallelism and business cases Graphs, parallelism and business cases
Graphs, parallelism and business cases
 
Graphs, parallelism and business cases
Graphs, parallelism and business casesGraphs, parallelism and business cases
Graphs, parallelism and business cases
 
The Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedInThe Big Data Analytics Ecosystem at LinkedIn
The Big Data Analytics Ecosystem at LinkedIn
 
Empowering Real-Time Decision Making with Data Streaming
Empowering Real-Time Decision Making with Data StreamingEmpowering Real-Time Decision Making with Data Streaming
Empowering Real-Time Decision Making with Data Streaming
 
Setting up a big data platform at kelkoo
Setting up a big data platform at kelkooSetting up a big data platform at kelkoo
Setting up a big data platform at kelkoo
 

More from Arinto Murdopo

More from Arinto Murdopo (20)

Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data Streams
 
Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data Streams
 
Next Generation Hadoop: High Availability for YARN
Next Generation Hadoop: High Availability for YARN Next Generation Hadoop: High Availability for YARN
Next Generation Hadoop: High Availability for YARN
 
High Availability in YARN
High Availability in YARNHigh Availability in YARN
High Availability in YARN
 
Distributed Computing - What, why, how..
Distributed Computing - What, why, how..Distributed Computing - What, why, how..
Distributed Computing - What, why, how..
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...
 
Quantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slideQuantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slide
 
Quantum Cryptography and Possible Attacks
Quantum Cryptography and Possible AttacksQuantum Cryptography and Possible Attacks
Quantum Cryptography and Possible Attacks
 
Parallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPIParallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPI
 
Dremel Paper Review
Dremel Paper ReviewDremel Paper Review
Dremel Paper Review
 
Megastore - ID2220 Presentation
Megastore - ID2220 PresentationMegastore - ID2220 Presentation
Megastore - ID2220 Presentation
 
Flume Event Scalability
Flume Event ScalabilityFlume Event Scalability
Flume Event Scalability
 
Large Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - SlideLarge Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - Slide
 
Large-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing SystemsLarge-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing Systems
 
Rise of Network Virtualization
Rise of Network VirtualizationRise of Network Virtualization
Rise of Network Virtualization
 
Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services
 
Architecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity FabricArchitecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity Fabric
 
Consistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System Design
 
Distributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer ComputingDistributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer Computing
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 

Recently uploaded (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Apache Flume

  • 1. Arinto Murdopo Josep Subirats Group 4 EEDC 2012
  • 2. Outline ● Current problem ● What is Apache Flume? ● The Flume Model ○ Flows and Nodes ○ Agent, Processor and Collector Nodes ○ Data and Control Path ● Flume goals ○ Reliability ○ Scalability ○ Extensibility ○ Manageability ● Use case: Near Realtime Aggregator  
  • 3. Current Problem ● Situation: You have hundreds of services running in different servers that produce lots of large logs which should be analyzed altogether. You have Hadoop to process them.   ● Problem: How do I send all my logs to a place that has Hadoop? I need a reliable, scalable, extensible and manageable way to do it!
  • 4. What is Apache Flume? ● It is a distributed data collection service that gets flows of data (like logs) from their source and aggregates them to where they have to be processed. ● Goals: reliability, scalability, extensibility, manageability. Exactly what I needed!
  • 5. The Flume Model: Flows and Nodes ● A flow corresponds to a type of data source (server logs, machine monitoring metrics...). ● Flows are comprised of nodes chained together (see slide 7).
  • 6. The Flume Model: Flows and Nodes ● In a Node, data come in through a source... ...are optionally processed by one or more decorators... ...and then are transmitted out via a sink.   Examples: Console, Exec, Syslog, IRC, Twitter, other nodes...   Examples: Console, local files, HDFS, S3, other nodes...   Examples: wire batching, compression, sampling, projection, extraction...
  • 7. The Flume Model: Agent, Processor and Collector Nodes ● Agent: receives data from an application.   ● Processor (optional): intermediate processing.   ● Collector: write data to permanent storage.
  • 8. The Flume Model: Data and Control Path (1/2) Nodes are in the data path.
  • 9. The Flume Model: Data and Control Path (2/2) Masters are in the control path. ● Centralized point of configuration. Multiple: ZK. ● Specify sources, sinks and control data flows.
  • 10. Flume Goals: Reliability Tunable Failure Recovery Modes   ● Best Effort   ● Store on Failure and Retry   ● End to End Reliability
  • 11. Flume Goals: Scalability Horizontally Scalable Data Path Load Balancing
  • 12. Flume Goals: Scalability Horizontally Scalable Control Path
  • 13. Flume Goals: Extensibility ● Simple Source and Sink API ○ Event streaming and composition of simple operation   ● Plug in Architecture ○ Add your own sources, sinks, decorators    
  • 14. Flume Goals: Manageability Centralized Data Flow Management Interface  
  • 15. Flume Goals: Manageability Configuring Flume     Node: tail(“file”) | filter [ console, roll (1000) { dfs(“hdfs://namenode/user/flume”) } ] ; Output Bucketing   /logs/web/2010/0715/1200/data-xxx.txt /logs/web/2010/0715/1200/data-xxy.txt /logs/web/2010/0715/1300/data-xxx.txt   /logs/web/2010/0715/1300/data-xxy.txt /logs/web/2010/0715/1400/data-xxx.txt
  • 16. Use Case: Near Realtime Aggregator
  • 17. Conclusion Flume is ● Distributed data collection service   ● Suitable for enterprise setting   ● Large amount of log data to process
  • 18. Q&A Questions to be unveiled?    
  • 19. References ● http://www.cloudera. com/resource/chicago_data_summit_flume_an_introduction_jonathan_hsie h_hadoop_log_processing/ ● http://www.slideshare.net/cloudera/inside-flume ● http://www.slideshare.net/cloudera/flume-intro100715 ● http://www.slideshare.net/cloudera/flume-austin-hug-21711