Flume is an open-source distributed system for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple, extensible data model and can ingest data from many different sources, process it using customizable handlers, and deliver the aggregated data to destinations such as HDFS, HBase, and Elasticsearch. Flume continues to provide reliable service in the presence of failures, recovering from errors and handling failover.
Flume is a distributed system for reliably collecting logs from different sources and transporting them to centralized data stores. It provides mechanisms for handling failures and scaling to large volumes of log data. Key components include agents that collect logs from various sources, collectors that receive logs from agents and pass them to storage, and masters that manage the configuration. Flume uses a decentralized control plane and provides tunable reliability levels to ensure logs are not lost during failures. It can scale horizontally by adding more agents, collectors, and masters as needed to handle increasing data volumes.
Chicago Data Summit: Flume: An Introduction - Cloudera, Inc.
Flume is an open-source, distributed, streaming log collection system designed for ingesting large quantities of data into large-scale data storage and analytics platforms such as Apache Hadoop. It was designed with four goals in mind: reliability, scalability, extensibility, and manageability. Its horizontally scalable architecture offers fault-tolerant end-to-end delivery guarantees, supports low-latency event processing, provides a centralized management interface, and exposes metrics for ingest monitoring and reporting. It natively supports writing data to Hadoop's HDFS but also has a simple extension interface that allows it to write to other scalable data systems such as low-latency datastores or incremental search indexers.
Apache Flume is a distributed system for efficiently collecting large streams of log data into Hadoop. It has a simple architecture based on streaming data flows between sources, sinks, and channels. An agent contains a source that collects data, a channel that buffers the data, and a sink that stores it. This document demonstrates how to install Flume, configure it to collect tweets from Twitter using the Twitter streaming API, and save the tweets to HDFS.
Apache Flume is a tool for collecting large amounts of streaming data from various sources and transporting it to a centralized data store like HDFS. It reliably delivers events from multiple data sources to destinations such as HDFS or HBase. Flume uses a simple and flexible architecture based on streaming data flows, with reliable delivery of events guaranteed through a system of agents, channels, and sinks.
This document provides an overview of Apache Flume, a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. It describes the core concepts in Flume including events, clients, agents, sources, channels, and sinks. Sources are components that read data and pass it to channels. Channels buffer events and sinks remove events from channels and transmit them to their destination. The document discusses commonly used source, channel and sink types and provides examples of Flume flows.
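As a minimal sketch of how these pieces are wired together, the following single-agent configuration (in Flume NG's properties format, closely following the canonical example in the Flume user guide) connects a netcat source to a logger sink through a memory channel; the agent name a1 and the port are arbitrary placeholders:

a1.sources = r1
a1.channels = c1
a1.sinks = k1
# netcat source: listens on a TCP port and turns each line of text into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# memory channel: buffers events between the source and the sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# logger sink: writes events to the agent's log, useful for smoke testing
a1.sinks.k1.type = logger
# wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1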
This document outlines Apache Flume, a distributed system for collecting large amounts of log data from various sources and transporting it to a centralized data store such as Hadoop. It describes the key components of Flume including agents, sources, sinks and flows. It explains how Flume provides reliable, scalable, extensible and manageable log aggregation capabilities through its node-based architecture and horizontal scalability. An example use case of using Flume for near real-time log aggregation is also briefly mentioned.
Apache Flume is a system for collecting streaming data from various sources and transporting it to destinations such as HDFS or HBase. It has a configurable architecture that allows data to flow from clients to sinks via channels. Sources produce events that are sent to channels, which then deliver the events to sinks. Flume agents run sources and sinks in a configurable topology to provide reliable data transport.
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014 - Steve Hoffman
Apache Flume is a distributed system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store such as Hadoop Distributed File System (HDFS). It consists of agents that collect data from sources and deliver it to sinks using channels. Common sources include log files, Kafka streams, and Avro clients. Common sinks include HDFS, HBase, Elasticsearch, and Kafka. Flume provides reliable and available service for efficiently collecting and moving large amounts of log data.
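As one hedged illustration of such a pairing, the fragment below reads from a Kafka topic and writes to HDFS using the Kafka source and HDFS sink that ship with recent Flume 1.x releases; broker addresses, the topic name, local directories, and the HDFS path are placeholders, and property names should be verified against the user guide for your Flume version:

a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Kafka source: consumes events from a Kafka topic
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.kafka.bootstrap.servers = broker1:9092,broker2:9092
a1.sources.r1.kafka.topics = app-logs
# file channel: durable buffering on local disk
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/lib/flume/checkpoint
a1.channels.c1.dataDirs = /var/lib/flume/data
# HDFS sink: writes events into time-partitioned directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/app-logs/%Y/%m/%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1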
Apache Flume is a simple yet robust data collection and aggregation framework that allows easy, declarative configuration of components to pipeline data from upstream sources to backend services such as Hadoop HDFS, HBase, and others.
Flume NG is a tool for collecting and moving large amounts of log data from distributed servers to a Hadoop cluster. It uses agents that collect data through sources like netcat, store data temporarily in channels like memory, and then write data to sinks like HDFS. Flume provides reliable data transport through its use of transactions and flexible configuration through sources, channels, and sinks.
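Assuming the agent is defined in a properties file like the sketches elsewhere on this page, it is typically launched with the standard flume-ng command; the file name example.conf and the agent name a1 are placeholders that must match the prefix used in the properties file:

# start agent a1 from example.conf, logging to the console for easy debugging
bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console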
An introduction to Apache Flume that comes from Hadoop Administrator Training delivered by GetInData.
Apache Flume is a distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data. By reading these slides, you will learn about Apache Flume: its motivation, its most important features, the architecture of Flume, its reliability guarantees, agent configuration, integration with the Apache Hadoop ecosystem, and more.
The document describes using Apache Flume to collect log data from machines in a manufacturing process. It proposes setting up Flume agents on each machine that generate log files and forwarding the data to a central HDFS server. The author tests a sample Flume configuration with two virtual machines generating logs and an agent transferring the data to an HDFS directory. Next steps discussed are analyzing the log data using tools like MapReduce, Hive, and Mahout and visualizing it to improve quality control and production processes.
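A hedged sketch of that two-tier layout is shown below: a machine-side agent tails a local log file and forwards events over Avro to a collector agent that writes to HDFS. Hostnames, ports, paths, and the tail command are illustrative assumptions, and an exec/tail source by itself cannot guarantee delivery if the agent process dies:

# agent running on each manufacturing machine
machine.sources = logs
machine.channels = ch
machine.sinks = toCollector
machine.sources.logs.type = exec
machine.sources.logs.command = tail -F /var/log/machine/process.log
machine.channels.ch.type = file
machine.channels.ch.checkpointDir = /var/lib/flume/checkpoint
machine.channels.ch.dataDirs = /var/lib/flume/data
machine.sinks.toCollector.type = avro
machine.sinks.toCollector.hostname = collector01.example.com
machine.sinks.toCollector.port = 4141
machine.sources.logs.channels = ch
machine.sinks.toCollector.channel = ch

# collector agent in front of the central HDFS cluster
collector.sources = fromMachines
collector.channels = ch
collector.sinks = toHdfs
collector.sources.fromMachines.type = avro
collector.sources.fromMachines.bind = 0.0.0.0
collector.sources.fromMachines.port = 4141
collector.channels.ch.type = file
collector.channels.ch.checkpointDir = /var/lib/flume/collector-checkpoint
collector.channels.ch.dataDirs = /var/lib/flume/collector-data
collector.sinks.toHdfs.type = hdfs
collector.sinks.toHdfs.hdfs.path = hdfs://namenode:8020/manufacturing/logs/%Y/%m/%d
collector.sinks.toHdfs.hdfs.fileType = DataStream
collector.sinks.toHdfs.hdfs.useLocalTimeStamp = true
collector.sources.fromMachines.channels = ch
collector.sinks.toHdfs.channel = ch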
This document compares Apache Flume and Apache Kafka for use in data pipelines. It describes Conversant's evolution from a homegrown log collection system to using Flume and then integrating Kafka. Key points covered include how Flume and Kafka work, their capabilities for reliability, scalability, and ecosystems. The document also discusses customizing Flume for Conversant's needs, and how Conversant monitors and collects metrics from Flume and Kafka using tools like JMX, Grafana dashboards, and OpenTSDB.
This document discusses filesystems, RPC, HDFS, and I/O schedulers. It provides an overview of Linux kernel I/O schedulers and how they optimize disk access. It then discusses the I/O stack in Linux, including the virtual filesystem (VFS) layer. It describes the NFS client-server model using RPC over TCP/IP and how HDFS uses a similar model with its own APIs. Finally, it outlines the write process in HDFS from the client to data nodes.
This document introduces Flume and Flive. It summarizes that Flume is a distributed data collection system that can easily extend to new data formats and scales linearly as new nodes are added. It discusses Flume's core concepts of events, flows, nodes, and reliability features. It then introduces Flive, an enhanced version of Flume developed by Hanborq that provides improved performance, functionality, manageability, and integration with Hugetable.
Flume is a system for collecting, aggregating, and moving large amounts of streaming data into Hadoop. It has reliable, customizable components like sources that generate or collect event data, channels that buffer events, and sinks that ship events to destinations. Sources put events into channels, which decouple sources from sinks and provide reliability. Sinks remove events from channels and transmit them to their final destination. Flume ensures reliable event delivery through transactional channel operations and persistence. It also provides load balancing, failover, and contextual routing capabilities through interceptors, channel selectors, and sink processors.
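These routing and failover features are all driven from the agent configuration. As a hedged fragment (component names are placeholders, and the underlying source, channel, and sink definitions are assumed to exist elsewhere in the same file), the lines below add a timestamp interceptor to a source and group two sinks into a failover pair:

# interceptor: stamps each event with an ingest timestamp header
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp
# failover sink processor: k1 is preferred, k2 takes over if k1 fails
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000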
This document provides an overview of effective HBase health checking and troubleshooting. It discusses HBase architecture including the roles of the master, regionservers and Zookeeper. It then describes various tools and utilities for troubleshooting like the master and regionserver UIs, process logs, JMX stats, the HBase shell, HBCK and performance evaluation tools. It also covers common problems like HBase not serving data or abrupt regionserver restarts and provides steps to troubleshoot these issues.
Sqoop is a tool for transferring bulk data between Hadoop and structured datastores like relational databases. This document compares Sqoop 1 and the new Sqoop 2 architecture. Sqoop 2 addresses limitations of Sqoop 1 by providing a server-side installation, explicit connector selection, common functionality for all connectors, and role-based security for accessing external systems and managing resources. The document highlights improved ease of use, extension, and security in Sqoop 2 compared to the client-side tool design of Sqoop 1.
In this session you will learn:
1. Kafka Overview
2. Need for Kafka
3. Kafka Architecture
4. Kafka Components
5. ZooKeeper Overview
6. Leader Node
For more information, visit: https://www.mindsmapped.com/courses/big-data-hadoop/hadoop-developer-training-a-step-by-step-tutorial/
High Availability for HBase Tables - Past, Present, and Future - DataWorks Summit
This document summarizes different approaches to achieving high availability in HBase. It discusses HBase region replicas, asynchronous WAL replication, and timeline consistency introduced in HBase 1.1 to provide increased read availability. It also describes WanDisco's non-stop HBase implementation using Paxos consensus and Facebook's HydraBase which uses RAFT consensus, with HydraBase designating region replicas as active or witness. The document compares these approaches on attributes like consensus algorithm, read/write availability, strong consistency, and support for multi-datacenter deployments.
This document provides an overview of HP Shadowbase, a data replication software. It discusses Shadowbase's capabilities including replication of data across various database platforms, zero downtime migrations, management utilities, and attractive costs. It also outlines the Shadowbase product portfolio and supported platforms/databases. Finally, it describes Shadowbase's architectural components and how data is captured from source environments and delivered to target databases.
HBase Read High Availability Using Timeline Consistent Region Replicas - enissoz
This document summarizes a talk on implementing timeline consistency for HBase region replicas. It introduces the concept of region replicas, where each region has multiple copies hosted on different servers. The primary accepts writes, while secondary replicas are read-only. Reads from secondaries return possibly stale data. The talk outlines the implementation of region replicas in HBase, including updates to the master, region servers, and IPC. It discusses data replication approaches and next steps to implement write replication using the write-ahead log. The goal is to provide high availability for reads in HBase while tolerating single-server failures.
- The document summarizes the state of Apache HBase, including recent releases, compatibility between versions, and new developments.
- Key releases include HBase 1.1, 1.2, and 1.3, which added features like async RPC client, scan improvements, and date-tiered compaction. HBase 2.0 is targeting compatibility improvements and major changes to data layout and assignment.
- New developments include date-tiered compaction for time series data, Spark integration, and ongoing work on async operations, replication 2.0, and reducing garbage collection overhead.
The document discusses best practices for operating and supporting Apache HBase. It outlines tools like the HBase UI and HBCK that can be used to debug issues. The top categories of issues covered are region server stability problems, read/write performance, and inconsistencies. SmartSense is introduced as a tool that can help detect configuration issues proactively.
This document discusses using Apache Flume to collect streaming Twitter data. Flume is a distributed service that can efficiently collect, aggregate, and move large amounts of log and streaming data to Hadoop. It describes how to set up a Flume agent to capture Twitter data using keywords and send it to HDFS. The agent uses a Twitter source, memory channel, and HDFS sink. It also discusses including additional Flume dependencies and configuring Flume and Hadoop to integrate Twitter data collection and storage in HDFS.
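The agent described above is usually expressed as a configuration along the following lines; the TwitterSource class name is the one used in Cloudera's tutorial (it requires the tutorial's extra jar on the Flume classpath), and the OAuth credentials, keywords, and HDFS path are placeholders:

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
# Twitter source from the Cloudera example: streams tweets matching the keywords
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = <your consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <your consumer secret>
TwitterAgent.sources.Twitter.accessToken = <your access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, flume, bigdata
# memory channel: buffers tweets in RAM between source and sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
# HDFS sink: writes the raw tweets into date-partitioned directories
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://namenode:8020/user/flume/tweets/%Y/%m/%d
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel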
HBase and HDFS: Understanding FileSystem Usage in HBase - enissoz
This document discusses file system usage in HBase. It provides an overview of the three main file types in HBase: write-ahead logs (WALs), data files, and reference files. It describes durability semantics, IO fencing techniques for region server recovery, and how HBase leverages data locality through short circuit reads, checksums, and block placement hints. The document is intended to help readers understand HBase's interactions with HDFS for tuning IO performance.
In this session you will learn:
Flume Overview
Flume Agent
Sinks
Flume Installation
What is Netcat & Telnet?
For more information, visit: https://www.mindsmapped.com/courses/big-data-hadoop/hadoop-developer-training-a-step-by-step-tutorial/
This document provides an overview of Flume, an open source distributed system for collecting, aggregating, and moving large amounts of log data efficiently. Flume allows for reliable log collection from hundreds of services and nodes. It uses a scalable data flow model where data sources like log files are ingested from nodes as streams and delivered to destinations like HDFS. The document outlines Flume's goals of reliability, scalability, extensibility and manageability. It describes key components like agents, collectors and masters, and how the data and control planes can horizontally scale out through the addition of nodes. Reliability is achieved through tunable data acknowledgment levels.
This document provides an overview of Apache Flume and how it can be used to load streaming data into a Hadoop cluster. It describes Flume's core components like sources, channels, sinks and how they work together in an agent. It also gives examples of using a single Flume agent and multiple agents to collect web server logs. Advanced features like interceptors, fan-in/fan-out are also briefly covered along with a simple configuration example to ingest data into HDFS.
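A hedged single-agent sketch along those lines watches a directory of completed (rotated) web-server logs with a spooling directory source and ships them to HDFS; the directory, agent name, and HDFS path are assumptions, and the spooling directory source expects files to be immutable once dropped into the directory:

web.sources = access
web.channels = ch
web.sinks = toHdfs
# spooling directory source: picks up finished log files placed in the directory
web.sources.access.type = spooldir
web.sources.access.spoolDir = /var/log/httpd/completed
web.sources.access.fileHeader = true
# durable file channel
web.channels.ch.type = file
web.channels.ch.checkpointDir = /var/lib/flume/web-checkpoint
web.channels.ch.dataDirs = /var/lib/flume/web-data
# HDFS sink with roll settings tuned for fewer, larger files
web.sinks.toHdfs.type = hdfs
web.sinks.toHdfs.hdfs.path = hdfs://namenode:8020/weblogs/%Y/%m/%d
web.sinks.toHdfs.hdfs.fileType = DataStream
web.sinks.toHdfs.hdfs.rollInterval = 300
web.sinks.toHdfs.hdfs.rollSize = 134217728
web.sinks.toHdfs.hdfs.rollCount = 0
web.sinks.toHdfs.hdfs.useLocalTimeStamp = true
web.sources.access.channels = ch
web.sinks.toHdfs.channel = ch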
This document discusses Apache Flume and using its HBase sink. It provides an overview of Flume's architecture, components, and sources. It describes how the HBase sink works, its configuration options, and provides an example configuration for collecting data from a sequential source and storing it in an HBase table.
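A hedged version of such a configuration uses Flume's built-in sequence-generator source to produce test events and the synchronous HBase sink with the simple serializer; the table and column family names are placeholders and must already exist in HBase:

a1.sources = r1
a1.channels = c1
a1.sinks = k1
# seq source: emits an increasing counter, handy for exercising the pipeline
a1.sources.r1.type = seq
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# HBase sink: writes each event into the given table and column family
a1.sinks.k1.type = hbase
a1.sinks.k1.table = flume_test
a1.sinks.k1.columnFamily = cf
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1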
Hadoop security has improved with additions such as HDFS ACLs, Hive column-level ACLs, HBase cell-level ACLs, and Knox for perimeter security. Data encryption has also been enhanced, with support for encrypting data in transit using SSL and data at rest through file encryption or the upcoming native HDFS encryption. Authentication is provided by Kerberos/AD with token-based authorization, and auditing tracks who accessed what data.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It has several core components including HDFS for distributed file storage and MapReduce for distributed processing. HDFS stores data across clusters of machines with replication for fault tolerance. MapReduce allows parallel processing of large datasets in a distributed manner. Hadoop was designed with goals of using commodity hardware, easy recovery from failures, large distributed file systems, and fast processing of large datasets.
This document summarizes key aspects of the Hadoop Distributed File System (HDFS). HDFS is designed for storing very large files across commodity hardware. It uses a master/slave architecture with a single NameNode that manages file system metadata and multiple DataNodes that store application data. HDFS allows for streaming access to this distributed data and can provide higher throughput than a single high-end server by parallelizing reads across nodes.
Henry Robinson works at Cloudera on distributed data collection tools like Flume and ZooKeeper. Cloudera provides support for Hadoop and open source projects like Flume. Flume is a scalable and configurable system for collecting large amounts of log and event data into Hadoop from diverse sources. It allows defining flexible data flows that can reliably move data between collection agents and storage systems.
Gluster Webinar: Introduction to GlusterFS - GlusterFS
GlusterFS is an open source, scale-out network filesystem. It runs on commodity hardware and allows indefinite growth in capacity and performance by simply adding server nodes. Key benefits include flexibility to deploy on any hardware, linearly scalable performance, and superior storage economics compared to traditional storage solutions. GlusterFS uses a distributed hashing technique instead of a metadata server to provide high availability and reliability.
Presentation from 2013-06-27 at the Workshop on the Future of Big Data Management, discussing Hadoop for a science audience of HPC/grid users and people suddenly discovering that their data is accruing toward petabytes.
The other talks were on GPFS, LustreFS, and Ceph, so rather than just do beauty-contest slides, I decided to raise the question of "what is a filesystem?" and whether the constraints imposed by the Unix metaphor and API are becoming limits on scale and parallelism (both technically and, for GPFS and Lustre Enterprise, in cost).
Then: HDFS as the foundation for the Hadoop stack.
All the other FS talks did emphasise their Hadoop integration, with the Intel talk doing the most to assert performance improvements of LustreFS over HDFSv1 in dfsIO and Terasort (no gridmix?), which showed something important: Hadoop is the application that all DFS developers have to have a story for.
Kelkoo uses a Big Data platform including Flume, HDFS, Spark on Yarn, and Hive/SparkSQL. Flume collects log data from various sources and aggregates it into HDFS for distributed storage. HDFS uses a namenode and datanodes for high availability. Spark on Yarn enables distributed processing of the data through Spark applications running executors and tasks across Yarn containers. Hive and SparkSQL allow querying and analyzing the data.
This talk discusses the current status of Hadoop security and some exciting new security features that are coming in the next release. First, we provide an overview of current Hadoop security features across the stack, covering authentication, authorization, and auditing. Hadoop takes a “defense in depth” approach, so we discuss security at multiple layers: RPC, file system, and data processing. We provide a deep dive into the use of tokens in the security implementation. The second and larger portion of the talk covers the new security features. We discuss the motivation, use cases, and design for authorization improvements in HDFS, Hive, and HBase. For HDFS, we describe two styles of ACLs (access control lists) and the reasons for the choice we made. In the case of Hive, we compare and contrast two approaches to Hive authorization. We further show how our approach lends itself to a particular initial implementation choice that has the limitation that the Hive Server owns the data, but where an alternate, more general implementation is also possible down the road. In the case of HBase, we describe cell-level authorization. The talk will be fairly detailed, targeting a technical audience, including Hadoop contributors.
This Introduction to GlusterFS webinar introduces and reviews the GlusterFS architecture and its key functionality. Learn how GlusterFS is deployed in the datacenter, in the cloud, or between the two. We’ll also cover a brief update on GlusterFS v3.3, which is currently in beta.
OSDC 2010 | Use Distributed Filesystem as a Storage Tier by Fabrizio Manfred - NETWAYS
Storage is one of the most important parts of a data center, and the complexity of designing, building, and delivering an always-available service continues to increase every year. One of the best solutions to these problems is a distributed filesystem (DFS). This talk describes the basic architectures of DFSs and compares different free software solutions to show what makes a DFS suitable for large-scale distributed environments. We explain how to use and deploy each solution, along with its advantages, disadvantages, performance, and layout. We also introduce case studies of implementations based on OpenAFS, GlusterFS, and Hadoop, aimed at building your own cloud storage.
Big data processing using Hadoop poster presentation - Amrut Patil
This document compares implementing Hadoop infrastructure on Amazon Web Services (AWS) versus commodity hardware. It discusses setting up Hadoop clusters on both AWS Elastic Compute Cloud (EC2) instances and several retired PCs running Ubuntu. The document also provides an overview of the Hadoop architecture, including the roles of the NameNode, DataNode, JobTracker, and TaskTracker in distributed storage and processing within Hadoop.
Design a data pipeline to gather log events and transform them into queryable data with Hive DDL.
This covers Java applications using log4j and non-Java Unix applications using rsyslog.
This document provides an overview of securing Hadoop applications and clusters. It discusses authentication using Kerberos, authorization using POSIX permissions and HDFS ACLs, encrypting HDFS data at rest, and configuring secure communication between Hadoop services and clients. The principles of least privilege and separating duties are important to apply for a secure Hadoop deployment. Application code may need changes to use Kerberos authentication when accessing Hadoop services.
The document discusses using Cloudera DataFlow to address challenges with collecting, processing, and analyzing log data across many systems and devices. It provides an example use case of logging modernization to reduce costs and enable security solutions by filtering noise from logs. The presentation shows how DataFlow can extract relevant events from large volumes of raw log data and normalize the data to make security threats and anomalies easier to detect across many machines.
3. Flume: 4 months after Hadoop World 2010
Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer, Bruce Mitchener
Cloudera, Inc.
Austin Hadoop Users Group, 2/17/2011
4. Who Am I?
• Cloudera:
– Software Engineer on the Platform Team
– Flume Project Lead / Designer / Architect
• U of Washington:
– “On Leave” from PhD program
– Research in Systems and Programming Languages
• Previously:
– Computer Security, Embedded Systems
5. The basic scenario
• You have a bunch of servers generating log files.
• You figured out that your logs are valuable and you want to keep them and analyze them.
• Because of the volume of data, you’ve started using Apache Hadoop or Cloudera’s Distribution of Apache Hadoop.
• … and you’ve got some ad-hoc, hacked-together scripts that copy data from servers to HDFS.
(“It’s log, log… everyone wants a log!”)
6. Ad-hockery gets complicated
• Reliability
– Will your data still get there … if your scripts fail? … if your hardware fails? … if HDFS goes down? … if EC2 has flaked out?
• Scale
– As you add servers, will your scripts keep up with 100GB’s per day? Will you have tons of small files? Are you going to have tons of connections? Are you willing to suffer more latency to mitigate?
• Manageability
– How do you know if the script failed on machine 172? What about logs from that other system? How do you monitor and configure all the servers? Can you deal with elasticity?
• Extensibility
– Can you service custom logs? Send data to different places like HBase, Hive or incremental search indexes? Can you do near-realtime?
• Blackbox
– What happens when the guy who wrote it leaves?
7. Cloudera Flume
Flume is a framework and conduit for collecting and quickly shipping data records from many sources to one centralized place for storage and processing.
Project Principles:
• Scalability
• Reliability
• Extensibility
• Manageability
• Openness
8. : The Standard Use Case
[Diagram: many servers, each running an Agent; Agents feed a tier of Collectors, which write into HDFS. Agent tier → Collector tier.]
9. : The Standard Use Case
[Same diagram, with the Agent and Collector tiers grouped inside the Flume system boundary.]
10. : The Standard Use Case
[Same diagram, with a Flume Master added alongside the Agent and Collector tiers.]
12. Flume’s Key Abstractions
[Diagram: an Agent node and a Collector node, each with a source and a sink, plus a Master.]
• Data path and control path
• Nodes are in the data path
– Nodes have a source and a sink
– They can take different roles
• A typical topology has agent nodes and collector nodes.
• Optionally it has processor nodes.
• Masters are in the control path.
– Centralized point of configuration.
– Specify sources and sinks
– Can control flows of data between nodes
– Use one master or use many with a ZK-backed quorum
14. Can I has the codez?
node001: tail("/var/log/app/log") | autoE2ESink;
node002: tail("/var/log/app/log") | autoE2ESink;
…
node100: tail("/var/log/app/log") | autoE2ESink;

collector1: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
collector2: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
collector3: autoCollectorSource | collectorSink("hdfs://logs/app/", "applogs")
15. Outline
• What is Flume?
• Scalability
– Horizontal scalability of all nodes and masters
• Reliability
– Fault-tolerance and High availability
• Extensibility
– Unix principle, all kinds of data, all kinds of sources, all kinds of sinks
• Manageability
– Centralized management supporting dynamic reconfiguration
• Openness
– Apache v2.0 License and an active and growing community
17. : The Standard Use Case
[Recap of the standard use case diagram: server Agents → Collector tier → HDFS, all within Flume.]
18. Data path is horizontally scalable
[Diagram: multiple server Agents feed a Collector, which writes to HDFS.]
• Add collectors to increase availability and to handle more data
– Assumes a single agent will not dominate a collector
– Fewer connections to HDFS.
– Larger, more efficient writes to HDFS.
• Agents have mechanisms for machine resource tradeoffs
• Write log locally to avoid collector disk IO bottleneck and catastrophic failures
• Compression and batching (trade CPU for network; see the sketch below)
• Push computation into the event collection pipeline (balance IO, memory, and CPU resource bottlenecks)
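To make the compression/batching point concrete, decorators in the dataflow language wrap a sink and transform events on the way through. A minimal sketch, assuming the batch() and gzip decorator names and the { decorator => sink } wrapping syntax from the Flume (OG) user guide; neither appears in this deck, so treat the exact spellings as assumptions:

node001: tail("/var/log/app/log") | { batch(100) => { gzip => autoE2ESink } };

The agent trades some CPU for fewer, larger, compressed sends to its collector; a matching unbatch/decompress step would be needed on the collector side (also an assumption, omitted here).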
20. Tunable failure recovery modes
[Diagram: an Agent → Collector → HDFS pipeline shown for each mode.]
• Best effort
– Fire and forget
• Store on failure + retry
– Local acks, local errors detectable
– Failover when faults detected.
• End to end reliability
– End to end acks
– Data survives compound failures, and may be retried multiple times
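The three modes above correspond to different agent-side sinks in the dataflow language. A minimal sketch, assuming the agentBESink (best effort), agentDFOSink (store on failure + retry, i.e. disk failover), and agentE2ESink (end-to-end acks) names from the Flume (OG) user guide; the deck itself only shows autoE2ESink, so treat these exact names as assumptions:

nodeBE: tail("/var/log/app/log") | agentBESink("collector1");
nodeDFO: tail("/var/log/app/log") | agentDFOSink("collector1");
nodeE2E: tail("/var/log/app/log") | agentE2ESink("collector1");

Only the sink choice changes; the source and the rest of the flow stay the same.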
21. Load balancing
[Diagram: Agents spread across multiple Collectors.]
• Agents are logically partitioned and send to different collectors
• Use randomization to pre-specify failovers when many collectors exist
• Spread load if a collector goes down.
• Spread load if new collectors added to the system.
22. Load balancing and collector failover
[Same diagram and bullets as slide 21; this build shows agents failing over when a collector is unavailable.]
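Failover among collectors can also be spelled out explicitly rather than left to the master. A rough sketch, assuming the < primary ? backup > failover syntax from the Flume (OG) user guide (an assumption; this deck relies on autoE2ESink/autoCollectorSource, which let the master generate randomized failover chains for you):

node001: tail("/var/log/app/log") | < agentE2ESink("collector1") ? agentE2ESink("collector2") >;

If collector1 is down, the agent fails over to collector2, spreading load across the remaining collectors.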
23. Control plane is horizontally scalable
[Diagram: Nodes talking to multiple Masters backed by a ZooKeeper ensemble (ZK1, ZK2, ZK3).]
• A master controls dynamic configurations of nodes
– Uses a consensus protocol to keep state consistent
– Scales well for configuration reads
– Allows for adaptive repartitioning in the future
• Nodes can talk to any master.
• Masters can talk to an existing ZK ensemble
27. Centralized Dataflow Management Interfaces
• One place to specify node sources, sinks and data flows.
• Basic Web interface
• Flume Shell
– Command line interface
– Scriptable (see the sketch below)
• Cloudera Enterprise
– Flume Monitor App
– Graphical web interface
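For a sense of what scripting the Flume Shell looks like, here is a minimal sketch. It assumes the shell's -c connect flag and the exec config and getconfigs commands from the Flume (OG) documentation; the deck does not show them, so treat the exact command names as assumptions:

flume shell -c masterhost
exec config node001 'tail("/var/log/app/log")' 'autoE2ESink'
exec config collector1 'autoCollectorSource' 'collectorSink("hdfs://logs/app/", "applogs")'
getconfigs

Because the shell is scriptable, the same commands can be driven from provisioning scripts when new machines come online.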
28. Configuring Flume
[Diagram: tail → filter → fanout → { console, roll → hdfs }.]
Node: tail("file") | filter [ console, roll(1000) { dfs("hdfs://namenode/user/flume") } ] ;
• A concise and precise configuration language for specifying dataflows in a node.
• Dynamic updates of configurations
– Allows for live failover changes
– Allows for handling newly provisioned machines
– Allows for changing analytics
29. Output bucketing
[Diagram: Collectors writing bucketed output files into HDFS, for example:]
/logs/web/2010/0715/1200/data-xxx.txt
/logs/web/2010/0715/1200/data-xxy.txt
/logs/web/2010/0715/1300/data-xxx.txt
/logs/web/2010/0715/1300/data-xxy.txt
/logs/web/2010/0715/1400/data-xxx.txt
…
node: collectorSource | collectorSink("hdfs://namenode/logs/web/%Y/%m%d/%H00", "data")
• Automatic output file management
– Writes HDFS files into time-based bucket directories
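To make the escape expansion concrete: the %Y/%m%d/%H00 pattern in the sink path is filled in from each event's timestamp, so events land in hour-sized buckets like the file names above. For example (illustrative timestamps):

event stamped 2010-07-15 12:34 → hdfs://namenode/logs/web/2010/0715/1200/data-xxx.txt
event stamped 2010-07-15 13:05 → hdfs://namenode/logs/web/2010/0715/1300/data-xxx.txt

The "data" argument is the file prefix; the collector appends a unique suffix (the xxx/xxy in the example filenames).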
31. Flume is easy to extend
• Simple source and sink APIs
– An event streaming design
– Many simple operations compose into complex behavior
• Plug-in architecture so you can add your own sources, sinks, and decorators (see the sketch below)
[Diagram: source → fanout → decorators → sinks.]
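Once a plug-in is registered with a node, its source, sink, or decorator name becomes usable in the dataflow language just like the built-ins. A minimal sketch with a purely hypothetical plug-in sink name (myIndexSink stands in for whatever your plug-in registers), assuming the flume.plugin.classes property that Flume (OG) uses to load plug-in classes:

flume.plugin.classes = com.example.flume.MyIndexSinkPlugin   (in flume-site.xml; property name assumed)
collector1: autoCollectorSource | myIndexSink("http://search-host:8080/index");

The point is that a new sink composes with the existing sources and decorators; nothing on the agents has to change.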
32. Variety of Connectors
• Sources produce data
– Console, Exec, Syslog, Scribe, IRC, Twitter
– In the works: JMS, AMQP, pubsubhubbub/RSS/Atom
• Sinks consume data
– Console, Local files, HDFS, S3
– Contributed: Hive (Mozilla), HBase (Sematext), Cassandra (Riptano/DataStax), Voldemort, Elastic Search
– In the works: JMS, AMQP
• Decorators modify data sent to sinks (see the sketch below)
– Wire batching, compression, sampling, projection, extraction, throughput throttling
– Custom near real-time processing (Meebo)
– JRuby event modifiers (InfoChimps)
– Cryptographic extensions (Rearden)
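To show how a couple of these connectors snap together, here is a minimal sketch that receives syslog traffic, batches it, and buckets it into HDFS. It assumes the syslogTcp() source and batch() decorator names from the Flume (OG) user guide; only their categories (Syslog sources, wire batching) are named on this slide, so treat the exact names as assumptions:

syslognode: syslogTcp(5140) | { batch(100) => autoE2ESink };
collector1: autoCollectorSource | collectorSink("hdfs://namenode/logs/syslog/%Y/%m%d", "syslog")

Any of the contributed sinks (HBase, Cassandra, Elastic Search) could slot in where collectorSink appears.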
33. : Multi Datacenter
[Diagram: in each datacenter, API servers and processor servers run Agents that feed a local Collector tier; the collectors write into a central HDFS.]
34. : Multi Datacenter
[Same diagram, with a Relay inserted between the datacenter Collector tiers and the central HDFS.]
35. : Near Realtime Aggregator
[Diagram: ad servers run Agents feeding a Tracker + Collector into HDFS; quick reports go to a DB in near real time, and a Hive job later produces verified reports.]
36. An enterprise story
[Diagram: API servers on Windows and Linux run Agents feeding a Flume Collector tier, which writes into a Kerberos-secured HDFS; authentication integrates with Active Directory / LDAP.]
37. An emerging community story
[Diagram: server Agents feed a Collector whose fanout writes to HDFS (Hive and Pig queries), to HBase plus an HBase index (key lookups and range queries), and to an incremental search index (search and faceted queries).]
39. Flume is Open Source
• Apache v2.0 Open Source License
– Independent from Apache Software Foundation
• GitHub source code repository
– http://github.com/cloudera/flume
– Regular tarball update versions every 2-3 months.
– Regular CDH packaging updates every 3-4 months.
• Review Board for code review
• New external committers wanted!
– Cloudera folks: Jonathan Hsieh, Henry Robinson, Patrick Hunt, Eric Sammer
– Independent folks: Bruce Mitchener
40. Growing user and developer community
• History:
– Initial Open Source Release, June 2010
• Growth:
– Pre-Hadoop Summit (Late June 2010):
• 4 followers, 4 forks (original authors)
– Pre-Hadoop World (October 2010):
• 174 followers, 34 forks
– Pre-CDH3B4 Release (February 2011):
• 288 followers, 51 forks
41. Support
• Community-based mailing lists for support
– “an answer in a few days”
– User: https://groups.google.com/a/cloudera.org/group/flume-user
– Dev: https://groups.google.com/a/cloudera.org/group/flume-dev
• Community-based IRC chat room
– “quick questions, quick answers”
– #flume in irc.freenode.net
• Commercial support with Cloudera Enterprise subscription
– Chat with sales@cloudera.com
43. Summary
• Flume is a distributed, reliable, scalable, extensible system for collecting and delivering high-volume continuous event data such as logs.
– It is centrally managed, which allows for automated and adaptive configurations.
– This design allows for near-real-time processing.
– Apache v2.0 License with an active and growing community
• Part of Cloudera’s Distribution for Hadoop, about to be refreshed for CDH3b4.
44. Questions? (and shameless plugs)
• Contact info:
– jon@cloudera.com
– Twitter @jmhsieh
• Cloudera Training in Dallas
– Hadoop Training for Developers - March 14-16
– Hadoop Training for Administrators - March 17-18
– Sign up at http://cloudera.eventbrite.com
– 10% discount code for classes: "hug"
• Cloudera is Hiring!