This document provides an overview of large scale data ingestion using Apache Flume. It discusses why event streaming with Flume is useful, including its scalability, event routing capabilities, and declarative configuration. It also covers Flume concepts like sources, channels, sinks, and how they connect agents together reliably in a topology. The document dives into specific source, channel, and sink types including examples and configuration details. It also discusses interceptors, channel selectors, sink processors, and ways to integrate Flume into applications using client SDKs and embedded agents.
This is the talk I gave at the Big Data Meetup in Seattle in March. In this talk, I discuss the fundamentals of Spark Streaming and Flume, and how they integrate with each other.
Deploying Apache Flume to enable low-latency analytics – DataWorks Summit
The driving question behind redesigns of countless data collection architectures has often been, "How can we make the data available to our analytical systems faster?" Increasingly, the go-to solution for this data collection problem is Apache Flume. In this talk, architectures and techniques for designing a low-latency Flume-based data collection and delivery system to enable Hadoop-based analytics are explored. Techniques for getting the data into Flume, getting the data onto HDFS and HBase, and making the data available as quickly as possible are discussed. Best practices for scaling up collection, addressing de-duplication, and utilizing a combined streaming/batch model are described in the context of Flume and Hadoop ecosystem components.
Apache Flume is a simple yet robust data collection and aggregation framework which allows easy declarative configuration of components to pipeline data from upstream source to backend services such as Hadoop HDFS, HBase and others.
Whether you are developing a greenfield data project or migrating a legacy system, there are many critical design decisions to be made. Often, it is advantageous to not only consider immediate requirements, but also the future requirements and technologies you may want to support. Your project may start out supporting batch analytics with the vision of adding realtime support. Or your data pipeline may feed data to one technology today, but tomorrow an entirely new system needs to be integrated. Apache Kafka can help decouple these decisions and provide a flexible core to your data architecture. This talk will show how building Kafka into your pipeline can provide the flexibility to experiment, evolve and grow. It will also cover a brief overview of Kafka, its architecture, and terminology.
Large scale near real-time log indexing with Flume and SolrCloud – DataWorks Summit
Apache Flume's extensible architecture allows Cisco to stream system and application logs from worldwide production data centers to a central Hadoop cluster and Solr. This architecture enables a new level of scalable indexing so that a larger volume of logs is searchable within seconds. Using Solr 4.0's near real time features together with Hadoop, we can execute mission critical tasks much quicker, improving our ability to meet tight SLAs. At the same time, using the same infrastructure, we can perform large-scale historical analysis and pattern extraction to help further improve our services. This talk will explore our infrastructure and decisions we've made to meet key requirements, i.e., high indexing load, high availability and disaster recovery. We will further explore other uses of Flume and SolrCloud within Cisco including dynamic event routing, parsing and multi-tenancy.
Many architectures include both real-time and batch processing components. This often results in two separate pipelines performing similar tasks, which can be challenging to maintain and operate. We'll show how a single, well designed ingest pipeline can be used for both real-time and batch processing, making the desired architecture feasible for scalable production use cases.
Apache Kafka is becoming the message bus used to transfer huge volumes of data from various sources into Hadoop.
It is also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices for deploying Apache Kafka
in production: how to secure a Kafka cluster, how to pick topic partitions, upgrading to newer versions, and migrating to the new Kafka producer and consumer APIs.
We will also talk about the best practices involved in running producers and consumers.
In the Kafka 0.9 release, SSL wire encryption, SASL/Kerberos user authentication, and pluggable authorization were added. Kafka now allows authentication of users and access control on who can read and write to a Kafka topic. Apache Ranger also uses a pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase an open-sourced Kafka REST API and an Admin UI that help users create topics, reassign partitions, issue Kafka ACLs, and monitor consumer offsets.
AMIS SIG - Introducing Apache Kafka - Scalable, reliable Event Bus & Message ... – Lucas Jellema
Introduction to Apache Kafka - the open source platform for real-time message queuing and reliable, scalable, distributed event handling and high-volume pub/sub implementation.
see GitHub https://github.com/MaartenSmeets/kafka-workshop for the workshop resources.
Reducing Microservice Complexity with Kafka and Reactive Streams – jimriecken
My talk from ScalaDays 2016 in New York on May 11, 2016:
Transitioning from a monolithic application to a set of microservices can help increase performance and scalability, but it can also drastically increase complexity. Layers of inter-service network calls add latency and an increasing risk of failure where previously only local function calls existed. In this talk, I'll speak about how to tame this complexity using Apache Kafka and Reactive Streams to:
- Extract non-critical processing from the critical path of your application to reduce request latency
- Provide back-pressure to handle both slow and fast producers/consumers
- Maintain high availability, high performance, and reliable messaging
- Evolve message payloads while maintaining backwards and forwards compatibility.
Nozomi from Yahoo! Japan gave a presentation on how Yahoo! Japan uses Apache Pulsar to build their internal messaging platform for processing tens of billions of messages every day. He explains why Yahoo! Japan chose Pulsar, what the use cases of Apache Pulsar are, and their best practices.
#PulsarBeijingMeetup
Emerging technologies/frameworks in Big Data – Rahul Jain
A short overview presentation on emerging technologies/frameworks in Big Data covering Apache Parquet, Apache Flink, and Apache Drill, with basic concepts of Columnar Storage and Dremel.
Apache Kafka lies at the heart of the largest data pipelines, handling trillions of messages and petabytes of data every day. Learn the right approach for getting the most out of Kafka from the experts at LinkedIn and Confluent. Todd Palino and Gwen Shapira demonstrate how to monitor, optimize, and troubleshoot performance of your data pipelines—from producer to consumer, development to production—as they explore some of the common problems that Kafka developers and administrators encounter when they take Apache Kafka from a proof of concept to production usage. Too often, systems are overprovisioned and underutilized and still have trouble meeting reasonable performance agreements.
Topics include:
- What latencies and throughputs you should expect from Kafka
- How to select hardware and size components
- What you should be monitoring
- Design patterns and antipatterns for client applications
- How to go about diagnosing performance bottlenecks
- Which configurations to examine and which ones to avoid
If you want to stay up to date, subscribe to our newsletter here: https://bit.ly/3tiw1I8
An introduction to Apache Flume that comes from Hadoop Administrator Training delivered by GetInData.
Apache Flume is a distributed, reliable, and available service for collecting, aggregating, and moving large amounts of log data. By reading these slides, you will learn about Apache Flume, its motivation, the most important features, architecture of Flume, its reliability guarantees, Agent's configuration, integration with the Apache Hadoop Ecosystem and more.
"Analyzing Twitter Data with Hadoop - Live Demo", presented at Oracle Open World 2014. The repository for the slides is in https://github.com/cloudera/cdh-twitter-example
Building an ETL pipeline for Elasticsearch using Spark – Itai Yaffe
How we, at eXelate, built an ETL pipeline for Elasticsearch using Spark, including :
* Processing the data using Spark.
* Indexing the processed data directly into Elasticsearch using the elasticsearch-hadoop plug-in for Spark.
* Managing the flow using some of the services provided by AWS (EMR, Data Pipeline, etc.).
The presentation includes some tips and discusses some of the pitfalls we encountered while setting up this process.
Real Time Data Processing using Spark Streaming | Data Day Texas 2015 – Cloudera, Inc.
Speaker: Hari Shreedharan
Data Day Texas 2015
Apache Spark has emerged over the past year as the imminent successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still being able to spill to disk if required. Spark's powerful, yet flexible API allows users to write complex applications very easily without worrying about the internal workings and how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, Amazon Kinesis, etc., allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API and its integration with Flume, Kafka and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application execution will be presented, which can help in understanding good practices for writing such an application. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.
Introduction to Akka 2. Explains what Akka's actors are all about and how to utilize them to write scalable and fault-tolerant systems.
Talk given at JavaZone 2012.
Combine Apache Hadoop and Elasticsearch to Get the Most of Your Big Data – Hortonworks
Hadoop is a great platform for storing and processing massive amounts of data. Elasticsearch is the ideal solution for Searching and Visualizing the same data. Join us to learn how you can leverage the full power of both platforms to maximize the value of your Big Data.
In this webinar we'll walk you through:
How Elasticsearch fits in the Modern Data Architecture.
A demo of Elasticsearch and Hortonworks Data Platform.
Best practices for combining Elasticsearch and Hortonworks Data Platform to extract maximum insights from your data.
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ... – Spark Summit
The demand for stream processing is increasing a lot these days. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, Internet of things, system monitoring, and many other examples.
A number of powerful, easy-to-use open source platforms have emerged to address this. But the same problem can be solved differently, various but sometimes overlapping use-cases can be targeted or different vocabularies for similar concepts can be used. This may lead to confusion, longer development time or costly wrong decisions.
Everyone in the Scala world is using or looking into using Akka for low-latency, scalable, distributed or concurrent systems. I'd like to share my story of developing and productionizing multiple Akka apps, including low-latency ingestion and real-time processing systems, and Spark-based applications.
When does one use actors vs futures?
Can we use Akka with, or in place of, Storm?
How did we set up instrumentation and monitoring in production?
How does one use VisualVM to debug Akka apps in production?
What happens if the mailbox gets full?
What is our Akka stack like?
I will share best practices for building Akka and Scala apps, pitfalls and things we'd like to avoid, and a vision of where we would like to go for ideal Akka monitoring, instrumentation, and debugging facilities. Plus backpressure and at-least-once processing.
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E... – Spark Summit
Elasticsearch provides native integration with Apache Spark through ES-Hadoop. However, especially during development, it is at best cumbersome to have Elasticsearch running in a separate machine/instance. Leveraging Spark Cluster with Elasticsearch Inside it is possible to run an embedded instance of Elasticsearch in the driver node of a Spark Cluster. This opens up new opportunities to develop cutting-edge applications. One such application is Dataset Search.
Oscar will give a demo of a Dataset Search Engine built on Spark Cluster with Elasticsearch Inside. Motivation is that once Elasticsearch is running on Spark it becomes possible and interesting to have the Elasticsearch in-memory instance join an (existing) Elasticsearch cluster. And this in turn enables indexing of Datasets that are processed as part of Data Pipelines running on Spark. Dataset Search and Data Management are R&D topics that should be of interest to Spark Summit East attendees who are looking for a way to organize their Data Lake and make it searchable.
Gemini Mobile Technologies ("Gemini") released a Real-Time Log Processing System based on Flume and Cassandra ("Flume-Cassandra Log Processor") as open source. The Flume-Cassandra Log Processor enables massive volumes of production system logs to be collected and processed into graphical reports, in real-time. In addition, logs from multiple data centers can be simultaneously aggregated and analyzed in a single database.
Realtime Analytical Query Processing and Predictive Model Building on High Di... – Spark Summit
Spark SQL and Mllib are optimized for running feature extraction and machine learning algorithms on row based columnar datasets through full scan but does not provide constructs for column indexing and time series analysis. For dealing with document datasets with timestamps where the features are represented as variable number of columns in each document and use-cases demand searching over columns and time to retrieve documents to generate learning models in realtime, a close integration within Spark and Lucene was needed. We introduced LuceneDAO in Spark Summit Europe 2016 to build distributed lucene shards from data frame but the time series attributes were not part of the data model. In this talk we present our extension to LuceneDAO to maintain time stamps with document-term view for search and allow time filters. Lucene shards maintain the time aware document-term view for search and vector space representation for machine learning pipelines. We used Spark as our distributed query processing engine where each query is represented as boolean combination over terms with filters on time. LuceneDAO is used to load the shards to Spark executors and power sub-second distributed document retrieval for the queries.
Our synchronous API uses Spark-as-a-Service to power analytical queries while our asynchronous API uses Kafka, Spark Streaming and HBase to power time series prediction algorithms. In this talk we will demonstrate LuceneDAO write and read performance on millions of documents with 1M+ terms and configurable time stamp aggregate columns. We will demonstrate the latency of APIs on a suite of queries generated from terms. Key takeaways from the talk will be a thorough understanding of how to make Lucene powered time aware search a first class citizen in Spark to build interactive analytical query processing and time series prediction algorithms.
Part of the core Hadoop project, YARN is the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch processing to handle data stored in a single platform, unlocking an entirely new approach to analytics. It is the foundation of the new generation of Hadoop and is enabling organizations everywhere to realize a Modern Data Architecture.
Design a data pipeline to gather log events and transform them into queryable data with Hive DDL.
This covers Java applications using log4j and non-Java Unix applications using rsyslog.
Introduction to the management of data persistence in FIWARE and the different approaches adopted by the FIWARE Community: what a time series database is, and what the differences between the adopted solutions are.
First slide
1) Apache Flume is a distributed, available service that can collect and move large amounts of streaming data from one location to another.
2) Most frequently, it delivers the log data into HDFS.
Second slide
1) Event and Client are the logical components of Flume.
2) An Event is a singular unit of data that can be transported by Flume NG from its source to its destination.
3) Typically an Event is composed of zero or more headers and a body. The headers are used for contextual routing: by using the header values, we can route the data to the next eligible destination.
4) A Client is an Event generator. It generates events and sends them to one or more agents.
E.g., Apache web servers, which continuously generate huge amounts of log data.
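As a hedged sketch of such header-based routing, Flume's multiplexing channel selector can route on a header value (the header name datacenter and the channel names here are illustrative, not from the original slides):
a1.sources.s1.selector.type = multiplexing
a1.sources.s1.selector.header = datacenter
a1.sources.s1.selector.mapping.us-east = ch1
a1.sources.s1.selector.mapping.eu-west = ch2
a1.sources.s1.selector.default = ch1
Events whose datacenter header matches a mapping go to the corresponding channel; everything else falls through to the default.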
Third slide
1) A Flume agent is a JVM daemon process that holds all the Flume NG components: sources, channels, sinks, etc.
2) The source sends events to the channel, the channel stores them, and later the sink drains the events from the channel.
Fourth slide
1) A Source is an active component which receives data from different locations and places it on one or more channels.
2) The declaration of a source component in the ".conf" file of agent "a1" is listed here; s1 is the source component and a1 is the agent.
a1.sources=s1
a1.sources.s1.type=netcat (netcat is one of the source types)
3) Different source types are available: pollable (auto-generating, like the 'tail -F' command or a sequence generator), event-driven, and netcat.
4) We can even write our own source type and specify that custom class name in the source's type parameter.
Fifth slide
1) A channel is a bridge between a source and a sink.
2) The channel stores the source's events and hands them to the sink.
3) There are three different channel types: the memory channel, which is very fast but offers no guarantee against data loss; the file channel, which stores the events on the file system before sending them to the sink; and the database (JDBC) channel, which stores the events in a database.
4) A single channel can be connected to any number of sources and sinks.
Sixth slide
1) A sink receives events from one channel only.
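Putting these pieces together, a minimal end-to-end agent configuration consistent with the notes above might look like this (names, bind address, and port are illustrative):
a1.sources = s1
a1.channels = ch1
a1.sinks = k1
a1.sources.s1.type = netcat
a1.sources.s1.bind = 127.0.0.1
a1.sources.s1.port = 44444
a1.sources.s1.channels = ch1
a1.channels.ch1.type = memory
a1.sinks.k1.type = logger
a1.sinks.k1.channel = ch1
The agent would then be started with something like: flume-ng agent --conf conf --conf-file example.conf --name a1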
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming – Apache Apex
Presenter: Devendra Tagare - DataTorrent Engineer, Contributor to Apex, Data Architect experienced in building high scalability big data platforms.
Apache Apex is a next generation native Hadoop big data platform. This talk will cover details about how it can be used as a powerful and versatile platform for big data.
Apache Apex is a native Hadoop data-in-motion platform. We will discuss the architectural differences between Apache Apex and Spark Streaming, and how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLAs, high throughput and large scale ingestion.
We will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. We will also discuss how these features affect time to market and total cost of ownership.
Talk presented by Aarón Fas & Andrés Viedma at the JBcnConf 2015.
'Microservices' is one of the most popular buzzwords in the industry now, but are they really a step forward? Or they might be more a problem than a solution? When are they really helpful? How should they be addressed? What challenges will we face if we decide to implement a microservices based architecture?
One year ago, Tuenti moved from a monolithic PHP backend to a Java + PHP microservices architecture. In this talk, we'll share our experiences so far: how we addressed the change, how we implemented it, why we think it's been valuable for us (and how is that related to the company culture), why it might not be a good idea for your company / application and, mostly, what lessons we have learned from this experience.
With more and more companies adopting microservices and service-oriented architectures, it becomes clear that the HTTP/RPC synchronous communication (while great) is not always the best option for every use case.
In this presentation, I discuss two approaches to an asynchronous event-based architecture. The first is a "classic" style protocol (Python services driven by callbacks with decorators communicating using a messaging layer) that we've been implementing at Demonware (Activision) for Call of Duty back-end services. The second is an actor-based approach (Scala/Akka based microservices communicating using a messaging layer and a centralized router) in place at Bench Accounting.
Both systems, while event based, take different approaches to building asynchronous, reactive applications. This talk explores the benefits, challenges, and lessons learned architecting both Actor and Non-Actor systems.
Data Stream Processing with Apache Flink – Fabian Hueske
This talk is an introduction into Stream Processing with Apache Flink. I gave this talk at the Madrid Apache Flink Meetup at February 25th, 2016.
The talk discusses Flink's features, shows its DataStream API and explains the benefits of event-time stream processing. It gives an outlook on some features that will be added after the 1.0 release.
Stream Processing is emerging as a popular paradigm for data processing architectures, because it handles the continuous nature of most data and computation and gets rid of artificial boundaries and delays. In this talk, we are going to look at some of the most common misconceptions about stream processing and debunk them.
- Myth 1: Streaming is approximate and exactly-once is not possible.
- Myth 2: Streaming is for real-time only.
- Myth 3: You need to choose between latency and throughput.
- Myth 4: Streaming is harder to learn than batch processing.
We will look at these and other myths and debunk them at the example of Apache Flink. We will discuss Apache Flink's approach to high performance stream processing with state, strong consistency, low latency, and sophisticated handling of time. With such building blocks, Apache Flink can handle classes of problems previously considered out of reach for stream processing. We also take a sneak preview at the next steps for Flink.
"Data Provenance: Principles and Why it matters for BioMedical Applications"Pinar Alper
Tutorial given at the Informatics for Health 2017 Conference. These slides are for the second part of the tutorial, describing provenance capture and management tools.
Introduction to Apache Apex and writing a big data streaming application – Apache Apex
Introduction to Apache Apex - The next generation native Hadoop platform, and writing a native Hadoop big data Apache Apex streaming application.
This talk will cover details about how Apex can be used as a powerful and versatile platform for big data. Apache Apex is being used in production by customers for both streaming and batch use cases. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch, alerts, real-time actions, threat detection, etc.
Presenter: Pramod Immaneni, Apache Apex PPMC member and senior architect at DataTorrent Inc, where he works on Apex and specializes in big data applications. Prior to DataTorrent he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in the core networking space and was granted patents in peer-to-peer VPNs. Before that he was a technical co-founder of a mobile startup where he was an architect of a dynamic content rendering engine for mobile devices.
This is a video of the webcast of an Apache Apex meetup event organized by Guru Virtues at 267 Boston Rd no. 9, North Billerica, MA, on May 7th 2016, and broadcast from San Jose, CA. If you are interested in helping organize the Apache Apex community (i.e., hosting, presenting, community leadership), please email apex-meetup@datatorrent.com.
Key Concepts
Endpoints and Addresses
Deployment Units
Mediation Support APIs
Error Handling
Interceptors
Configuration Externalization
Invoke an Asynchronous Flow
Usage of Message Files
UltraESB Clustering
Metrics and Alerting
Monitoring and Management
EMW Framework
https://www.adroitlogic.com
https://developer.adroitlogic.com
Feb 2013 HUG: Large Scale Data Ingest Using Apache Flume
1. Large Scale Data Ingest Using Apache Flume
Hari Shreedharan
Software Engineer, Cloudera
Apache Flume PMC member / committer
February 2013
2. Why event streaming with Flume is awesome
• Couldn’t I just do this with a shell script?
• What year is this, 2001? There is a better way!
• Scalable collection, aggregation of event data (i.e. logs)
• Dynamic, contextual event routing
• Low latency, high throughput
• Declarative configuration
• Productive out of the box, yet powerfully extensible
• Open source software
3. Lessons learned from Flume OG
• Hard to get predictable performance without decoupling tier impedance
• Hard to scale out without multiple threads at the sink level
• A lot of functionality doesn’t work well as a decorator
• People need a system that keeps the data flowing when there is a network partition (or downed host in the critical path)
6. Basic Concepts
• Client
• Log4j Appender
• Client SDK
• Clientless Operation
• Agent
• Source
• Channel
• Sink
• Valid Configuration
• Must have at least one Channel
• Must have at least one Source or Sink
• Any number of Sources
• Any number of Channels
• Any number of Sinks
7. Concepts in Action
• Source: Puts events into the Channel
• Sink: Drains events from the Channel
• Channel: Stores the events until drained
8. Flow Reliability
Reliability based on:
• Transactional Exchange between Agents
• Persistence Characteristics of Channels in the Flow
Also Available:
• Built-in Load Balancing Support
• Built-in Failover Support
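As a hedged sketch of the built-in load balancing support, a sink group can spread events across two sinks (group and sink names are illustrative):
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = sink1 sink2
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.selector = round_robin
Switching processor.type to failover (with per-sink priorities) gives the built-in failover behavior instead.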
9. Reliability
• Transactional guarantees from the channel
• External clients need to handle retries
• Built-in avro-client to read streams
• Avro source for multi-hop flows
• Use the Flume Client SDK for customization
12. Basic Configuration Rules
# Active components
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Define and configure src1
agent1.sources.src1.type = netcat
agent1.sources.src1.channels = ch1
agent1.sources.src1.bind = 127.0.0.1
agent1.sources.src1.port = 10112

# Define and configure sink1
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1

# Define and configure ch1
agent1.channels.ch1.type = memory

# Some other Agent's configuration
agent2.sources = src1 src2

Rules:
• Only the named agent's configuration is loaded
• Only active components' configuration is loaded within the agent's configuration
• Every Agent must have at least one channel
• Every Source must have at least one channel
• Every Sink must have exactly one channel
• Every component must have a type
13. Deployment
• Steady state: inflow == outflow
• 4 Tier 1 agents at 100 events/sec each (batch size)
• 1 Tier 2 agent at 400 events/sec
14. Source
• Event Driven
• Supports Batch Processing
• Source Types:
• AVRO – RPC source – other Flume agents can send data to this source port
• THRIFT – RPC source (available in next Flume release)
• SPOOLDIR – pick up rotated log files
• HTTP – post to a REST service (extensible)
• JMS – ingest from Java Message Service
• SYSLOGTCP, SYSLOGUDP
• NETCAT
• EXEC
15. How Does a Source Work?
• Read data from external clients/other sinks
• Stores events in configured channel(s)
• Asynchronous to the other end of channel
• Transactional semantics for storing data
21. RPC Sources – Avro and Thrift
• Reading events from external client
• Only TCP
• Connecting two agents in a distributed flow
• Based on IPC thus failure notification is enabled
• Configuration
agent_foo.sources.rpcsource-1.type = avro/thrift
agent_foo.sources.rpcsource-1.bind = <host>
agent_foo.sources.rpcsource-1.port = <port>
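To illustrate the external-client side, here is a minimal sketch using the Flume Client SDK to send one event to an Avro source (the hostname and port are placeholders for the values configured above; error handling is trimmed):
import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeRpcExample {
  public static void main(String[] args) throws EventDeliveryException {
    // Connect to the agent's Avro source (placeholder host/port)
    RpcClient client = RpcClientFactory.getDefaultInstance("agent-host.example.com", 41414);
    try {
      // Build an event from a byte payload and send it
      Event event = EventBuilder.withBody("hello flume", StandardCharsets.UTF_8);
      client.append(event); // throws EventDeliveryException on failure, so the caller can retry
    } finally {
      client.close();
    }
  }
}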
22. Spooling Directory Source
• Parses rotated log files out of a “spool” directory
• Watches for new files, renames or deletes them when done
• The files must be immutable before being placed into the
watched directory
agent.sources.spool.type = spooldir
agent.sources.spool.spoolDir = /var/log/spooled-files
agent.sources.spool.deletePolicy = never OR immediate
23. HTTP Source
• Runs a web server that handles HTTP requests
• The handler is pluggable (can roll your own)
• Out of the box, an HTTP client posts a JSON array of events to
the server. Server parses the events and puts them on the
channel.
agent.sources.http.type = http
agent.sources.http.port = 8081
24. HTTP Source, cont’d.
• Default handler supports events that look like this:
[{
  "headers" : {
    "timestamp" : "434324343",
    "host" : "host1.example.com"
  },
  "body" : "arbitrary data in body string"
},
{
  "headers" : {
    "namenode" : "nn01.example.com",
    "datanode" : "dn102.example.com"
  },
  "body" : "some other arbitrary data in body string"
}]
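As a usage sketch (not in the original deck), posting one event in this format to the HTTP source configured earlier on port 8081, using only the Java standard library:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpSourcePostExample {
  public static void main(String[] args) throws Exception {
    // One-event JSON array in the default handler’s format
    String json = "[{\"headers\":{\"host\":\"host1.example.com\"},"
                + "\"body\":\"arbitrary data in body string\"}]";
    URL url = new URL("http://localhost:8081"); // port from the config above
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setRequestProperty("Content-Type", "application/json");
    conn.setDoOutput(true);
    OutputStream out = conn.getOutputStream();
    out.write(json.getBytes("UTF-8"));
    out.close();
    System.out.println("HTTP " + conn.getResponseCode()); // 200 on success
  }
}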
25. Exec Source
• Reads data from the output of a command
• Can be used for ‘tail -F …’
• Doesn’t handle failures (e.g., if the channel is full, events are dropped)
Configuration:
agent_foo.sources.execSource.type = exec
agent_foo.sources.execSource.command = tail -F /var/log/weblog.out
26. JMS Source
• Reads messages from a JMS queue or topic, converts them to Flume events
and puts those events onto the channel.
• Pluggable converter that by default converts Bytes, Text, and Object
messages into Flume Events.
• So far, tested with ActiveMQ. We’d like to hear about experiences with any
other JMS implementations.
agent.sources.jms.type = jms
agent.sources.jms.initialContextFactory =
org.apache.activemq.jndi.ActiveMQInitialContextFactory
agent.sources.jms.providerURL = tcp://mqserver:61616
agent.sources.jms.destinationName = BUSINESS_DATA
agent.sources.jms.destinationType = QUEUE
27. Interceptor
• Applied to Source configuration element
• One source can have many interceptors
• Chain-of-responsibility
• Can be used for tagging, filtering, and routing (a config sketch follows below)
• Built-in interceptors:
• TIMESTAMP
• HOST
• STATIC
• REGEX EXTRACTOR
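For example, a minimal sketch chaining two of the built-in interceptors on a source (the i1/i2 names are arbitrary):

agent.sources.src1.interceptors = i1 i2
# TIMESTAMP: adds a "timestamp" header with the event’s arrival time
agent.sources.src1.interceptors.i1.type = timestamp
# HOST: adds a "host" header with the agent’s host name or IP
agent.sources.src1.interceptors.i2.type = host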
31. Channel
• Passive Component
• Determines the reliability of a flow
• “Stock” channels that ship with Flume
• FILE – provides durability; most people use this
• MEMORY – lower latency for small writes, but not durable
• JDBC – provides full ACID support, but has performance issues
32. File Channel
• Write Ahead Log implementation
• Configuration:
agent1.channels.ch1.type = FILE
agent1.channels.ch1.checkpointDir = <dir>
agent1.channels.ch1.dataDirs = <dir1> <dir2>…
agent1.channels.ch1.capacity = N (100k)
agent1.channels.ch1.transactionCapacity = n
agent1.channels.ch1.checkpointInterval = n (30000)
agent1.channels.ch1.maxFileSize = N (1.52G)
agent1.channels.ch1.write-timeout = n (10s)
agent1.channels.ch1.checkpoint-timeout = n (600s)
33. File Channel
Flume Event Queue
• In-memory representation of the channel
• Maintains a queue of pointers to the data on disk in various log
files; reference-counts the log files
• Memory-mapped to a checkpoint file
Log Files
• On-disk representation of actions
(Puts/Takes/Commits/Rollbacks)
• Contain the actual data
• Log files with 0 references get deleted
34. Sink
• Polling Semantics
• Supports Batch Processing
• Specialized Sinks
• HDFS (Write to HDFS – highly configurable)
• HBASE, ASYNCHBASE (Write to HBase)
• AVRO (IPC Sink – Avro Source as IPC source at next hop)
• THRIFT (IPC Sink – Thrift Source as IPC source at next hop)
• FILE_ROLL (Local disk, roll files based on size, # of events etc)
• NULL, LOGGER (For Testing Purposes)
• ElasticSearch
• IRC
35. HDFS Sink
• Writes events to HDFS (what!)
• Configuring – a minimal sketch follows (see the Flume User Guide for the full option list):
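A minimal sketch, assuming a channel named ch1 and an illustrative HDFS path; rollInterval/rollSize/rollCount are covered on the next slide:

agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.channel = ch1
agent.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events
agent.sinks.hdfs1.hdfs.filePrefix = events
agent.sinks.hdfs1.hdfs.fileType = DataStream
agent.sinks.hdfs1.hdfs.batchSize = 1000
agent.sinks.hdfs1.hdfs.rollInterval = 300
agent.sinks.hdfs1.hdfs.rollSize = 67108864
agent.sinks.hdfs1.hdfs.rollCount = 0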
36. HDFS Sink
• Supports dynamic directory naming using tags
• Use event headers: %{header}
• E.g.: hdfs://namenode/flume/%{header}
• Time-based escape sequences use the timestamp from the event header
• E.g.: hdfs://namenode/flume/%{header}/%Y-%m-%d/
• Use roundValue and roundUnit to round the timestamp down, bucketing
events into separate directories (see the sketch below)
• Within a directory, files are rolled based on:
• rollInterval – seconds before the current file is rolled
• rollSize – max size of the file
• rollCount – max # of events per file
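A sketch of time-bucketed directories with rounding; the %{topic} header is illustrative, and events need a timestamp header (e.g., from the timestamp interceptor):

agent.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/%{topic}/%Y-%m-%d/%H%M
agent.sinks.hdfs1.hdfs.round = true
agent.sinks.hdfs1.hdfs.roundValue = 10
agent.sinks.hdfs1.hdfs.roundUnit = minute

With this, the %H%M portion is rounded down so events land in 10-minute buckets under each day’s directory.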
37. AsyncHBase Sink
• Inserts events and increments into HBase
• Writes events asynchronously at a very high rate
• Easy to configure (see the sketch below):
• table
• columnFamily
• batchSize – # events per txn
• timeout – how long to wait for the success callback
• serializer/serializer.* – a custom serializer can decide how and where the events
are written out
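A minimal sketch, assuming a channel ch1 and an existing HBase table (the table and column family names are made up):

agent.sinks.hbase1.type = asynchbase
agent.sinks.hbase1.channel = ch1
agent.sinks.hbase1.table = web_events
agent.sinks.hbase1.columnFamily = d
agent.sinks.hbase1.batchSize = 100
# Milliseconds to wait for the success callback
agent.sinks.hbase1.timeout = 60000
agent.sinks.hbase1.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer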
38. IPC Sinks (Avro/Thrift)
• Sends events to the next hop’s IPC Source (configured like the Tier 1
Avro sink in the Deployment sketch above)
• Configuring:
• hostname
• port
• batch-size – # events per txn/batch sent to the next hop
• request-timeout – how long to wait for success of a batch
39. Serializers
• Supported by the HDFS, HBase, and FILE_ROLL sinks
• Converts the event into a format of the user’s choice
• In the case of HBase, converts an event into Puts and Increments
40. Sink Group
• Top-level element, needed to declare sink processors
• A sink can be in at most one group at any time
• By default, all sinks are in their own individual default sink groups
• The default sink group is a pass-through
• Deactivating a sink group does not deactivate the sink!
41. Sink Processor
• Acts as a Sink Proxy
• Can work with multiple Sinks
• Built-in Sink Processors:
• DEFAULT
• FAILOVER
• LOAD_BALANCE
• Applied via Sink Groups (see the sketch below)
• A Top-Level Component
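A minimal sketch of a load-balancing sink group over two sinks (group and sink names are arbitrary):

agent.sinkgroups = group1
agent.sinkgroups.group1.sinks = sink1 sink2
agent.sinkgroups.group1.processor.type = load_balance
agent.sinkgroups.group1.processor.selector = round_robin
# Back off from a failed sink instead of retrying it immediately
agent.sinkgroups.group1.processor.backoff = true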
43. Clients: Embedded agent
• More advanced RPC client. Integrates a channel.
• Minimal example:
properties.put("channel.type", "memory");
properties.put("channel.capacity", "200");
properties.put("sinks", "sink1");
properties.put("sink1.type", "avro");
properties.put("sink1.hostname", "collector1.example.com");
properties.put("sink1.port", "5564");
EmbeddedAgent agent = new EmbeddedAgent("myagent");
agent.configure(properties);
agent.start();
List<Event> events = new ArrayList<Event>();
events.add(event);
agent.putAll(events);
agent.stop();
• See Flume Developer Guide for more details and examples.
44. General Caveats
• Reliability = function of channel type, capacity, and system
redundancy
• Carefully size the channels for the needed capacity
• Set batch sizes based on projected drain requirements
• Provision roughly one core for every two sources and sinks
combined in an agent
46. Summary
• Clients send Events to Agents
• Each agent hosts Flume components: Source, Interceptors, Channel
Selectors, Channels, Sink Processors & Sinks
• Sources & Sinks are active components, Channels are passive
• Source accepts Events, passes them through Interceptor(s), and if not
filtered, puts them on channel(s) selected by the configured Channel
Selector
• The Sink Processor identifies a sink to invoke, which takes Events from a
Channel and sends them to its next-hop destination
• Channel operations are transactional to guarantee one-hop delivery
semantics
• Channel persistence provides end-to-end reliability
47. Reference docs (1.3.1 release)
User Guide:
flume.apache.org/FlumeUserGuide.html
Dev Guide:
flume.apache.org/FlumeDeveloperGuide.html
48. Blog posts
• Flume performance tuning
https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1
• Flume and Hbase
https://blogs.apache.org/flume/entry/streaming_data_into_apache_hbase
• File Channel Innards
https://blogs.apache.org/flume/entry/apache_flume_filechannel
• Architecture of Flume NG
https://blogs.apache.org/flume/entry/flume_ng_architecture
49. Contributing: How to get involved!
• Join the mailing lists:
• user-subscribe@flume.apache.org
• dev-subscribe@flume.apache.org
• Look at the code
• github.com/apache/flume – Mirror of the Apache Flume git repo
• File or fix a JIRA
• issues.apache.org/jira/browse/FLUME
• More on how to contribute:
• cwiki.apache.org/confluence/display/FLUME/How+to+Contribute
51. Thank you
Reach out on the mailing lists!
Follow me on Twitter: @harisr1234
Editor's Notes
If you have a server farm that emits log data in GB/min, you could hack together a very simple aggregator, but chances are it won't provide reliability, manageability, or scalability. This is why many use Flume: an out-of-the-box, open-source, high-performing, reliable, and scalable aggregator for streaming data. You don't want to risk outages, or failing scripts overloading your spindles. Flume is declarative in that you don't have to write code. Flume is extensible in that you can write your own components on top of Flume, which let you modify its behavior and feature set. Flume has one-hop delivery; if you want end-to-end reliability, use the file channel, which we'll talk about later. There are no acknowledgements from the terminal destination to the client, because the client would then be forced to hold all events until the ack is received. You want these systems to occupy a small disk footprint. Set up redundant flows if you're concerned about hardware failures; Flume doesn't support splicing or RAID out of the box.
With Flume NG, there is built-in buffering capacity at every hop, so data and events will be preserved. As for single-hop reliability, the degree of reliability is based on the channel: the memory channel and recoverable memory channel are best-effort, whereas the file channel and JDBC channel are reliable because they write to disk. OG: a garden hose connected from faucet to sprinkler; contiguous flow, except when you pinch the hose in the middle. NG: the hose connects multiple water tanks (i.e., channels/passive buffers) from faucet to sprinkler; if you pinch the hose, the flow doesn't stop. 1. Decouples impedance between producers and consumers. 2. Dynamic routing capabilities (you can shut down one tank to re-route traffic). 3. Unrestricted capacity (a consumer's input is no longer restricted by a producer's output, as one tank can feed into multiple downstream tanks).
Flume flow: the simplest individual component is the agent; agents can talk to each other and to HDFS, HBase, etc. Clients talk to agents.
Clientless operation: the agent loads up data using specialized sources. An agent is a collection of sources, channels, and sinks. A source captures events from the outside; only the exec source can generate events on its own. A channel is a buffer between source and sink. A sink has the responsibility of draining the channel out to another agent or to a terminal point like HDFS. You can't have a source with no place to write events.
In the upper diagram, the 3 agents' flow is healthy. In the lower diagram, a sink fails to communicate with its downstream source, so the reservoir fills up, and the filling cascades upstream, buffering against the downstream hardware failure. No events are lost until all channels in that flow fill up, at which point the sources report failure to the client. Steady-state flow is restored when the link becomes active again.
WHAT MAKES IT ACTIVE? Src2 is inactive because it's not in the active set. Define multiple sources for the same agent with space-separated lists. Fan out: a source writes to two channels. Multiple sinks can drain the same channel for increased throughput. A channel is implemented as a queue: the source appends data to the end of the queue and the sink drains from the head. The config file is checked at startup, and changes are checked for every 30 seconds, so you don't have to restart agents when the config file changes. What use case would need multiple sinks draining the same channel? Sources are multi-threaded and greedily implemented (for improved throughput); sinks are single-threaded and have a fixed capacity on what they can drain. There is an impedance mismatch between sources and sinks: sources will expand to accommodate load and bursty traffic so downstream won't be affected, while sinks drain steadily. Add another sink to the same channel to meet the steady-state requirement.
Four Tier 1 agents drain into one Tier 2 agent, which then distributes its load over two Tier 3 agents. You can have a single config file for all 3 agents, pass that around your deployment, and you're done. At any node, the ingest rate must equal the exit rate.
Avro is the standard. Channels support transactions. Flume sources: avro, exec, syslog, spooling directory, http, embedded agent, JMS.
Transactional semantics for storing data: if a sink takes data out, it commits only once the source on the next hop has committed its data.
Use cases: you want the same data to go into HDFS and into HBase; priority-based routing; any contextual routing.
JMS – the client talks to a broker, which handles failures.
On Avro, once the source commits the events to its channel via a put transaction, it sends a success message to the previous hop, and the sink on the previous hop deletes those events once it commits its take transaction.
Takes a command as a config parameter and executes it; each line the command writes to stdout is written out as an event to the channel. If the channel is full, data is dropped and lost. During file rotation, if an event fails, data is lost.
An interceptor is a transparent component that gets applied to the flow and can do filtering and minor modification of events, but it can't multiply events – e.g., it can't do decompression of an event, because batching and compression are framework-level concerns that Flume should address. The overall number of events emitted by an interceptor cannot exceed the number of events that came in – you can drop events but can't add them (which would go over the transaction capacity).
An interceptor never returns null, because its output is passed to the next interceptor or to the channel.
The file channel is the recommended channel: it is reliable (no data loss in an outage), scales linearly with additional spindles (more disks, better performance), and has better durability guarantees than the memory channel. The memory channel can't scale to large capacity because it is bounded by memory. JDBC is not recommended due to slow performance (don't mention deadlock).
It is recommended to use three disks: one disk for checkpointing and two disks for data. keep-alive – wait 3 seconds for the blocks to free up; usually only used in high-stress environments.
Three files: the checkpoint file (memory-mapped by the Flume Event Queue), log1, and log2. Checkpoint file = FEQ. If you lose the FEQ you don't lose data, since it's in the log files, but it takes a long time to remap the data into memory. The channel's main operations are done on top of the Flume Event Queue, which is a queue of pointers into different locations in different log files. The FEQ is the queue of active data that exists within the file channel and contains the reference counts of files. Each log file contains its own metadata – it is a write-ahead log, not a direct serialization of data. The FEQ doesn't store data, so the size of your events doesn't impact the FEQ.
Polling semantics – the sink continually polls to see if events are available. The AsyncHBase sink is recommended over the HBase sink (which uses the synchronous HBase API) for better performance. The null sink drops events on the floor.
Groups active sinks together and then adds a processor. load_balance ships with round-robin and random distribution plus back-off, but you can write your own selection algorithm and plug it into the sink processor. Failover supports back-off (it won't try a failed sink until the back-off time period is over).
The client interface exposes isActive, which can be used for testing. This is a way of getting data into Flume: the client can talk to Flume's Avro/Thrift source.