This document discusses Apache Kafka and message queuing systems. It provides an overview of Kafka, including how producers and consumers work, and details on topics, partitions, and Zookeeper. It then discusses performance, production issues, and what improvements are planned for future Kafka releases. The document also reviews the Kafka community and integrations with other technologies.
Why Micro Focus Chose Pulsar for Data Ingestion - Pulsar Summit NA 2021 (StreamNative)
Modern IT and application environments are increasingly complex, transitioning to cloud, and large in scale. The managed resources, services and applications in these environments generate tremendous data that needs to be observed, consumed and analyzed in real time (or later) by management tools to create insights and to drive operational actions and decisions.
In this talk, Srikanth Natarajan will share Micro Focus’ adoption story of Pulsar, including the experience in consuming from and contributing to Apache Pulsar, the lessons learned, and the help that Micro Focus received from a development support partner in their Pulsar journey.
Building the Next-Generation Messaging Platform on Pulsar at Intuit - Pulsar ... (StreamNative)
Intuit operates a highly distributed, multi-cluster, multi-region messaging platform to serve the queuing use-cases of its applications and services. In this session we will talk about the journey of our messaging platform in the world of Apache Pulsar and share the experiences and learnings gained so far. As we adopted Pulsar for our next generation platform and adapted it for Intuit specific requirements, we faced and solved some intrinsic challenges that we would be happy to share and get feedback. This journey has just begun and we would like to learn and absorb recommended best practices and guidelines.
The document discusses security models in Apache Kafka. It describes the PLAINTEXT, SSL, SASL_PLAINTEXT and SASL_SSL security models, covering authentication, authorization, and encryption capabilities. It also provides tips on troubleshooting security issues, including enabling debug logs, and common errors seen with Kafka security.
Bringing Real-Time to the Enterprise with Hortonworks DataFlow (DataWorks Summit)
This document discusses TELUS's journey to enable real-time streaming analytics of data from IPTV set top boxes (STBs) to improve the customer experience. It describes moving from batch processing STB log data every 12 hours to streaming the data in real-time using Apache Kafka, NiFi, and Spark. Key lessons learned include using Java 8 for SSL, Spark 2.0 for Kafka integration, and addressing security challenges in their multi-tenant Hadoop environment.
NephOS is an end-to-end cloud software stack that provides infrastructure as a service capabilities for enterprises and OEMs. It offers high performance, advanced security and reliability, and operational efficiency. Key features include self-service provisioning, scalability to millions of transactions per day, integrated automation and build systems, and flexible billing options. NephOS uses OpenStack for object storage but provides additional out-of-the-box capabilities for both virtual and dedicated servers, user interfaces, billing support, and inventory management.
Experience with adapting a WS-BPEL runtime for eScience workflows (Thilina Gunarathne)
Scientists believe in the concept of collective intelligence and are increasingly collaborating with their peers, sharing data and simulation techniques. These collaborations are made possible by building eScience infrastructures. eScience infrastructures build and assemble various scientific workflow and data management tools which provide rich end-user functionality while abstracting the complexities of the many underlying technologies. For instance, workflow systems provide a means to execute complex sequences of tasks with or without intensive user intervention, and in ways that support flexible reordering and reconfiguration of the workflow. As workflow technologies continue to emerge, the need for interoperability and standardization becomes pressing. The Web Services Business Process Execution Language (WS-BPEL) provides one such standard way of defining workflows. The WS-BPEL specification encompasses a broad range of workflow composition and description capabilities that can be applied to both abstract as well as concrete executable components.
Scientific workflows, with their agile characteristics, present significant challenges in embracing WS-BPEL for eScience purposes. In this paper we discuss the experiences in adopting a WS-BPEL runtime within an eScience infrastructure, with reference to an early implementation of a custom eScience-motivated BPEL-like workflow engine. Specifically, the paper focuses on replacing the early-adopter research system with a widely used open source WS-BPEL runtime, Apache ODE, while retaining the interoperable design to switch to any WS-BPEL-compliant workflow runtime in the future. The paper discusses the challenges encountered in extending a business-motivated workflow engine for scientific workflow executions. Further, the paper presents performance benchmarks for the developed system.
This document discusses integrating Docker containers with YARN by introducing a Docker container runtime to the LinuxContainerExecutor in YARN. The DockerContainerRuntime allows YARN to leverage Docker for container lifecycle management and supports features like resource isolation, Linux capabilities, privileged containers, users, networking and images. It remains a work in progress to support additional features around networking, users and images fully.
This document discusses Microsoft's use of Apache YARN for scale-out resource management. It describes how YARN is used to manage vast amounts of data and compute resources across many different applications and workloads. The document outlines some limitations of YARN and Microsoft's contributions to address those limitations, including Rayon for improved scheduling, Mercury and Yaq for distributed scheduling, and work on federation to scale YARN across multiple clusters. It provides details on the implementation and evaluation of these contributions through papers, JIRAs, and integration into Apache Hadoop releases.
This document provides an introduction and overview of key concepts for Apache Kafka. It discusses Kafka's architecture as a distributed streaming platform consisting of producers, brokers, consumers and topics partitioned into logs. It covers Kafka's high throughput and low latency capabilities through batching and zero-copy I/O. The document also outlines Kafka's guarantees around message ordering, delivery semantics, and how consumer groups work to partition data streams across consumer instances.
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka (Guozhang Wang)
To manage the ever-increasing volume and velocity of data within your company, you have successfully made the transition from single machines and one-off solutions to large distributed stream infrastructures in your data center, powered by Apache Kafka. But what if one data center is not enough? I will describe building resilient data pipelines with Apache Kafka that span multiple data centers and points of presence, and provide an overview of best practices and common patterns while covering key areas such as architecture guidelines, data replication, and mirroring as well as disaster scenarios and failure handling.
This session will go into best practices and detail on how to architect a near real-time application on Hadoop, using an end-to-end fraud detection case study as an example. It will discuss the various options available for ingest, schema design, processing frameworks, storage handlers and more when architecting this fraud detection application, and walk through each of the architectural decisions among those choices.
This document provides guidance on scaling Apache Kafka clusters and tuning performance. It discusses expanding Kafka clusters horizontally across inexpensive servers for increased throughput and CPU utilization. Key aspects that impact performance like disk layout, OS tuning, Java settings, broker and topic monitoring, client tuning, and anticipating problems are covered. Application performance can be improved through configuration of batch size, compression, and request handling, while consumer performance relies on partitioning, fetch settings, and avoiding perpetual rebalances.
Real time Messages at Scale with Apache Kafka and Couchbase (Will Gardella)
Kafka is a scalable, distributed publish subscribe messaging system that's used as a data transmission backbone in many data intensive digital businesses. Couchbase Server is a scalable, flexible document database that's fast, agile, and elastic. Because they both appeal to the same type of customers, Couchbase and Kafka are often used together.
This presentation from a meetup in Mountain View describes Kafka's design and why people use it, Couchbase Server and its uses, and the use cases for both together. Also covered is a description and demo of Couchbase Server writing documents to a Kafka topic and consuming messages from a Kafka topic using the Couchbase Kafka Connector.
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
Building Machine Learning inference pipelines at scale | AWS Summit Tel Aviv ... (AWS Summits)
Real-life Machine Learning (ML) workloads typically require more than training and predicting: data often needs to be pre-processed and post-processed, sometimes in multiple steps. Thus, developers and data scientists have to train and deploy not just a single algorithm, but a sequence of algorithms that will collaborate in delivering predictions from raw data. In this session, we’ll first show you how to use Apache Spark MLlib to build ML pipelines, and we’ll discuss scaling options when datasets grow huge. We’ll then show how to implement inference pipelines on Amazon SageMaker, using Apache Spark, Scikit-learn, as well as ML algorithms implemented by Amazon.
The document discusses a TechTalk webinar on hyperconverged infrastructure from Cisco Thailand that includes a live demo. It provides definitions and explanations of key concepts like hyperconvergence, software defined storage, and hyperconverged architectures. The webinar highlights benefits like agility, efficiency, simplicity and scalability and discusses how hyperconvergence is shifting the market towards server-based ecosystems.
How to Ingest 16 Billion Records Per Day into your Hadoop Environment (DataWorks Summit)
In modern society, mobile networks have become one of the most important infrastructure components. The availability of a mobile network has become essential even in areas like health care and machine-to-machine communication.
In 2016, Telefónica Germany began the Customer Experience Management (CEM) project to derive KPIs from the mobile network that describe the subscriber’s experience while using Telefónica’s mobile network. These KPIs help to plan and build a better mobile network where improvements are indicated.
Telefónica is using the Hortonworks HDF solution to ingest the 16 billion records a day generated by CEM. To get the best out of HDF’s capabilities, some customizations have been made:
1.) Custom processors have been written to comply with data privacy rules.
2.) NiFi is running in Docker containers within a Kubernetes cluster to increase the reliability of the ingestion system.
Finally, the data is presented in Hive tables and Kafka topics for further processing. In this talk, we will present the CEM use case and how it is technically implemented as stated in (1) and (2). The most interesting part for the audience should be the experience we have gained using HDF in a Docker/Kubernetes environment, since this solution is not yet officially supported.
Presented at SF Big Analytics Meetup
Online event processing applications often require the ability to ingest, store, dispatch and process events. Until now, supporting all of these needs has required a different system for each task: stream processing engines, message queuing middleware, and pub/sub messaging systems. This has led to unnecessary complexity in developing and operating such applications, raising the barrier to adoption in enterprises. In this talk, Karthik will outline the need to unify these capabilities in a single system that is easy to develop against and operate at scale. Karthik will delve into how Apache Pulsar was designed to address this need with an elegant architecture. Apache Pulsar is a next-generation distributed pub-sub system that was originally developed and deployed at Yahoo and now runs in production at more than 100 companies. Karthik will explain how the architecture and design of Pulsar provide the flexibility to support developers and applications needing any combination of queuing, messaging, streaming and lightweight compute for events. Furthermore, he will present real-life use cases of how Apache Pulsar is used for event processing, ranging from data processing tasks to web applications.
The document discusses using Apache Kafka for event detection pipelines. It describes how Kafka can be used to decouple data pipelines and ingest events from various source systems in real-time. It then provides an example use case of using Kafka, Hadoop, and machine learning for fraud detection in consumer banking, describing the online and offline workflows. Finally, it covers some of the challenges of building such a system and considerations for deploying Kafka.
The document discusses TiNA, an integrated network analyzer developed by SK Telecom to provide unified network monitoring and operation for software-defined data centers. TiNA includes systems for network packet brokering, probing, analysis, visualization, and service-centric monitoring. It provides both packet-level and flow-level network analytics using open source software and the T-CAP, an open converged network appliance developed by SKT that integrates switching and server functions. The document outlines TiNA's capabilities and provides examples of its use for traffic engineering, cloud data center multi-tenancy monitoring, and LTE network monitoring.
Lessons Learned Running Hadoop and Spark in Docker Containers (BlueData, Inc.)
Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale production environments poses interesting challenges, especially when deploying distributed big data applications like Apache Hadoop and Apache Spark. This session at Strata + Hadoop World in New York City (September 2016) explores various solutions and tips to address the challenges encountered while deploying multi-node Hadoop and Spark production workloads using Docker containers.
Some of these challenges include container life-cycle management, smart scheduling for optimal resource utilization, network configuration and security, and performance. BlueData is "all in" on Docker containers, with a specific focus on big data applications. BlueData has learned firsthand how to address these challenges for Fortune 500 enterprises and government organizations that want to deploy big data workloads using Docker.
This session by Thomas Phelan, co-founder and chief architect at BlueData, discusses how to securely network Docker containers across multiple hosts and discusses ways to achieve high availability across distributed big data applications and hosts in your data center. Since we’re talking about very large volumes of data, performance is a key factor, so Thomas shares some of the storage options implemented at BlueData to achieve near bare-metal I/O performance for Hadoop and Spark using Docker as well as lessons learned and some tips and tricks on how to Dockerize your big data applications in a reliable, scalable, and high-performance environment.
http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/detail/52042
The document discusses optimizing an Apache Pulsar deployment to handle 10 PB of data per day for a large customer. It estimates the initial cluster size needed using different storage options in Google Cloud Platform. It then describes four optimizations made - eliminating the journal, using direct I/O, compression, and improving the C++ client - and recalculates the cluster size after each optimization. The optimized deployment uses 200 VMs each with 24 local SSDs to meet the requirements.
A short tech show on how to achieve VM HA by integrating Heat, Ceilometer and Nova; and another show about deploying a cluster of VMs across multiple regions and then scaling it.
The Foundations of Multi-DC Kafka (Jakub Korab, Solutions Architect, Confluen...) (confluent)
1. The document discusses various architectures for running Kafka in a multi-datacenter environment including running Kafka natively in multiple datacenters, mirroring data between datacenters, and using hierarchical Zookeeper quorums.
2. Key considerations for multi-DC Kafka include replication settings, consumer reconfiguration needs during outages, and handling consumer offsets and processing state across datacenters.
3. Native multi-DC Kafka is preferred but mirroring can be an alternative approach for inter-region traffic when latency is over 30ms or datacenters cannot be combined into a single cluster. Asynchronous mirroring acts differently than a single Kafka cluster and impacts operations.
Big Data means big hardware, and the less of it we can use to do the job properly, the better the bottom line. Apache Kafka makes up the core of our data pipelines at many organizations, including LinkedIn, and we are on a perpetual quest to squeeze as much as we can out of our systems, from Zookeeper, to the brokers, to the various client applications. This means we need to know how well the system is running, and only then can we start turning the knobs to optimize it. In this talk, we will explore how best to monitor Kafka and its clients to assure they are working well. Then we will dive into how to get the best performance from Kafka, including how to pick hardware and the effect of a variety of configurations in both the broker and clients. We’ll also talk about setting up Kafka for no data loss.
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17 (Gwen (Chen) Shapira)
This document discusses disaster recovery strategies for Apache Kafka clusters running across multiple data centers. It outlines several failure scenarios, like an entire data center being demolished, and recommends solutions like running a single Kafka cluster across multiple nearby data centers. It then describes a "stretch cluster" approach using 3 data centers with replication between them to provide high availability. The document also discusses active-active replication between two data center clusters and challenges around consumer offsets not being identical across data centers during a failover. It recommends approaches like tracking timestamps and failing over consumers based on time.
Capital One Delivers Risk Insights in Real Time with Stream Processing (confluent)
Speakers: Ravi Dubey, Senior Manager, Software Engineering, Capital One + Jeff Sharpe, Software Engineer, Capital One
Capital One supports interactions with real-time streaming transactional data using Apache Kafka®. Kafka helps deliver information to internal operation teams and bank tellers to assist with assessing risk and protect customers in a myriad of ways.
Inside the bank, Kafka allows Capital One to build a real-time system that takes advantage of modern data and cloud technologies without exposing customers to unnecessary data breaches, or violating privacy regulations. These examples demonstrate how a streaming platform enables Capital One to act on their visions faster and in a more scalable way through the Kafka solution, helping establish Capital One as an innovator in the banking space.
Join us for this online talk on lessons learned, best practices and technical patterns of Capital One’s deployment of Apache Kafka.
-Find out how Kafka delivers on a 5-second service-level agreement (SLA) for inside branch tellers.
-Learn how to combine and host data in-memory and prevent personally identifiable information (PII) violations of in-flight transactions.
-Understand how Capital One manages Kafka Docker containers using Kubernetes.
Watch the recording: https://videos.confluent.io/watch/6e6ukQNnmASwkf9Gkdhh69?.
This document discusses Cloudera's initiative to make Spark the standard execution engine for Hadoop. It outlines how Spark improves on MapReduce by leveraging distributed memory and having a simpler developer experience. It also describes Cloudera's investments in areas like management, security, scale, and streaming to further Spark's capabilities and make it production-ready. The goal is for Spark to replace MapReduce as the execution engine and for specialized engines like Impala to handle specific workloads, with all sharing the same data, metadata, resource management, and other platform services.
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013 (Christopher Curtin)
Chris Curtin gave a presentation on Apache Kafka at the Atlanta Java Users Group. He discussed his background in technology and current role at Silverpop. He then provided an overview of Apache Kafka, describing its core functionality as a distributed publish-subscribe messaging system. Finally, he demonstrated how producers and consumers interact with Kafka and highlighted some use cases and performance figures from LinkedIn's deployment of Kafka.
Data is being generated at a feverish pace and many businesses want all of it at their disposal to solve complex strategic problems. As decision making moves to real-time, enterprises need data ready for analysis immediately. Sean Anderson and Amandeep Khurana will discuss common pipeline trends in modern streaming architectures, Hadoop components that enable streaming capabilities, and popular use cases that are enabling the world of IOT and real-time data science.
Ted Dunning is the Chief Applications Architect at MapR Technologies and a committer for Apache Drill, Zookeeper, and other projects. The document discusses goals around real-time or near-time processing and microservices. It describes how to design microservices for isolation using self-describing data, private databases, and shared storage only where necessary. Various scenarios involving fraud detection, IoT data aggregation, and global data recovery are presented. Lessons focus on decoupling services, propagating events rather than table updates, and how data architecture should reflect business structure.
This is the talk I gave at the Seattle Spark Meetup in March, 2015. I discussed some Spark Streaming fundamentals, integration points with Kafka, Flume etc.
HPC and cloud distributed computing, as a journey (Peter Clapham)
Introducing an internal cloud brings new paradigms, tools and infrastructure management. When placed alongside traditional HPC, the new opportunities are significant. But getting to the new world with micro-services, autoscaling and autodialing is a journey that cannot be achieved in a single step.
This document provides an overview and agenda for a presentation on Apache Kafka. The presentation will cover Kafka concepts and architecture, how it compares to traditional messaging systems, using Kafka with Cloudera, and a demo of installing and configuring Kafka on a Cloudera cluster. It will also discuss Kafka's role in ingestion pipelines and data integration use cases.
YARN Containerized Services: Fading The Lines Between On-Prem And Cloud (DataWorks Summit)
Apache Hadoop YARN is the modern distributed operating system for big data applications. In Apache Hadoop 3.1.0, YARN added a service framework that supports long-running services. This new capability goes hand in hand with the recent improvements in YARN to support Docker containers. Together these features have made it significantly easier to bring new applications and services to YARN.
In this talk you will learn about YARN service framework, its new containerization capabilities and how it lays the foundation for a hybrid and uniform architecture for compute and storage across on-prem and multi-cloud environments. This will include examples highlighting how easy it is to bring applications to the YARN service framework as well as how to containerize applications.
Here's what to expect in this talk:
- Motivation for YARN service framework and containerization
- YARN service framework overview
- YARN service examples
- Containerization overview
- Containerization for Big Data and non Big Data workloads - wait that's everything
The document provides an agenda and overview of a session on hacking Apache CloudStack. The agenda includes introductions, a session on introducing CloudStack, and a hands-on session with DevCloud. The overview discusses what CloudStack is, how it works as an orchestration platform for IAAS clouds, its architecture and core components, and how users can consume and manage resources through it.
This document discusses Apache Kafka and how it can be used by Oracle DBAs. It begins by explaining how Kafka builds upon the concept of a database redo log by providing a distributed commit log service. It then discusses how Kafka is a publish-subscribe messaging system and can be used to log transactions from any database, application logs, metrics and other system events. Finally, it discusses how schemas are important for Kafka since it only stores messages as bytes, and how Avro can be used to define and evolve schemas for Kafka messages.
Building Event Streaming Architectures on Scylla and Kafka (ScyllaDB)
This document discusses building event streaming architectures using Scylla and Confluent Kafka. It provides an overview of Scylla and how it can be used with Kafka at Numberly. It then discusses change data capture (CDC) in Scylla and how to stream data from Scylla to Kafka using Kafka Connect and the Scylla source connector. The Kafka Connect framework and connectors allow capturing changes from Scylla tables in Kafka topics to power downstream applications and tasks.
Apache Geode Meetup, Cork, Ireland at CIT (Apache Geode)
This document provides an introduction to Apache Geode (incubating), including:
- A brief history of Geode and why it was developed
- An overview of key Geode concepts such as regions, caching, and functions
- Examples of interesting large-scale use cases from companies like Indian Railways
- A demonstration of using Geode with Apache Spark and Spring XD for a stock prediction application
- Information on how to get involved with the Geode open source project community
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu (Jeremy Beard)
This document discusses building near-real-time analytics pipelines using Apache Spark Streaming and Apache Kudu on the Cloudera platform. It defines near-real-time analytics, describes the relevant components of the Cloudera stack (Kafka, Spark, Kudu, Impala), and how they can work together. The document then outlines the typical stages involved in implementing a Spark Streaming to Kudu pipeline, including sourcing from a queue, translating data, deriving storage records, planning mutations, and storing the data. It provides performance considerations and introduces Envelope, a Spark Streaming application on Cloudera Labs that implements these stages through configurable pipelines.
Spark Streaming & Kafka - The Future of Stream Processing by Hari Shreedharan of... (Data Con LA)
Abstract:-
With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, Kinesis, etc., Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming and how it integrates with Kafka natively with no data loss, and can even do exactly-once processing!
Bio:-
Hari Shreedharan is a PMC member and committer on the Apache Flume Project. As a PMC member, he is involved in making decisions on the direction of the project. Author of the O’Reilly book Using Flume, Hari is also a software engineer at Cloudera, where he works on Apache Flume, Apache Spark, and Apache Sqoop. He also ensures that customers can successfully deploy and manage Flume, Spark, and Sqoop on their clusters, by helping them resolve any issues they are facing.
Spark Streaming & Kafka - The Future of Stream Processing (Jack Gudenkauf)
Hari Shreedharan/Cloudera @Playtika. With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, Kinesis, etc., Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming and how it integrates with Kafka natively with no data loss, and can even do exactly-once processing!
The document discusses real-time fraud detection patterns and architectures. It provides an overview of key technologies like Kafka, Flume, and Spark Streaming used for real-time event processing. It then describes a high-level architecture involving ingesting events through Flume and Kafka into Spark Streaming for real-time processing, with results stored in HBase, HDFS, and Solr. The document also covers partitioning strategies, micro-batching, complex topologies, and ingestion of real-time and batch data.
This presentation provides an introduction to Apache Kafka and describes best practices for working with fast data streams in Kafka and MapR Streams.
The code examples used during this talk are available at github.com/iandow/design-patterns-for-fast-data.
Author:
Ian Downard
Presented at the Portland Java User Group on Tuesday, October 18 2016.
Jerry-rigged piping between systems and applications, built on an as-needed basis.
There is often an impedance mismatch between systems, and we typically deploy asynchronous processing to counter it, e.g. request-response web services for any downstream processing.
This approach is very ad hoc, and over time the set-up gets more and more complex.
Data becomes unreliable and data quality suffers.
Kafka decouples data pipelines
Central repository of data streams
Takes care of any impedance mismatch between different applications and the analytics necessary
Enables the Lambda Architecture seamlessly - tried and tested extensively at LinkedIn
This system has its limitations: code has to be written twice, once for the real-time layer and once for the batch layer
Jay Kreps, who founded Confluent, wrote a blog post questioning the Lambda Architecture and proposing the Kappa Architecture: http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
Twitter's Summingbird is a complex system that addresses these issues: https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
Producers - ** push **
Batching
Compression
Sync (Ack), Async (auto batch)
Sequential writes, guaranteed ordering within each partition
Consumers - ** pull **
No state held by broker
Consumers control reading from the stream
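As a minimal sketch of this push/pull split, assuming the modern Java client rather than the 0.8-era API the talk targets (the broker address and topic name are illustrative):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PushPullSketch {
    public static void main(String[] args) {
        // Producer: pushes records; batching and compression are client-side knobs.
        Properties p = new Properties();
        p.put("bootstrap.servers", "broker1:9092"); // illustrative address
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("batch.size", "65536");        // batch up to 64 KB per partition
        p.put("linger.ms", "10");            // wait up to 10 ms to fill a batch
        p.put("compression.type", "snappy"); // compress batches on the producer
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("events", "key", "value")); // async by default
        }

        // Consumer: pulls records; the broker holds no per-consumer read state,
        // the client controls (and commits) its own offsets.
        Properties c = new Properties();
        c.put("bootstrap.servers", "broker1:9092");
        c.put("group.id", "example-group");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("%d: %s%n", r.offset(), r.value()));
        }
    }
}
```

Note how batching, compression and acknowledgement mode all live on the producer side, while the consumer tracks its own position in the stream.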
* compression.codec - uses GZIP or Snappy for message compression from the producer to the broker
Zero copy for producers and consumers to and from the broker. Zero copy is a FileChannel function (Java NIO) that lets you avoid redundant data copies between intermediate buffers and reduces the number of context switches between user space and kernel space. http://kafka.apache.org/documentation.html#maximizingefficiency
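To make the zero-copy point concrete, a small sketch of the underlying Java NIO call; this illustrates the mechanism, not Kafka's actual broker code, and the file path and address are placeholders:

```java
import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySketch {
    public static void main(String[] args) throws Exception {
        // transferTo() delegates the copy to the kernel (sendfile on Linux),
        // skipping user-space buffers and the extra context switches that a
        // read()/write() loop would incur; this is the mechanism the Kafka
        // broker relies on to stream log segments to consumers.
        try (FileChannel file = new FileInputStream("/tmp/segment.log").getChannel(); // placeholder path
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9092))) {
            long sent = file.transferTo(0, file.size(), socket);
            System.out.println("bytes sent: " + sent);
        }
    }
}
```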
Messages stay on disk after being consumed; deletion happens on TTL expiry or via compaction. https://kafka.apache.org/documentation.html#compaction
* Partitions are sequential writes
The partition count is set higher than the number of brokers, so that leader partitions are evenly distributed across brokers, distributing the read/write load.
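As a sketch of that sizing rule, assuming a three-broker cluster and the later AdminClient API (which postdates the 0.8-era talk), a topic with more partitions than brokers could be created like this:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // illustrative address
        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions across (say) 3 brokers spreads leaders evenly;
            // replication factor 3 gives each partition a full replica set.
            NewTopic topic = new NewTopic("events", 12, (short) 3)
                .configs(Map.of(
                    "retention.ms", "604800000",   // 7-day TTL-based deletion
                    "cleanup.policy", "delete"));  // or "compact" for log compaction
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```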
When configuring a producer, set acks=-1: a message is considered successfully delivered only after ALL the ISRs have acknowledged writing the message.
Set the topic-level configuration min.insync.replicas, which specifies the number of replicas that must acknowledge a write for the write to be considered successful. If this minimum cannot be met and acks=-1, the producer will raise an exception.
Set the broker configuration parameter unclean.leader.election.enable to false. This setting essentially means you are prioritizing durability over availability, since Kafka would avoid electing a leader, and instead make the partition unavailable, if no ISR is available to become the next leader safely.
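Putting the three durability knobs together, a hedged sketch (the property names are Kafka's own; the topic- and broker-level settings appear as comments because they are configured outside the client):

```java
import java.util.Properties;

public class DurabilityFirstConfigSketch {
    // Producer side: a send succeeds only after ALL in-sync replicas acknowledge it.
    static Properties producerProps() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "broker1:9092"); // illustrative address
        p.put("acks", "-1");                        // equivalent to acks=all
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return p;
    }
    // Topic level (e.g. set at topic creation): with replication factor 3,
    // requiring 2 in-sync replicas tolerates one replica failure; if fewer
    // than 2 are in sync, producers using acks=-1 get an exception instead
    // of silently losing durability.
    //   min.insync.replicas=2
    //
    // Broker level (server.properties): never elect an out-of-sync replica
    // as leader; prefer an unavailable partition to lost messages.
    //   unclean.leader.election.enable=false
}
```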
Failure cases:
1) Leader failure
* Leader election is handled by the controller broker.
* Each broker registers an ephemeral node in ZooKeeper; when the leader's node disappears, ZooKeeper notifies the controller, which elects a new leader.
2) Follower failure
* What happens when a broker goes down
* How are new messages handled
* What happens when the downed broker comes back up and is ready to rejoin the ISR
There is no UI that gives visibility into the number of topics, partitions, consumers/consumer groups, topic names, log retention (time/capacity), current log size, consumer group IDs, partitions, offsets, lag, throughput, or owner.
KAFKA-1890 - Fix bug preventing Mirror Maker from successful rebalance
* Expanding your cluster:
Just assign a broker ID and start Kafka on the new node.
Partitions are not automatically assigned to the new broker.
Partitions have to be migrated manually -> the partition re-assignment tool generates and executes a custom re-assignment plan and verifies its status.
* Decommissioning brokers: a custom re-assignment plan moves all the replicas to more than one broker (evenly distributed).
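For illustration, here is the same kind of migration expressed against the later Admin API (Kafka 2.4+); the 0.8-era workflow instead fed a JSON plan to the kafka-reassign-partitions tool. The topic name and broker IDs are hypothetical:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class ReassignSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // illustrative address
        try (Admin admin = Admin.create(props)) {
            // Move partition 0 of "events" onto brokers 2, 3 and 4, e.g. to
            // populate a newly added broker or drain a decommissioned one.
            Map<TopicPartition, Optional<NewPartitionReassignment>> plan = Map.of(
                new TopicPartition("events", 0),
                Optional.of(new NewPartitionReassignment(List.of(2, 3, 4))));
            admin.alterPartitionReassignments(plan).all().get();
            // Poll admin.listPartitionReassignments() to verify completion.
        }
    }
}
```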
While most of the focus for 0.6 and 0.7 has been scalability, at-least-once semantics, strong-enough guarantees, not falling over, persistence and efficiency
Non-blocking I/O for the producer - flush logic to move completely to the background
Better durability and consistency controls
Better partitioning
Security https://cwiki.apache.org/confluence/display/KAFKA/Security
Authentication
TLS/SSL
Kerberos
Authorization
Pluggable
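As a hedged sketch of what a secured client configuration eventually looked like once this work shipped (the security features landed after 0.8; the property names are Kafka client properties, while the paths and password are placeholders):

```java
import java.util.Properties;

public class SecureClientConfigSketch {
    static Properties props() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "broker1:9093");      // TLS listener (illustrative)
        p.put("security.protocol", "SASL_SSL");          // Kerberos auth over an encrypted channel
        p.put("sasl.mechanism", "GSSAPI");               // Kerberos
        p.put("sasl.kerberos.service.name", "kafka");
        p.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks"); // placeholder path
        p.put("ssl.truststore.password", "changeit");    // placeholder
        return p;
    }
}
```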
Yahoo's Kafka Manager:
* Manage multiple clusters
* Easy inspection of the state of the cluster
* Takes care of partition assignments and reassignments (based on the current state of the cluster)
* varnishkafka is a varnish log collector with an integrated Apache Kafka producer. It was written from scratch with performance and modularity in mind; varnishkafka consumes about a third of the CPU that varnishncsa does and has a far more frugal memory approach.
* kafkatee consumes messages from one or more Kafka topics and writes the messages to one or more outputs - either command pipes or files.
* Opportunity: there is no well-recognized HBase connector for Kafka.
Streaming support, reliability, guaranteed ordering of messages within a partition, scalability, runs on distributed infrastructure, persistence, compression.