This document provides an overview of Apache Kafka and Apache Spark Streaming and their integration. It discusses what Kafka and Spark Streaming are, how they work, their benefits, and semantics when used together. It also provides examples of code for using the new Kafka integration in Spark 2.0+, including getting metadata, storing offsets in Kafka, and achieving at-most-once, at-least-once, and exactly-once processing semantics. Finally, it shares some insights into how Billy Mobile uses Spark Streaming with Kafka to process large volumes of data.
Spark Streaming has supported Kafka since its inception, but a lot has changed since then, on both the Spark and Kafka sides, to make this integration more fault-tolerant and reliable. Apache Kafka 0.10 (actually since 0.9) introduced the new Consumer API, built on top of a new group coordination protocol provided by Kafka itself.
So a new Spark Streaming integration comes to the playground, with a design similar to the 0.8 Direct DStream approach. However, there are notable differences in usage, and many exciting new features. In this talk, we will cover the main differences between this new integration and the previous one (for Kafka 0.8), and why Direct DStreams have replaced Receivers for good. We will also see how to achieve different semantics (at-least-once, at-most-once, exactly-once) with code examples.
Finally, we will briefly introduce the usage of this integration in Billy Mobile to ingest and process the continuous stream of events from our AdNetwork.
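The three delivery semantics named above differ mainly in when the consumed offset is committed relative to when the output is written. The following is a minimal sketch of that idea in plain Python, not the Spark or Kafka API: `FakeLog`, `run_batch`, and `crash_then_restart` are hypothetical names invented for illustration, and the "partition" is just an in-memory list. It assumes a single partition and a single crash in the middle of a batch.

```python
from typing import Callable, List, Optional


class FakeLog:
    """Hypothetical in-memory stand-in for one partition plus its committed offset."""

    def __init__(self, records: List[str]) -> None:
        self.records = list(records)
        self.committed = 0  # offset from which a restarted consumer resumes


def run_batch(
    log: FakeLog,
    write: Callable[[int, str], None],
    commit_first: bool,
    fail_at: Optional[int] = None,
) -> None:
    """Process all records from the committed offset onward as one batch.

    commit_first=True  commits before writing output (at-most-once).
    commit_first=False commits after writing output (at-least-once).
    fail_at is the offset at which a simulated crash happens.
    """
    start, end = log.committed, len(log.records)
    if commit_first:
        log.committed = end
    for offset in range(start, end):
        if offset == fail_at:
            raise RuntimeError("simulated crash")
        write(offset, log.records[offset])
    if not commit_first:
        log.committed = end


def crash_then_restart(write: Callable[[int, str], None], commit_first: bool) -> None:
    """Run one batch that crashes at offset 2, then restart from the committed offset."""
    log = FakeLog(["a", "b", "c", "d"])
    try:
        run_batch(log, write, commit_first, fail_at=2)
    except RuntimeError:
        pass  # the job dies; only the committed offset survives
    run_batch(log, write, commit_first)


# At-most-once: offsets were committed before the crash, so "c" and "d" are lost.
at_most_once: List[str] = []
crash_then_restart(lambda offset, value: at_most_once.append(value), commit_first=True)

# At-least-once: nothing was committed before the crash, so "a" and "b" are replayed.
at_least_once: List[str] = []
crash_then_restart(lambda offset, value: at_least_once.append(value), commit_first=False)

# Exactly-once effect: at-least-once delivery into an idempotent sink keyed by offset.
exactly_once: dict = {}
crash_then_restart(exactly_once.__setitem__, commit_first=False)

print(at_most_once)
print(at_least_once)
print(exactly_once)
```

In this sketch the exactly-once result comes from making the write idempotent, so replayed records overwrite themselves. Storing results and offsets together in one transaction is another common route to the same guarantee; which one the talk's own examples use is not stated here.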
[March sn meetup] apache pulsar + apache nifi for cloud data lake - Timothy Spann
https://www.meetup.com/new-york-city-apache-pulsar-meetup/events/283837865/
Learn how to use Apache Pulsar and Apache NiFi to Stream to your Data Lake
Discover how to stream data to and from your data lake or data mart using Apache Pulsar™ and Apache NiFi®. Learn how these cloud-native, scalable open-source projects built for streaming data pipelines work together to enable you to quickly build applications with minimal coding.
|WHAT THE SESSION WILL COVER|
Best Practices for using Pulsar and NiFi
A deep dive on Apache NiFi's Pulsar connector and demos
Building an End-to-End Application in the Hybrid Cloud
Attend for a chance to win a We <3 Pulsar t-shirt! The first 50 registrants who register through here [https://hubs.ly/Q013LTpn0] will be entered in a drawing!
—------------------------
|AGENDA|
6:00 - 7:00 PM EST: Presentation - Tim Spann, StreamNative Developer Advocate
7:00 - 8:00 PM EST: Presentation - John Kuchmek, Cloudera Principal Solutions Engineer
8:00 - 8:30 PM EST: Q&A + Networking
—------------------------
|ABOUT THE SPEAKERS|
John Kuchmek is a Principal Solutions Engineer for Cloudera. Before joining Cloudera, John transitioned to the Autonomous Intelligence team where he was in charge of integrating the platforms to allow data scientists to work with various types of data.
Tim Spann is a Developer Advocate for StreamNative. He works with StreamNative Cloud, Apache Pulsar™, Apache Flink®, Flink® SQL, Big Data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science. He is currently working on a book about the FLiP Stack.
Since April 2016, Spark-as-a-service has been available to researchers in Sweden from the Swedish ICT SICS Data Center at www.hops.site. Researchers work in an entirely UI-driven environment on a platform built with only open-source software.
Spark applications can be either deployed as jobs (batch or streaming) or written and run directly from Apache Zeppelin. Spark applications are run within a project on a YARN cluster with the novel property that Spark applications are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users that are not members of the project. In this talk we will discuss the challenges in building multi-tenant Spark Streaming applications on YARN that are metered and easy to debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running Spark Streaming applications, how we use Grafana and Graphite for monitoring Spark Streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr. Elephant. We will also discuss the experiences of our users (over 120 users as of Sept 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications.
To conclude, we will also give an overview of our course ID2223 on Large Scale Learning and Deep Learning, in which 60 students designed and ran SparkML applications on the platform.
Using FLiP with influxdb for edgeai iot at scale 2022 - Timothy Spann
https://adtmag.com/webcasts/2021/12/influxdata-february-10.aspx?tc=page0
FLiP Stack (Apache Flink, Apache Pulsar, Apache NiFi, Apache Spark) with Influx DB for Edge AI and IoT workloads at scale
Tim Spann
Developer Advocate
StreamNative
datainmotion.dev
Apache Kafka and Apache Pulsar are both popular messaging frameworks. Apache Kafka has a big user base, and people will want to know how Kafka and Pulsar are the same or different in many respects. This talk will cover the key differences and how Pulsar adds new features that are missing in Kafka.
We will cover:
The architectural differences and similarities in Pulsar and Kafka. Show use of BookKeeper and what that allows.
The Producer API and functionality differences. Show HelloWorld for both.
The Consumer API and functionality differences. Show HelloWorld for both.
The core use case and functionality differences. Show Pulsar as handling all of Kafka’s use cases and new ones that aren’t possible with Kafka.
This talk will allow people who are choosing between Kafka and Pulsar to have a more accurate and in-depth understanding of the differences between them. For companies considering a switch from Kafka to Pulsar, this talk will give them the cheatsheet to go back and make a more informed decision.
Modern data systems don't just process massive amounts of data; they need to do it very fast. Using fraud detection as a convenient example, this session will include best practices on how to build real-time data processing applications using Apache Kafka. We'll explain how Kafka makes real-time processing almost trivial, discuss the pros and cons of the famous lambda architecture, help you choose a stream processing framework and even talk about deployment options.
Many architectures include both real-time and batch processing components. This often results in two separate pipelines performing similar tasks, which can be challenging to maintain and operate. We'll show how a single, well designed ingest pipeline can be used for both real-time and batch processing, making the desired architecture feasible for scalable production use cases.
Being Ready for Apache Kafka - Apache: Big Data Europe 2015 - Michael Noll
These are the slides of my Kafka talk at Apache: Big Data Europe in Budapest, Hungary. Enjoy! --Michael
Apache Kafka is a high-throughput distributed messaging system that has become a mission-critical infrastructure component for modern data platforms. Kafka is used across a wide range of industries by thousands of companies such as Twitter, Netflix, Cisco, PayPal, and many others.
After a brief introduction to Kafka, this talk will provide an update on the growth and status of the Kafka project community. The rest of the talk will focus on walking the audience through what's required to put Kafka in production. We’ll give an overview of the current ecosystem of Kafka, including: client libraries for creating your own apps; operational tools; peripheral components required for running Kafka in production and for integration with other systems like Hadoop. We will cover the upcoming project roadmap, which adds key features to make Kafka even more convenient to use and more robust in production.
Introducing Kafka Streams, the new stream processing library of Apache Kafka,... - Michael Noll
Video recording: https://www.youtube.com/watch?v=o7zSLNiTZbA
Slides of my talk at Berlin Buzzwords in June 2016.
Abstract:
"In the past few years Apache Kafka has established itself as the world's most popular real-time, large-scale messaging system. It is used across a wide range of industries by thousands of companies such as Netflix, Cisco, PayPal, Twitter, and many others.
In this session I am introducing the audience to Kafka Streams, which is the latest addition to the Apache Kafka project. Kafka Streams is a stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a high-level DSL for writing stream processing applications. As such it is the most convenient yet scalable option to process and analyze data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Apache Storm and Spark Streaming, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka."
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
This is the first part of the presentation.
Here is the second part of this presentation:
http://www.slideshare.net/knoldus/introduction-to-apache-kafka-part-2
Pulsar Functions Deep Dive_Sanjeev Kulkarni - StreamNative
Pulsar Functions provide a simple yet powerful way of interacting with Pulsar topics, transforming, enriching and analyzing data contained in the streams. And with pluggable runtime environments, one can run Pulsar Functions as threads/processes managed by Pulsar, or as containers/pods managed by external schedulers like Kubernetes. This talk goes deep into the weeds of the underlying concepts in its implementation. In particular, we will talk about the concepts of the Runtime and the scheduler that manage Pulsar-managed functions. We will also delve into current pitfalls and areas of improvement.
Technologies Referenced: Akka, Typesafe Reactive Platform
Technical Level: Introductory
Audience: Senior Developers, Architects
Presenter: Konrad Malawski, Akka Software Engineer, Typesafe, Inc.
Akka is a runtime framework for building resilient, distributed applications in Java or Scala. In this webinar, Konrad Malawski discusses the roadmap and features of the upcoming Akka 2.4.0 and reveals three upcoming enhancements that enterprises will receive in the latest certified, tested build of Typesafe Reactive Platform.
Akka Split Brain Resolver (SBR)
Akka SBR provides advanced recovery scenarios in Akka Clusters, improving on the safety of Akka’s automatic resolution to avoid cascading partitioning.
Akka Support for Docker and NAT
Run Akka Clusters in Docker containers or NAT with complete hostname and port visibility on Java 6+ and Akka 2.3.11+
Akka Long-Term Support
Receive Akka 2.4 support for Java 6, Java 7, and Scala 2.10
In this presentation we describe the design and implementation of Kafka Connect, Kafka’s new tool for scalable, fault-tolerant data import and export. First we’ll discuss some existing tools in the space and why they fall short when applied to data integration at large scale. Next, we will explore Kafka Connect’s design and how it compares to systems with similar goals, discussing key design decisions that trade off between ease of use for connector developers, operational complexity, and reuse of existing connectors. Finally, we’ll discuss how standardizing on Kafka Connect can ultimately lead to simplifying your entire data pipeline, making ETL into your data warehouse and enabling stream processing applications as simple as adding another Kafka connector.
Kafka on Pulsar: bringing native Kafka protocol support to Pulsar_Sijie&Pierre - StreamNative
Apache Pulsar is a distributed and open-source pub-sub messaging system. It offers many advantages over Kafka, such as multi-tenancy, geo-replication, decoupled storage, and even directly integrated SQL and FaaS. The only thing missing for wide adoption is support for the de facto standard for streaming: Kafka. And this is how our story begins.
In this talk, Sijie Guo from StreamNative and Pierre Zemb from OVHcloud will share the journey of building Kafka-on-Pulsar (KoP) to bring native Kafka protocol support to Pulsar. Before joining forces on building KoP, OVHcloud implemented a Kafka proxy in Rust capable of transforming the Kafka protocol to that of Pulsar on the fly, and encountered some challenges. After realizing that StreamNative was working on bringing the Kafka protocol natively to the Pulsar broker via a pluggable protocol handler mechanism, OVHcloud joined forces with StreamNative to work on bringing Kafka protocol support to Pulsar brokers.
At the end of this talk, you will know more about the inner workings of Kafka and Pulsar. You'll also get feedback from both companies from their initial proofs of concepts and the current implementation.
Meeting Kafka, the distributed log, by Florian GARCIA - La Cuisine du Web
Kafka is something of the new star on the message queue scene. Yet Kafka does not present itself as such: it is a distributed log!
So what is it? How does it work? And above all, how and why do I use it?
In this session, we dissect the beast to explain it all to you! On the agenda: concepts, use cases, streaming, and lessons learned from experience!
Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses - Denodo
Watch the presentation on-demand now: https://goo.gl/kceFTe
Today’s digital economy demands a new way of running business. Flexible access to information and responses in real time are essential for outpacing competition.
Watch this Denodo DataFest 2017 session to discover:
• Data access challenges faced by organizations today.
• How data virtualization facilitates real-time analytics.
• Key use cases and customer success stories.
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St... - Michael Noll
My talk at Google DevFest Switzerland, Fribourg, Oct 2017.
https://devfest.ch/schedule/day1?sessionId=118
Abstract:
Modern businesses have data at their core, and this data is changing continuously. How can we harness this torrent of information in real-time? The answer is stream processing, and the technology that has since become the core platform for streaming data is Apache Kafka.
Among the thousands of companies that use Kafka to transform and reshape their industries are the likes of Netflix, Uber, PayPal, and AirBnB, but also established players such as Goldman Sachs, Cisco, and Oracle. Unfortunately, today’s common architectures for real-time data processing at scale suffer from complexity: there are many technologies that need to be stitched and operated together, and each individual technology is often complex by itself. This has led to a strong discrepancy between how we, as engineers, would like to work vs. how we actually end up working in practice.
In this session we talk about how Apache Kafka helps you to radically simplify your data architectures. We cover how you can now build normal applications to serve your real-time processing needs — rather than building clusters or similar special-purpose infrastructure — and still benefit from properties such as high scalability, distributed computing, and fault-tolerance, which are typically associated exclusively with cluster technologies. We discuss common use cases to realize that stream processing in practice often requires database-like functionality, and how Kafka allows you to bridge the worlds of streams and databases when implementing your own core business applications (inventory management for large retailers, patient monitoring in healthcare, fleet tracking in logistics, etc), for example in the form of event-driven, containerized microservices. We will also give a brief shout-out to the recently launched KSQL, a streaming SQL engine for Apache Kafka.
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)Ontico
HighLoad++ 2017
Hall "Delhi + Kolkata", November 8, 17:00
Abstract:
http://www.highload.ru/2017/abstracts/2978.html
When you are running systems in production, clearly you want to make sure they are up and running at all times. But in a distributed system such as Apache Kafka… what does “up and running” even mean?
...
Reactive Microservices with Apache Kafka / Denis Ivanov (2GIS)Ontico
HighLoad++ 2017
Hall "Delhi + Kolkata", November 7, 14:00
Abstract:
http://www.highload.ru/2017/abstracts/3031.html
Apache Kafka is a fairly popular open-source platform for processing message streams. The distributed log abstraction at the core of Kafka makes it possible to use it as a queuing system, while also offering some very useful advantages that are unavailable even to ESB-class solutions.
...
Enabling Real-Time Business with Change Data CaptureMapR Technologies
Machine learning (ML) and artificial intelligence (AI) enable intelligent processes that can autonomously make decisions in real-time. The real challenge for effective ML and AI is getting all relevant data to a converged data platform in real-time, where it can be processed using modern technologies and integrated into any downstream systems.
Kafka is a high-throughput, fault-tolerant, scalable platform for building high-volume near-real-time data pipelines. This presentation is about tuning Kafka pipelines for high-performance.
Configuration parameters and deployment topologies essential to achieving higher throughput and low latency across the pipeline are discussed, along with lessons learned in troubleshooting and optimizing a truly global data pipeline that replicates 100 GB of data in under 25 minutes.
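As a rough illustration of the kind of producer settings that throughput tuning usually touches, the sketch below collects a few of them and renders them as a properties file. The keys are standard Kafka producer configuration names; the values and the `to_properties` helper are assumptions made for illustration, not settings taken from the talk.

```python
# Illustrative Kafka producer settings for a throughput-oriented pipeline.
# The keys are standard producer configs; the values are assumptions chosen
# for illustration only and would need tuning against a real workload.
PRODUCER_TUNING = {
    "acks": "all",              # wait for all in-sync replicas (durability over latency)
    "compression.type": "lz4",  # compress batches to cut network and disk usage
    "batch.size": 262144,       # bytes per partition batch; larger batches favor throughput
    "linger.ms": 20,            # wait up to 20 ms so batches can fill before sending
    "buffer.memory": 67108864,  # total bytes the producer may buffer before blocking
}


def to_properties(config):
    """Render a config mapping as sorted `key=value` lines (Java .properties style)."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


if __name__ == "__main__":
    print(to_properties(PRODUCER_TUNING))
```

Larger `batch.size` and `linger.ms` values generally trade a little latency for throughput, which is why the two are usually tuned together.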
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017Till Rohrmann
In our fast-moving world, it becomes more and more important for companies to gain near real-time insights from their data to make faster decisions. These insights not only provide a competitive edge over one's rivals but also enable a company to create completely new services and products. Among others, predictive user interfaces and online recommendations can be implemented when large amounts of data can be processed in real time.
Apache Flink, one of the most advanced open source distributed stream processing platforms, allows you to extract business intelligence from your data in near real-time. With Apache Flink it is possible to process billions of messages with milliseconds latency. Moreover, its expressive APIs allow you to quickly solve your problems, ranging from classical analytical workloads to distributed event-driven applications.
In this talk, I will introduce Apache Flink and explain how it enables users to develop distributed applications and process analytical workloads alike. Starting with Flink’s basic concepts of fault-tolerance, statefulness and event-time aware processing, we will take a look at the different APIs and what they allow us to do. The talk will be concluded by demonstrating how we can use Flink’s higher level abstractions such as FlinkCEP and StreamSQL to do declarative stream processing.
Apache Storm 0.9 basic training - VerisignMichael Noll
Apache Storm 0.9 basic training (130 slides) covering:
1. Introducing Storm: history, Storm adoption in the industry, why Storm
2. Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism
3. Operating Storm: architecture, hardware specs, deploying, monitoring
4. Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps, performance and scalability tuning
5. Playing with Storm using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/
Many thanks to the Twitter Engineering team (the creators of Storm) and the Apache Storm open source community!
Apache Kafka 0.8 basic training - VerisignMichael Noll
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise...Denodo
Watch live presentation here: https://goo.gl/UcZEHU
Big data projects are becoming mature and consistent. However, they remain siloed compared to the enterprise data. In addition, new streaming data now needs to be integrated as well.
Watch this Denodo DataFest 2017 session to discover:
• How big data projects can be combined with other enterprise data.
• How to integrate streaming data into the mix.
• Benefits of aggregating the data without having to move them into a centralized repository.
Apache Kafka is becoming the message bus for transferring huge volumes of data from various sources into Hadoop. It is also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices for deploying Apache Kafka in production: how to secure a Kafka cluster, how to pick topic partitions, how to upgrade to newer versions, and how to migrate to the new Kafka Producer and Consumer APIs. We will also talk about the best practices involved in running a producer/consumer.
In the Kafka 0.9 release, we added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Kafka now supports user authentication and access control over who can read from and write to a Kafka topic. Apache Ranger also uses the pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
We will showcase an open-sourced Kafka REST API and an Admin UI that help users create topics, reassign partitions, issue Kafka ACLs, and monitor consumer offsets.
Data Stream Processing - Concepts and FrameworksMatthias Niehoff
An overview on various concepts used in data stream processing. Most of them are used for solving problems in the field of time, focussing on processing time compared to event time. The techniques shown include the Dataflow API as it was introduced by Google and the concepts of stream and table duality. But I will also come up with other problems like data lookup and deployment of streaming applications and various strategies on solving these problems.
In the end I will give a brief outline on the implementation status of those strategies in the popular streaming frameworks Apache Spark Streaming, Apache Flink and Kafka Streams.
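The stream and table duality mentioned above can be sketched in a few lines: a table is the result of folding a stream of updates, and a table can in turn be replayed as a stream. This is a minimal sketch in plain Python, not tied to any of the frameworks named in the abstract; the function names and the use of `None` as a delete marker are assumptions of the example.

```python
def table_from_stream(changelog):
    """Fold a changelog stream of (key, value) updates into a table.

    A value of None is treated as a delete (an assumption of this sketch).
    """
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)   # delete the row if it exists
        else:
            table[key] = value     # upsert: the latest value per key wins
    return table


def stream_from_table(table):
    """Replay a table as a changelog stream: one update per current row."""
    return list(table.items())


if __name__ == "__main__":
    changelog = [("alice", 1), ("bob", 5), ("alice", 3), ("bob", None)]
    table = table_from_stream(changelog)
    print(table)
    # Replaying the table as a stream and folding it again gives the same table.
    print(table_from_stream(stream_from_table(table)) == table)
```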
Spark as part of a Hybrid RDBMS Architecture-John Leach Cofounder Splice MachineData Con LA
In this talk, we will discuss how we use Spark as part of a hybrid RDBMS architecture that includes Hadoop and HBase. The optimizer evaluates each query and sends OLTP traffic (including CRUD queries) to HBase and OLAP traffic to Spark. We will focus on the challenges of handling the tradeoffs inherent in an integrated architecture that simultaneously handles real-time and batch traffic. Lessons learned include: - Embedding Spark into a RDBMS - Running Spark on Yarn and isolating OLTP traffic from OLAP traffic - Accelerating the generation of Spark RDDs from HBase - Customizing the Spark UI The lessons learned can also be applied to other hybrid systems, such as Lambda architectures.
Bio:-
John Leach is the CTO and Co-Founder of Splice Machine. With over 15 years of software experience under his belt, John’s expertise in analytics and BI drives his role as Chief Technology Officer. Prior to Splice Machine, John founded Incite Retail in June 2008 and led the company’s strategy and development efforts. At Incite Retail, he built custom Big Data systems (leveraging HBase and Hadoop) for Fortune 500 companies. Prior to Incite Retail, he ran the business intelligence practice at Blue Martini Software and built strategic partnerships with integration partners. John was a key subject matter expert for Blue Martini Software in many strategic implementations across the world. His focus at Blue Martini was helping clients incorporate decision support knowledge into their current business processes utilizing advanced algorithms and machine learning. John received dual bachelor’s degrees in biomedical and mechanical engineering from Washington University in Saint Louis. Leach is the organizer emeritus for the Saint Louis Hadoop Users Group and is active in the Washington University Elliot Society.
Webinar - Highly Recommended: How to Use Machine Learning to Turn Data into ...Cloudera, Inc.
Companies today can ingest and manage their data with relative ease. The challenge now is to recognize and understand the hidden patterns in the data in order to generate added value. Because of the large volumes of data, this usually does not succeed with traditional approaches. The result: organizations struggle to truly innovate and differentiate themselves.
Apache Spark Streaming + Kafka 0.10 with Joan ViladrosarieraSpark Summit
Spark Streaming has supported Kafka since its inception, but a lot has changed since then, on both the Spark and Kafka sides, to make this integration more fault-tolerant and reliable. Apache Kafka 0.10 (actually since 0.9) introduced the new Consumer API, built on top of a new group coordination protocol provided by Kafka itself. So a new Spark Streaming integration comes to the playground, with a similar design to the 0.8 Direct DStream approach. However, there are notable differences in usage, and many exciting new features. In this talk, we will cover the main differences between this new integration and the previous one (for Kafka 0.8), and why Direct DStreams have replaced Receivers for good. We will also see how to achieve different semantics (at least once, at most once, exactly once) with code examples. Finally, we will briefly introduce the usage of this integration in Billy Mobile to ingest and process the continuous stream of events from our AdNetwork.
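The difference between the three semantics comes down to when offsets are committed relative to when results are written. The following is a minimal simulation in plain Python, assuming a single consumer that fails once mid-batch and restarts from its last committed offset. It does not use the Spark or Kafka APIs, and `consume`, `commit_first` and `crash_at` are hypothetical names introduced for the example.

```python
def consume(log, commit_first, crash_at):
    """Simulate a consumer that fails once while processing offset `crash_at`.

    commit_first=True  commits offsets before processing (at-most-once).
    commit_first=False commits only after processing (at-least-once).
    Returns the (offset, record) pairs that reached the output.
    """
    committed = 0        # next offset to read, as kept in the offset store
    delivered = []       # side effects visible downstream
    has_crashed = False
    while committed < len(log):
        start, end = committed, len(log)
        if commit_first:
            committed = end
        try:
            for offset in range(start, end):
                if offset == crash_at and not has_crashed:
                    has_crashed = True
                    raise RuntimeError("simulated worker failure")
                delivered.append((offset, log[offset]))
        except RuntimeError:
            continue     # restart from the last committed offset
        if not commit_first:
            committed = end
    return delivered


if __name__ == "__main__":
    log = ["a", "b", "c", "d"]
    # At-most-once: records after the failure are lost.
    print([rec for _, rec in consume(log, commit_first=True, crash_at=2)])
    # At-least-once: records before the failure are delivered twice.
    print([rec for _, rec in consume(log, commit_first=False, crash_at=2)])
    # Exactly-once effect: at-least-once delivery into an idempotent sink keyed by offset.
    print(list(dict(consume(log, commit_first=False, crash_at=2)).values()))
```

The last case relies on the sink rather than the consumer: writing each record under its offset makes replays overwrite earlier writes instead of duplicating them.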
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at...Big Data Spain
Spark Streaming has supported Kafka since its inception, but a lot has changed since then, on both the Spark and Kafka sides, to make this integration more fault-tolerant and reliable.
https://www.bigdataspain.org/2017/talk/spark-streaming-kafka-0-10-an-integration-story
Big Data Spain 2017
16th - 17th Kinépolis Madrid
In this presentation Guido Schmutz talks about Apache Kafka, Kafka Core, Kafka Connect, Kafka Streams, Kafka and "Big Data"/"Fast Data" ecosystems, the Confluent Data Platform, and Kafka in architecture.
Kafka is primarily used to build real-time streaming data pipelines and applications that adapt to the data streams. It combines messaging, storage, and stream processing to allow storage and analysis of both historical and real-time data.
Kafka Connect & Kafka Streams/KSQL - the ecosystem around KafkaGuido Schmutz
After a quick overview and introduction of Apache Kafka, this session covers two components which extend the core of Apache Kafka: Kafka Connect and Kafka Streams/KSQL.
Kafka Connect's role is to access data from the outside world and make it available inside Kafka by publishing it to a Kafka topic. On the other hand, Kafka Connect is also responsible for transporting information from inside Kafka to the outside world, which could be a database or a file system. There are many existing connectors for different source and target systems available out of the box, either provided by the community or by Confluent or other vendors. You simply configure these connectors and off you go.
Kafka Streams is a lightweight component which extends Kafka with stream processing functionality. With it, Kafka can not only transport events and messages reliably and scalably through the Kafka broker but also analyse and process these events in real time. Interestingly, Kafka Streams does not provide its own cluster infrastructure, and it is also not meant to run on a Kafka cluster. The idea is to run Kafka Streams where it makes sense, which can be inside a “normal” Java application, inside a Web container or on a more modern containerized (cloud) infrastructure, such as Mesos, Kubernetes or Docker. Kafka Streams has a lot of interesting features, such as reliable state handling, queryable state and much more. KSQL is a streaming engine for Apache Kafka, providing a simple and completely interactive SQL interface for processing data in Kafka.
Apache Kafka - Scalable Message Processing and more!Guido Schmutz
After a quick overview and introduction of Apache Kafka, this session covers two components which extend the core of Apache Kafka: Kafka Connect and Kafka Streams/KSQL.
Kafka Connect's role is to access data from the outside world and make it available inside Kafka by publishing it to a Kafka topic. On the other hand, Kafka Connect is also responsible for transporting information from inside Kafka to the outside world, which could be a database or a file system. There are many existing connectors for different source and target systems available out of the box, either provided by the community or by Confluent or other vendors. You simply configure these connectors and off you go.
Kafka Streams is a lightweight component which extends Kafka with stream processing functionality. With it, Kafka can not only transport events and messages reliably and scalably through the Kafka broker but also analyse and process these events in real time. Interestingly, Kafka Streams does not provide its own cluster infrastructure, and it is also not meant to run on a Kafka cluster. The idea is to run Kafka Streams where it makes sense, which can be inside a “normal” Java application, inside a Web container or on a more modern containerized (cloud) infrastructure, such as Mesos, Kubernetes or Docker. Kafka Streams has a lot of interesting features, such as reliable state handling, queryable state and much more. KSQL is a streaming engine for Apache Kafka, providing a simple and completely interactive SQL interface for processing data in Kafka.
Apache Kafka - Scalable Message Processing and more!Guido Schmutz
In the world of sensors and social media streams, the integration and handling of high-volume event streams is more important than ever. Events have to be handled both efficiently and reliably, and often many consumers or systems are interested in all or part of the events. How do we make sure that all these events are accepted and forwarded in an efficient and reliable way? Apache Kafka, a distributed, highly scalable messaging broker built for exchanging huge amounts of messages between a source and a target, can be of great help in such a scenario.
This session introduces Apache Kafka and its place in a modern architecture, shows its integration with the Oracle Stack and presents the Oracle Event Hub cloud service, the managed Kafka service.
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
Presentation @ Oracle Code Berlin.
Independent of the source of data, the integration of event streams into an Enterprise Architecture is becoming more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable messaging broker, built for exchanging huge amounts of messages between a source and a target. This session will start with an introduction to Apache Kafka and present its role in a modern data / information architecture and the advantages it brings to the table.
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Lightbend
Things were easier when all our data used to be offline, analyzed overnight in batches. Now our data is online, in motion, and generated constantly. For architects, developers and their businesses, this means that there is an urgent need for tools and applications that can deliver real-time (or near real-time) streaming ETL capabilities.
In this session by Konrad Malawski, author, speaker and Senior Akka Engineer at Lightbend, you will learn how to build these streaming ETL pipelines with Akka Streams, Alpakka and Apache Kafka, and why they matter to enterprises that are increasingly turning to streaming Fast Data applications.
Apache Kafka evolved from an enterprise messaging system to a fully distributed streaming data platform (Kafka Core + Kafka Connect + Kafka Streams) for building streaming data pipelines and streaming data applications.
This talk, which I gave at the Chicago Java Users Group (CJUG) on June 8th 2017, focuses mainly on Kafka Streams, a lightweight open source Java library for building stream processing applications on top of Kafka using Kafka topics as input/output.
You will learn more about the following:
1. Apache Kafka: a Streaming Data Platform
2. Overview of Kafka Streams: Before Kafka Streams? What is Kafka Streams? Why Kafka Streams? What are Kafka Streams key concepts? Kafka Streams APIs and code examples?
3. Writing, deploying and running your first Kafka Streams application
4. Code and Demo of an end-to-end Kafka-based Streaming Data Application
5. Where to go from here?
Unlocking the Power of Apache Kafka: How Kafka Listeners Facilitate Real-time...Denodo
Watch full webinar here: https://buff.ly/43PDVsz
In today's fast-paced, data-driven world, organizations need real-time data pipelines and streaming applications to make informed decisions. Apache Kafka, a distributed streaming platform, provides a powerful solution for building such applications and, at the same time, gives the ability to scale without downtime and to work with high volumes of data. At the heart of Apache Kafka lie Kafka Topics, which enable communication between clients and brokers in the Kafka cluster.
Join us for this session with Pooja Dusane, Data Engineer at Denodo where we will explore the critical role that Kafka listeners play in enabling connectivity to Kafka Topics. We'll dive deep into the technical details, discussing the key concepts of Kafka listeners, including their role in enabling real-time communication between consumers and producers. We'll also explore the various configuration options available for Kafka listeners and demonstrate how they can be customized to suit specific use cases.
Attend and Learn:
- The critical role that Kafka listeners play in enabling connectivity in Apache Kafka.
- Key concepts of Kafka listeners and how they enable real-time communication between clients and brokers.
- Configuration options available for Kafka listeners and how they can be customized to suit specific use cases.
Kafka Connect & Streams - the ecosystem around KafkaGuido Schmutz
After a quick overview and introduction of Apache Kafka, this session covers two components which extend the core of Apache Kafka: Kafka Connect and Kafka Streams/KSQL.
Kafka Connect's role is to access data from the outside world and make it available inside Kafka by publishing it to a Kafka topic. On the other hand, Kafka Connect is also responsible for transporting information from inside Kafka to the outside world, which could be a database or a file system. There are many existing connectors for different source and target systems available out of the box, either provided by the community or by Confluent or other vendors. You simply configure these connectors and off you go.
Kafka Streams is a lightweight component which extends Kafka with stream processing functionality. With it, Kafka can not only transport events and messages reliably and scalably through the Kafka broker but also analyse and process these events in real time. Interestingly, Kafka Streams does not provide its own cluster infrastructure, and it is also not meant to run on a Kafka cluster. The idea is to run Kafka Streams where it makes sense, which can be inside a “normal” Java application, inside a Web container or on a more modern containerized (cloud) infrastructure, such as Mesos, Kubernetes or Docker. Kafka Streams has a lot of interesting features, such as reliable state handling, queryable state and much more. KSQL is a streaming engine for Apache Kafka, providing a simple and completely interactive SQL interface for processing data in Kafka.
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
Independent of the source of data, the integration of event streams into an Enterprise Architecture is becoming more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable messaging broker, built for exchanging huge amounts of messages between a source and a target.
This session will start with an introduction to Apache Kafka and present its role in a modern data / information architecture and the advantages it brings to the table. Additionally, the Kafka ecosystem will be covered, as well as the integration of Kafka in the Oracle Stack, with products such as Golden Gate, Service Bus and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.
Apache Kafka - A modern Stream Processing PlatformGuido Schmutz
After a quick overview and introduction of Apache Kafka, this session covers two components which extend the core of Apache Kafka: Kafka Connect and Kafka Streams/KSQL.
Kafka Connect's role is to access data from the outside world and make it available inside Kafka by publishing it to a Kafka topic. On the other hand, Kafka Connect is also responsible for transporting information from inside Kafka to the outside world, which could be a database or a file system. There are many existing connectors for different source and target systems available out of the box, either provided by the community or by Confluent or other vendors. You simply configure these connectors and off you go.
Kafka Streams is a lightweight component which extends Kafka with stream processing functionality. With it, Kafka can not only transport events and messages reliably and scalably through the Kafka broker but also analyse and process these events in real time. Interestingly, Kafka Streams does not provide its own cluster infrastructure, and it is also not meant to run on a Kafka cluster. The idea is to run Kafka Streams where it makes sense, which can be inside a “normal” Java application, inside a Web container or on a more modern containerized (cloud) infrastructure, such as Mesos, Kubernetes or Docker. Kafka Streams has a lot of interesting features, such as reliable state handling, queryable state and much more. KSQL is a streaming engine for Apache Kafka, providing a simple and completely interactive SQL interface for processing data in Kafka.
Kafka Connect and Streams (Concepts, Architecture, Features)Kai Wähner
High level introduction to Kafka Connect and Kafka Streams, two components of the Apache Kafka open source framework. See the concepts, architecture and features.
Fast Streaming into Clickhouse with Apache PulsarTimothy Spann
https://github.com/tspannhw/SpeakerProfile/tree/main/2022/talks
Fast Streaming into Clickhouse with Apache Pulsar
https://github.com/tspannhw/FLiPC-FastStreamingIntoClickhouseWithApachePulsar
https://www.meetup.com/San-Francisco-Bay-Area-ClickHouse-Meetup/events/285271332/
Fast Streaming into Clickhouse with Apache Pulsar - Meetup 2022
StreamNative - Apache Pulsar - Stream to Altinity Cloud - Clickhouse
May the 4th Be With You!
04-May-2022 ClickHouse Meetup
CREATE TABLE iotjetsonjson_local
(
uuid String,
camera String,
ipaddress String,
networktime String,
top1pct String,
top1 String,
cputemp String,
gputemp String,
gputempf String,
cputempf String,
runtime String,
host String,
filename String,
host_name String,
macaddress String,
te String,
systemtime String,
cpu String,
diskusage String,
memory String,
imageinput String
)
ENGINE = MergeTree()
PARTITION BY uuid
ORDER BY (uuid);
CREATE TABLE iotjetsonjson ON CLUSTER '{cluster}' AS iotjetsonjson_local
ENGINE = Distributed('{cluster}', default, iotjetsonjson_local, rand());
-- Detections with more than 40% top-1 confidence, most confident first
select uuid, top1pct, top1, gputempf, cputempf
from iotjetsonjson
where toFloat32OrZero(top1pct) > 40
order by toFloat32OrZero(top1pct) desc, systemtime desc;
-- All readings, newest first
select uuid, systemtime, networktime, te, top1pct, top1, cputempf, gputempf, cpu, diskusage, memory, filename
from iotjetsonjson
order by systemtime desc;
-- Highest confidence and temperatures per detected label
select top1, max(toFloat32OrZero(top1pct)), max(gputempf), max(cputempf)
from iotjetsonjson
group by top1;
-- Same aggregate, ordered by highest confidence
select top1, max(toFloat32OrZero(top1pct)) as maxTop1, max(gputempf), max(cputempf)
from iotjetsonjson
group by top1
order by maxTop1;
Tim Spann
Developer Advocate
StreamNative
Confluent REST Proxy and Schema Registry (Concepts, Architecture, Features)Kai Wähner
High level introduction to Confluent REST Proxy and Schema Registry (leveraging Apache Avro under the hood), two components of the Apache Kafka open source ecosystem. See the concepts, architecture and features.
Similar to [Big Data Spain] Apache Spark Streaming + Kafka 0.10: an Integration Story (20)
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes a lot of work. It takes vision, leadership and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
2. About me
Joan Viladrosa Riera (@joanvr, joanviladrosa, joan.viladrosa@billymob.com)
- Degree in Computer Science: Advanced Programming Techniques & System Interfaces and Integration
- Co-Founder, Educabits: educational Big Data solutions using AWS cloud
- Big Data Developer, Trovit: Hadoop and MapReduce framework, SEM keywords optimization
- Big Data Architect & Tech Lead, BillyMobile: full architecture with Hadoop (Kafka, Storm, Hive, HBase, Spark, Druid, …)
6. What is Apache Kafka
As a central point
[Diagram: several Producers publish to Kafka, and several Consumers read from Kafka]
7. What is Apache Kafka
A lot of different connectors
[Diagram: Apache Storm, Apache Spark, My Java App and Logger on one side of Kafka; Apache Storm, Apache Spark, My Java App and Monitoring Tool on the other]
8. Kafka Terminology
- Topic: A feed of messages
- Producer: Processes that publish messages to a topic
- Consumer: Processes that subscribe to topics and process the feed of published messages
- Broker: Each server of a Kafka cluster that holds, receives and sends the actual data
13. Kafka Semantics
In short: consumer delivery semantics are up to you, not Kafka
- Kafka doesn’t store the state of the consumers*
- It just sends you what you ask for (topic, partition, offset, length)
- You have to take care of your state
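The division of responsibility described above can be pictured with a toy model. This is a minimal sketch, not the Kafka client API: `PartitionLog`, `Consumer` and their methods are hypothetical names invented for illustration. The log only answers "give me `length` records from `offset`", while the consumer owns its position.

```scala
// Toy model (NOT the Kafka client API): all names here are hypothetical.
object ToyPartitionLog {
  final case class Record(offset: Long, value: String)

  // "Broker" side: keeps the data, knows nothing about who has read what.
  final class PartitionLog(values: Vector[String]) {
    // Return up to `length` records starting at `offset`.
    def fetch(offset: Long, length: Int): Vector[Record] =
      values.zipWithIndex
        .slice(offset.toInt, offset.toInt + length)
        .map { case (v, i) => Record(i.toLong, v) }
  }

  // "Consumer" side: owns its position and decides when to advance it.
  final class Consumer(log: PartitionLog, start: Long) {
    private var position: Long = start
    def currentPosition: Long = position
    def poll(length: Int): Vector[Record] = {
      val batch = log.fetch(position, length)
      position += batch.size
      batch
    }
  }
}
```

Because the log is stateless with respect to readers, two calls to `fetch` with the same arguments return the same records. Whether a record is seen zero, one or several times therefore depends only on how the consumer stores and restores `position`.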
17. What is Apache Spark Streaming?
What makes it great?
- Process streams of data
- Micro-batching approach
- Same API as Spark
- Same integrations as Spark
- Same guarantees & semantics as Spark
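The micro-batching approach listed above can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: `Event`, `discretize` and `countWords` are hypothetical names, and real DStreams are built from arrival time on a cluster rather than from an in-memory sequence.

```scala
// Plain-Scala illustration of micro-batching (hypothetical names, no Spark).
object MicroBatching {
  final case class Event(timestampMs: Long, payload: String)

  // Cut the stream into consecutive batches of `batchIntervalMs`.
  // Returns (batch start time, events in that batch), ordered by time.
  def discretize(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timestampMs / batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (index, batch) => (index * batchIntervalMs, batch.sortBy(_.timestampMs)) }

  // An ordinary batch computation, applied to one micro-batch at a time.
  def countWords(batch: Seq[Event]): Map[String, Int] =
    batch
      .flatMap(_.payload.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
}
```

The point of the sketch is that `countWords` knows nothing about streaming: the same batch-style function runs on each slice, which is what "same API as Spark" amounts to.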
18. What is Apache Spark Streaming
Relying on the same Spark Engine: “same syntax” as batch jobs
https://spark.apache.org/docs/latest/streaming-programming-guide.html
19. How does it work?
- Discretized Streams
https://spark.apache.org/docs/latest/streaming-programming-guide.html
21. How does it work?
https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html
23. Spark Streaming Semantics
Side effects
As in Spark:
- Exactly-once semantics are not guaranteed for output actions
- Any side-effecting output operation may be repeated because of node failure, process failure, etc.
So, be careful when outputting to external sources
25. Spark Streaming Kafka Integration Timeline
Spark releases: 1.1 (sep-2014), 1.2 (dec-2014), 1.3 (mar-2015), 1.4 (jun-2015), 1.5 (sep-2015), 1.6 (jan-2016), 2.0 (jul-2016), 2.1 (dec-2016)
Milestones, in order: Receivers; Fault Tolerant WAL + Python API; Direct Streams + Python API; Improved Streaming UI; Metadata in UI (offsets) + Graduated Direct; Native Kafka 0.10 (experimental)
26. Kafka Receiver (≤ Spark 1.1)
[Diagram: a Receiver on the Executor continuously receives data using the High Level API and updates offsets in ZooKeeper; the Driver launches jobs on the data]
27. Kafka Receiver with WAL (Spark 1.2)
[Diagram: a Receiver on the Executor continuously receives data using the High Level API, updates offsets in ZooKeeper, and writes to a WAL on HDFS; the Driver launches jobs on the data]
28. Kafka Receiver with WAL (Spark 1.2)
[Diagram: in the application Driver, the Streaming Context has its computation checkpointed and block metadata written to the log, and the Spark Context launches jobs; on the Executor, the Receiver takes the input stream and block data is written to both memory and the log]
29. Kafka Receiver with WAL (Spark 1.2)
[Diagram: after a failure, the restarted Driver (restarted Streaming Context and Spark Context) restarts the computation from the info in checkpoints, recovers block metadata from the log and relaunches jobs; the restarted Executor recovers block data from the log; unacked data is resent to the restarted Receiver]
32. Direct Kafka Integration w/o Receiver or WAL (Spark 1.3)
[Diagram built up over slides 32 to 37: Driver and Executors]
1. The Driver queries the latest offsets and decides the offset ranges for the batch
2. The Driver launches jobs using the offset ranges, for example:
topic1, p1, (2000, 2100)
topic1, p2, (2010, 2110)
topic1, p3, (2002, 2102)
3. The Executors read data using the offset ranges in the jobs, using the Simple API
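Step 1 above (deciding the offset ranges for a batch) can be sketched as follows. This is a simplified, hypothetical model rather than Spark's internal code: `PlannedRange` stands in for Spark's `OffsetRange`, and the `maxPerPartition` cap is an assumption added to show how a per-batch limit would fit in.

```scala
// Hypothetical sketch of per-batch offset range planning (not Spark internals).
object OffsetRangePlanner {
  // Stand-in for Spark's OffsetRange: covers [fromOffset, untilOffset).
  final case class PlannedRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
    def count: Long = untilOffset - fromOffset
  }

  def plan(
      nextToRead: Map[(String, Int), Long], // where the previous batch ended
      latest: Map[(String, Int), Long],     // latest offsets reported by Kafka
      maxPerPartition: Long                 // assumed cap on records per partition
  ): Seq[PlannedRange] =
    nextToRead.toSeq.sortBy(_._1).map { case ((topic, partition), from) =>
      val available = latest.getOrElse((topic, partition), from)
      val until = math.min(available, from + maxPerPartition)
      // Never plan a range that ends before it starts.
      PlannedRange(topic, partition, from, math.max(from, until))
    }
}
```

Since a batch is fully described by these ranges, a failed task can simply re-read the same range from Kafka, which is why no Receiver or WAL is needed in this design.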
38. Direct Kafka API benefits
- No WALs or Receivers
- Allows end-to-end exactly-once semantics pipelines *
- More fault-tolerant
- More efficient
- Easier to use
* updates to downstream systems should be idempotent or transactional
41. What about Spark 2.0+ and new Kafka Integration?
This is why we are here, right?
42. Spark 2.0+ new Kafka Integration
Feature | spark-streaming-kafka-0-8 | spark-streaming-kafka-0-10
Broker Version | 0.8.2.1 or higher | 0.10.0 or higher
API Stability | Stable | Experimental
Language Support | Scala, Java, Python | Scala, Java
Receiver DStream | Yes | No
Direct DStream | Yes | Yes
SSL / TLS Support | No | Yes
Offset Commit API | No | Yes
Dynamic Topic Subscription | No | Yes
43. What’s really New with this New Kafka Integration?
- New Consumer API (instead of Simple API)
- Location Strategies
- Consumer Strategies
- SSL / TLS
- No Python API :(
44. Location Strategies
- The new consumer API pre-fetches messages into buffers
- So Spark keeps cached consumers on the executors
- It’s better to schedule partitions on hosts that already have the appropriate consumers
45. Location Strategies
- PreferConsistent
Distribute partitions evenly across available executors
- PreferBrokers
If your executors are on the same hosts as your Kafka brokers
- PreferFixed
Specify an explicit mapping of partitions to hosts
46. Consumer Strategies
- New consumer API has a number of different ways to specify topics, some of which require considerable post-object-instantiation setup.
- ConsumerStrategies provides an abstraction that allows Spark to obtain properly configured consumers even after restart from checkpoint.
47. Consumer Strategies
- Subscribe: subscribe to a fixed collection of topics
- SubscribePattern: use a regex to specify topics of interest
- Assign: specify a fixed collection of partitions
● Overloaded constructors to specify the starting offset for a particular partition.
● ConsumerStrategy is a public class that you can extend.
48. SSL/TLS encryption
- New consumer API supports SSL
- Only applies to communication between Spark
and Kafka brokers
- Still responsible for separately securing Spark
inter-node communication
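49. How to use New Kafka Integration on Spark 2.0+
Scala Example Code
Basic Usage
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker01:9092,broker02:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "stream_group_id",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("topicA", "topicB")
// streamingContext is an existing StreamingContext
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
stream.map(record => (record.key, record.value))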
50. How to use New Kafka Integration on Spark 2.0+
Scala Example Code
Getting Metadata
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // One OffsetRange per Kafka partition, indexed by Spark partition id
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { iter =>
    val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
    // get any needed data from the offset range
    val topic = osr.topic
    val kafkaPartitionId = osr.partition
    val begin = osr.fromOffset
    val end = osr.untilOffset
  }
}
54. How to use New Kafka Integration on Spark 2.0+
Scala Example Code
Store offsets in Kafka itself: Commit API
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // DO YOUR STUFF with DATA
  // Commit only after the output above has completed
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
55. Kafka + Spark Semantics
- At most once
- At least once
- Exactly once
56. Kafka + Spark Semantics
At most once
- We don’t want duplicates
- Not worth the hassle of ensuring that messages don’t get lost
- Example: Sending statistics over UDP
1. Set spark.task.maxFailures to 1
2. Make sure spark.speculation is false (the default)
3. Set Kafka param auto.offset.reset to “latest” (called “largest” in the old 0.8 consumer)
4. Set Kafka param enable.auto.commit to true
57. Kafka + Spark Semantics
At most once
- This will mean you lose messages on restart
- At least they shouldn’t get replayed.
- Test this carefully if it’s actually important to you that a message never gets repeated, because it’s not a common use case.
58. Kafka + Spark Semantics
At least once
- We don’t want to lose any record
- We don’t care about duplicates
- Example: Sending internal alerts on relatively rare occurrences in the stream
1. Set spark.task.maxFailures to a large value (> 1000)
2. Set Kafka param auto.offset.reset to “earliest” (called “smallest” in the old 0.8 consumer)
3. Set Kafka param enable.auto.commit to false
59. Kafka + Spark Semantics
At least once
- Don’t be silly! Do NOT replay your whole log on every restart…
- Manually commit the offsets when you are 100% sure records are processed
- If this is “too hard”, you’d better have a relatively short log retention
- Or be REALLY ok with duplicates. For example, you are outputting to an external system that handles duplicates for you (HBase)
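The difference between the at-most-once and at-least-once settings above comes down to whether the offset is committed before or after the record is processed. The following is a toy simulation of that ordering, not Spark or Kafka code: `run`, `commitFirst` and `crashAt` are hypothetical names, and a "crash" is modelled as stopping between the two steps for one record.

```scala
// Toy simulation of commit ordering (hypothetical names, no Spark or Kafka).
object DeliverySemantics {
  // Runs one consumer "lifetime" starting at the committed offset.
  // crashAt: offset at which the process dies between its two steps.
  // Returns (records processed in this run, committed offset afterwards).
  def run(
      records: Vector[String],
      committed: Int,
      commitFirst: Boolean,
      crashAt: Option[Int]
  ): (Vector[String], Int) = {
    var offset = committed
    var lastCommitted = committed
    var alive = true
    val processed = Vector.newBuilder[String]
    while (alive && offset < records.size) {
      if (commitFirst) {
        lastCommitted = offset + 1 // commit, then process
        if (crashAt.contains(offset)) alive = false
        else processed += records(offset)
      } else {
        processed += records(offset) // process, then commit
        if (crashAt.contains(offset)) alive = false
        else lastCommitted = offset + 1
      }
      offset += 1
    }
    (processed.result(), lastCommitted)
  }
}
```

With `commitFirst = true`, a crash loses the record that was committed but never processed. With `commitFirst = false`, the restarted run processes that record a second time. Neither ordering alone gives exactly-once.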
60. Kafka + Spark Semantics
Exactly once
- We don’t want to lose any record
- We don’t want duplicates either
- Example: Storing stream in data warehouse
1. We need some kind of idempotent writes, or whole-or-nothing writes (transactions)
2. Only store offsets immediately after writing the data
3. Same parameters as at least once
61. Kafka + Spark Semantics
Exactly once
- Probably the hardest to get right
- Still some small chance of failure if your app fails just between writing data and committing offsets… (but REALLY small)
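One way to close the gap between writing data and storing offsets is to keep both in the same transactional store, so they change together or not at all. The sketch below illustrates that idea with an in-memory stand-in. It is an assumption-laden example, not the approach used in these slides: `Store` and `writeBatch` are hypothetical names, and `synchronized` plays the role of a database transaction.

```scala
// Hypothetical in-memory stand-in for a transactional sink.
object ExactlyOnceSink {
  final class Store {
    private var rows: Vector[String] = Vector.empty
    private var nextOffset: Long = 0L

    def committedOffset: Long = synchronized(nextOffset)
    def contents: Vector[String] = synchronized(rows)

    // Apply a batch covering [fromOffset, untilOffset) and advance the stored
    // offset in one step. A batch that does not start at the stored offset
    // (for example a replay after a restart) is rejected and changes nothing.
    def writeBatch(fromOffset: Long, untilOffset: Long, batch: Vector[String]): Boolean =
      synchronized {
        if (fromOffset != nextOffset) false
        else {
          rows = rows ++ batch
          nextOffset = untilOffset
          true
        }
      }
  }
}
```

On restart, the job would read `committedOffset` from the store and resume from there, so a batch that was replayed after a crash is detected instead of being written twice.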
64. Our use cases: ETL to Data Warehouse
- Input events from Kafka
- Enrich events with some external data sources
- Finally store it to Hive
- We do NOT want duplicates
- We do NOT want to lose events
65. Our use cases: ETL to Data Warehouse
- Hive is not transactional
- Nor does it offer idempotent writes
- Writing files to HDFS is “atomic” (whole or nothing)
- A relation 1:1 from each partition-batch to file in HDFS
- Store to ZK the current state of the batch
- Store to ZK offsets of last finished batch
66. Our use cases: ETL to Data Warehouse
On failure:
- If an executor fails, just keep going (reschedule task)
> spark.task.maxFailures = 1000
- If the driver fails (or restarts):
- Load offsets and state from “current batch” if it exists and “finish” it (KafkaUtils.createRDD)
- Continue Stream from last saved offsets
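The recovery procedure above can be sketched with in-memory stand-ins. Everything here is hypothetical and only meant to show the control flow: `State` replaces ZooKeeper and HDFS, `fileName` is an invented naming scheme, and the `read` function stands in for re-reading a fixed offset range (which the slides do with KafkaUtils.createRDD).

```scala
// Hypothetical sketch of "finish the current batch, then resume".
object BatchRecovery {
  final case class Batch(fromOffset: Long, untilOffset: Long)

  // In-memory stand-ins for ZooKeeper state and HDFS files.
  final class State {
    var currentBatch: Option[Batch] = None // batch in flight, if any
    var lastFinishedOffset: Long = 0L      // end of the last finished batch
    var files: Map[String, Vector[String]] = Map.empty
  }

  // One file per batch, named from its offsets, so redoing a batch
  // replaces the same file instead of adding a duplicate.
  def fileName(b: Batch): String = s"events-${b.fromOffset}-${b.untilOffset}"

  def process(state: State, b: Batch, read: Batch => Vector[String]): Unit = {
    state.currentBatch = Some(b)                            // 1. record intent
    state.files = state.files.updated(fileName(b), read(b)) // 2. whole-or-nothing write
    state.lastFinishedOffset = b.untilOffset                // 3. mark as finished
    state.currentBatch = None
  }

  // On driver restart: finish the in-flight batch (if any) and return
  // the offset from which the stream should continue.
  def recover(state: State, read: Batch => Vector[String]): Long = {
    state.currentBatch.foreach(b => process(state, b, read))
    state.lastFinishedOffset
  }
}
```

The design relies on two properties named in the slides: each batch maps to exactly one file, and writing that file is whole-or-nothing. Redoing an interrupted batch is then harmless, whichever step the crash interrupted.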
67. Our use cases: Anomalies Detection
- Input events from Kafka
- Periodically load batch-computed model
- Detect when an offer stops converting (or converts too much)
- We do not care about losing some events (on restart)
- We always need to process the “real-time” stream
68. Our use cases: Anomalies Detection
- It’s useless to detect anomalies on a lagged stream!
- Actually it could be very bad
- Always restart stream on latest offsets
- Restart with “fresh” state
69. Our use cases: Store it to Entity Cache
- Input events from Kafka
- Almost no processing
- Store it to HBase (has idempotent writes)
- We do not care about duplicates
- We can NOT lose a single event
70. Our use cases: Store it to Entity Cache
- Since HBase has idempotent writes, we can write
events multiple times without hassle
- But, we do NOT start with earliest offsets…
- That would be 7 days of redundant writes…!!!
- We store offsets of last finished batch
- But obviously we might re-write some events on restart
or failure
71. Lessons Learned
- Do NOT use checkpointing!
- Not recoverable across upgrades
- Do your own checkpointing
- Track offsets yourself
- ZK, HDFS, DB…
- Memory might be an issue
- You do not want to waste it...
- Adjust batchDuration
- Adjust maxRatePerPartition
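The last two knobs bound how much data a single batch can hold. As a rough aid, the arithmetic can be written down as below, assuming the cap is set through spark.streaming.kafka.maxRatePerPartition (records per second per partition). `BatchSizing` and the average record size are hypothetical and must be replaced by your own measurements.

```scala
// Back-of-the-envelope sizing helper (hypothetical names).
object BatchSizing {
  // Upper bound on records in one batch:
  // rate (records/s/partition) x partitions x batch duration (s).
  def maxRecordsPerBatch(
      maxRatePerPartition: Long,
      partitions: Int,
      batchDurationSeconds: Long
  ): Long =
    maxRatePerPartition * partitions * batchDurationSeconds

  // Rough size of the raw records in a batch; ignores JVM object overhead.
  def approxBatchBytes(maxRecords: Long, avgRecordBytes: Long): Long =
    maxRecords * avgRecordBytes
}
```

For example, 1000 records per second per partition, 12 partitions and a 10 second batch give at most 120000 records per batch, or about 60 MB of raw payload at an assumed 500 bytes per record.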