Spark Streaming has supported Kafka since its inception, but a lot has changed since then, on both the Spark and Kafka sides, to make this integration more fault-tolerant and reliable. Apache Kafka 0.10 (actually since 0.9) introduced the new Consumer API, built on top of a new group coordination protocol provided by Kafka itself.
So a new Spark Streaming integration comes to the playground, with a design similar to the 0.8 Direct DStream approach. However, there are notable differences in usage, and many exciting new features. In this talk, we will cover the main differences between this new integration and the previous one (for Kafka 0.8), and why Direct DStreams have replaced Receivers for good. We will also see how to achieve different semantics (at-least-once, at-most-once, exactly-once) with code examples.
Finally, we will briefly introduce how we use this integration at Billy Mobile to ingest and process the continuous stream of events from our ad network.
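As a rough sketch of the at-least-once pattern mentioned above (Scala, using the spark-streaming-kafka-0-10 module; the broker address, topic and group id are placeholder values, not taken from the talk), offsets are committed back to Kafka only after each micro-batch has been processed:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object AtLeastOnceExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-0-10-direct")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Placeholder connection settings
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group",
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)   // we commit offsets ourselves
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
    )

    stream.foreachRDD { rdd =>
      // Capture the offset ranges of this micro-batch before processing it
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.map(_.value()).foreach(v => println(v))           // your processing goes here
      // Commit only after the work succeeded, giving at-least-once semantics
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

Committing after processing means a failed batch is simply reprocessed on restart, which is the at-least-once trade-off; exactly-once additionally requires an idempotent or transactional sink.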
Apache Kafka 0.8 basic training - Verisign (Michael Noll)
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
[March sn meetup] apache pulsar + apache nifi for cloud data lake (Timothy Spann)
https://www.meetup.com/new-york-city-apache-pulsar-meetup/events/283837865/
Learn how to use Apache Pulsar and Apache NiFi to Stream to your Data Lake
Discover how to stream data to and from your data lake or data mart using Apache Pulsar™ and Apache NiFi®. Learn how these cloud-native, scalable open-source projects built for streaming data pipelines work together to enable you to quickly build applications with minimal coding.
|WHAT THE SESSION WILL COVER|
Best Practices for using Pulsar and NiFi
A deep dive on Apache NiFi's Pulsar connector and demos
Building an End-to-End Application in the Hybrid Cloud
Attend for a chance to win a We <3 Pulsar t-shirt! The first 50 registrants who register through here [https://hubs.ly/Q013LTpn0] will be entered in a drawing!
------------------------
|AGENDA|
6:00 - 7:00 PM EST: Presentation - Tim Spann, StreamNative Developer Advocate
7:00 - 8:00 PM EST: Presentation - John Kuchmek, Cloudera Principal Solutions Engineer
8:00 - 8:30 PM EST: Q&A + Networking
------------------------
|ABOUT THE SPEAKERS|
John Kuchmek is a Principal Solutions Engineer for Cloudera. Before joining Cloudera, John transitioned to the Autonomous Intelligence team where he was in charge of integrating the platforms to allow data scientists to work with various types of data.
Tim Spann is a Developer Advocate for StreamNative. He works with StreamNative Cloud, Apache Pulsar™, Apache Flink®, Flink® SQL, Big Data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science. He is currently working on a book about the FLiP Stack.
Being Ready for Apache Kafka - Apache: Big Data Europe 2015 (Michael Noll)
These are the slides of my Kafka talk at Apache: Big Data Europe in Budapest, Hungary. Enjoy! --Michael
Apache Kafka is a high-throughput distributed messaging system that has become a mission-critical infrastructure component for modern data platforms. Kafka is used across a wide range of industries by thousands of companies such as Twitter, Netflix, Cisco, PayPal, and many others.
After a brief introduction to Kafka, this talk will provide an update on the growth and status of the Kafka project community. The rest of the talk will focus on walking the audience through what's required to put Kafka in production. We'll give an overview of the current Kafka ecosystem, including client libraries for creating your own apps, operational tools, and the peripheral components required for running Kafka in production and for integration with other systems like Hadoop. We will cover the upcoming project roadmap, which adds key features to make Kafka even more convenient to use and more robust in production.
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
This is the first part of the presentation.
Here is the second part of this presentation:
http://www.slideshare.net/knoldus/introduction-to-apache-kafka-part-2
Modern data systems don't just process massive amounts of data; they need to do it very fast. Using fraud detection as a convenient example, this session will include best practices on how to build real-time data processing applications using Apache Kafka. We'll explain how Kafka makes real-time processing almost trivial, discuss the pros and cons of the famous lambda architecture, help you choose a stream processing framework, and even talk about deployment options.
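As a flavour of how little client code is involved, here is a minimal poll loop with the plain Kafka consumer API (Scala; the broker address, the "payments" topic and the toy fraud check are illustrative placeholders, not taken from the session):

import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer

object FraudCheckConsumer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
    props.put("group.id", "fraud-check")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Collections.singletonList("payments"))   // placeholder topic

    try {
      while (true) {
        // Pull the next batch of records and score each one as it arrives
        val records = consumer.poll(Duration.ofMillis(500))
        records.forEach { record =>
          val suspicious = record.value().contains("UNUSUAL")   // stand-in for a real fraud model
          if (suspicious) println(s"flagged ${record.key()}: ${record.value()}")
        }
      }
    } finally consumer.close()
  }
}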
Using FLiP with influxdb for edgeai iot at scale 2022 (Timothy Spann)
https://adtmag.com/webcasts/2021/12/influxdata-february-10.aspx?tc=page0
FLiP Stack (Apache Flink, Apache Pulsar, Apache NiFi, Apache Spark) with InfluxDB for Edge AI and IoT workloads at scale
Tim Spann
Developer Advocate
StreamNative
datainmotion.dev
Meet Kafka, the distributed log, by Florian GARCIA (La Cuisine du Web)
Kafka is a bit of the new star on the message queue scene. Yet Kafka doesn't present itself as such: it is a distributed log!
So what is it? How does it work? And above all, how and why would I use it?
In this session, we take the beast apart and explain it all! On the agenda: concepts, use cases, streaming, and a hands-on experience report!
Introducing Kafka Streams, the new stream processing library of Apache Kafka,... (Michael Noll)
Video recording: https://www.youtube.com/watch?v=o7zSLNiTZbA
Slides of my talk at Berlin Buzzwords in June 2016.
Abstract:
"In the past few years Apache Kafka has established itself as the world's most popular real-time, large-scale messaging system. It is used across a wide range of industries by thousands of companies such as Netflix, Cisco, PayPal, Twitter, and many others.
In this session I am introducing the audience to Kafka Streams, which is the latest addition to the Apache Kafka project. Kafka Streams is a stream processing library natively integrated with Kafka. It has a very low barrier to entry, easy operationalization, and a high-level DSL for writing stream processing applications. As such it is the most convenient yet scalable option to process and analyze data that is backed by Kafka. We will provide the audience with an overview of Kafka Streams including its design and API, typical use cases, code examples, and an outlook of its upcoming roadmap. We will also compare Kafka Streams' light-weight library approach with heavier, framework-based tools such as Apache Storm and Spark Streaming, which require you to understand and operate a whole different infrastructure for processing real-time data in Kafka."
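For a feel of that high-level DSL, here is a minimal word-count sketch in Scala. It uses the kafka-streams-scala DSL that was added to the project after this talk (Kafka 2.x); topic names and the application id are placeholders:

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._   // Kafka 2.0-2.5; newer releases moved this to ...scala.serialization.Serdes
import org.apache.kafka.streams.scala.StreamsBuilder

object WordCountApp extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-example")   // placeholder ids
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()
  builder.stream[String, String]("text-input")                 // read a topic of text lines
    .flatMapValues(line => line.toLowerCase.split("\\W+"))     // split each line into words
    .groupBy((_, word) => word)                                // re-key the stream by word
    .count()                                                   // continuously updated count per word
    .toStream
    .to("word-counts")                                         // publish the changelog of counts

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.ShutdownHookThread(streams.close())
}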
Many architectures include both real-time and batch processing components. This often results in two separate pipelines performing similar tasks, which can be challenging to maintain and operate. We'll show how a single, well-designed ingest pipeline can be used for both real-time and batch processing, making the desired architecture feasible for scalable production use cases.
Since April 2016, Spark-as-a-service has been available to researchers in Sweden from the Swedish ICT SICS Data Center at www.hops.site. Researchers work in an entirely UI-driven environment on a platform built with only open-source software.
Spark applications can be either deployed as jobs (batch or streaming) or written and run directly from Apache Zeppelin. Spark applications are run within a project on a YARN cluster with the novel property that Spark applications are metered and charged to projects. Projects are also securely isolated from each other and include support for project-specific Kafka topics. That is, Kafka topics are protected from access by users that are not members of the project. In this talk we will discuss the challenges in building multi-tenant Spark streaming applications on YARN that are metered and easy to debug. We show how we use the ELK stack (Elasticsearch, Logstash, and Kibana) for logging and debugging running Spark streaming applications, how we use Grafana and Graphite for monitoring Spark streaming applications, and how users can debug and optimize terminated Spark Streaming jobs using Dr Elephant. We will also discuss the experiences of our users (over 120 users as of Sept 2016): how they manage their Kafka topics and quotas, patterns for how users share topics between projects, and our novel solutions for helping researchers debug and optimize Spark applications.
To conclude, we will also give an overview on our course ID2223 on Large Scale Learning and Deep Learning, in which 60 students designed and ran SparkML applications on the platform.
Apache Kafka and Apache Pulsar are both popular messaging frameworks. Apache Kafka has a big user base, and people will want to know how Kafka and Pulsar are the same or different in many respects. This talk will cover the key differences and how Pulsar adds new features that are missing in Kafka.
We will cover:
The architectural differences and similarities in Pulsar and Kafka. Show use of BookKeeper and what that allows.
The Producer API and functionality differences. Show HelloWorld for both.
The Consumer API and functionality differences. Show HelloWorld for both.
The core use case and functionality differences. Show Pulsar as handling all of Kafka’s use cases and new ones that aren’t possible with Kafka.
This talk will allow people who are choosing between Kafka and Pulsar to have a more accurate and in-depth understanding of the differences between them. For companies considering a switch from Kafka to Pulsar, this talk will give them the cheatsheet to go back and make a more informed decision.
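As a small taste of the Pulsar side of such a comparison, a hello-world producer and consumer with the Pulsar Java client (driven here from Scala) might look like the sketch below; the service URL, topic and subscription names are invented for illustration:

import org.apache.pulsar.client.api.{PulsarClient, Schema, SubscriptionType}

object PulsarHelloWorld extends App {
  val client = PulsarClient.builder()
    .serviceUrl("pulsar://localhost:6650")         // placeholder broker URL
    .build()

  // Producer side: publish a single string message
  val producer = client.newProducer(Schema.STRING)
    .topic("hello-topic")
    .create()
  producer.send("hello pulsar")
  producer.close()

  // Consumer side: subscribe and receive that message
  val consumer = client.newConsumer(Schema.STRING)
    .topic("hello-topic")
    .subscriptionName("hello-subscription")
    .subscriptionType(SubscriptionType.Shared)     // subscription modes are one place Pulsar differs from Kafka
    .subscribe()
  val msg = consumer.receive()
  println(s"received: ${msg.getValue}")
  consumer.acknowledge(msg)                        // per-message acknowledgement rather than Kafka-style offsets

  consumer.close()
  client.close()
}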
8 Lessons Learned from Using Kafka in 1500 microservices - confluent streamin... (Natan Silnitsky)
Kafka is the bedrock of Wix's distributed microservices system. For the last 5 years we have learned a lot about how to successfully scale our event-driven architecture to roughly 1500 microservices.
We've managed to achieve higher decoupling and independence for our various services and dev teams, which have very different use cases, while maintaining a single uniform infrastructure.
In these slides you will learn about 8 key decisions and steps you can take in order to safely scale up your Kafka-based system. These include:
* How to increase dev velocity of event-driven style code.
* How to optimize working with Kafka in a polyglot setting.
* How to support a growing amount of traffic and a growing number of developers.
Denodo DataFest 2017: Outpace Your Competition with Real-Time Responses (Denodo)
Watch the presentation on-demand now: https://goo.gl/kceFTe
Today’s digital economy demands a new way of running business. Flexible access to information and responses in real time are essential for outpacing competition.
Watch this Denodo DataFest 2017 session to discover:
• Data access challenges faced by organizations today.
• How data virtualization facilitates real-time analytics.
• Key use cases and customer success stories.
Rethinking Stream Processing with Apache Kafka: Applications vs. Clusters, St... (Michael Noll)
My talk at Google DevFest Switzerland, Fribourg, Oct 2017.
https://devfest.ch/schedule/day1?sessionId=118
Abstract:
Modern businesses have data at their core, and this data is changing continuously. How can we harness this torrent of information in real-time? The answer is stream processing, and the technology that has since become the core platform for streaming data is Apache Kafka.
Among the thousands of companies that use Kafka to transform and reshape their industries are the likes of Netflix, Uber, PayPal, and AirBnB, but also established players such as Goldman Sachs, Cisco, and Oracle. Unfortunately, today’s common architectures for real-time data processing at scale suffer from complexity: there are many technologies that need to be stitched and operated together, and each individual technology is often complex by itself. This has led to a strong discrepancy between how we, as engineers, would like to work vs. how we actually end up working in practice.
In this session we talk about how Apache Kafka helps you to radically simplify your data architectures. We cover how you can now build normal applications to serve your real-time processing needs — rather than building clusters or similar special-purpose infrastructure — and still benefit from properties such as high scalability, distributed computing, and fault-tolerance, which are typically associated exclusively with cluster technologies. We discuss common use cases to realize that stream processing in practice often requires database-like functionality, and how Kafka allows you to bridge the worlds of streams and databases when implementing your own core business applications (inventory management for large retailers, patient monitoring in healthcare, fleet tracking in logistics, etc), for example in the form of event-driven, containerized microservices. We will also give a brief shout-out to the recently launched KSQL, a streaming SQL engine for Apache Kafka.
Reactive Microservices with Apache Kafka / Denis Ivanov (2GIS) (Ontico)
HighLoad++ 2017
Delhi + Calcutta hall, November 7, 14:00
Abstract:
http://www.highload.ru/2017/abstracts/3031.html
Apache Kafka is a fairly popular open-source platform for processing message streams. The distributed log abstraction underlying Kafka makes it possible to use it as a queueing system, while also providing some very useful advantages that are not available even in ESB-level solutions.
...
Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent) (Ontico)
HighLoad++ 2017
Delhi + Calcutta hall, November 8, 17:00
Abstract:
http://www.highload.ru/2017/abstracts/2978.html
When you are running systems in production, clearly you want to make sure they are up and running at all times. But in a distributed system such as Apache Kafka… what does “up and running” even mean?
...
Enabling Real-Time Business with Change Data Capture (MapR Technologies)
Machine learning (ML) and artificial intelligence (AI) enable intelligent processes that can autonomously make decisions in real-time. The real challenge for effective ML and AI is getting all relevant data to a converged data platform in real-time, where it can be processed using modern technologies and integrated into any downstream systems.
Kafka is a high-throughput, fault-tolerant, scalable platform for building high-volume, near-real-time data pipelines. This presentation is about tuning Kafka pipelines for high performance.
It discusses the configuration parameters and deployment topologies essential to achieving higher throughput and low latency across the pipeline, along with lessons learned in troubleshooting and optimizing a truly global data pipeline that replicates 100 GB of data in under 25 minutes.
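As an illustrative (not authoritative) example of the kind of producer-side knobs such tuning involves, a throughput-oriented configuration might look like this Scala sketch; the broker list and all values are placeholders to be tuned per workload:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object TunedProducer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092,broker2:9092")   // placeholder brokers
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  // Throughput-oriented knobs; example values only, tune per workload
  props.put("acks", "1")                                        // trade some durability for latency
  props.put("linger.ms", "20")                                  // wait up to 20 ms to build bigger batches
  props.put("batch.size", (256 * 1024).toString)                // larger per-partition batches
  props.put("compression.type", "lz4")                          // cheaper network/disk at some CPU cost
  props.put("buffer.memory", (64L * 1024 * 1024).toString)      // room for in-flight batches

  val producer = new KafkaProducer[String, String](props)
  producer.send(new ProducerRecord[String, String]("events", "key", "value"))
  producer.flush()
  producer.close()
}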
Modern Stream Processing With Apache Flink @ GOTO Berlin 2017 (Till Rohrmann)
In our fast-moving world it becomes more and more important for companies to gain near real-time insights from their data to make faster decisions. These insights not only provide a competitive edge over one's rivals but also enable a company to create completely new services and products. Among other things, predictive user interfaces and online recommendations can be implemented when large amounts of data can be processed in real time.
Apache Flink, one of the most advanced open source distributed stream processing platforms, allows you to extract business intelligence from your data in near real-time. With Apache Flink it is possible to process billions of messages with milliseconds latency. Moreover, its expressive APIs allow you to quickly solve your problems, ranging from classical analytical workloads to distributed event-driven applications.
In this talk, I will introduce Apache Flink and explain how it enables users to develop distributed applications and process analytical workloads alike. Starting with Flink’s basic concepts of fault-tolerance, statefulness and event-time aware processing, we will take a look at the different APIs and what they allow us to do. The talk will be concluded by demonstrating how we can use Flink’s higher level abstractions such as FlinkCEP and StreamSQL to do declarative stream processing.
Apache Storm 0.9 basic training - Verisign (Michael Noll)
Apache Storm 0.9 basic training (130 slides) covering:
1. Introducing Storm: history, Storm adoption in the industry, why Storm
2. Storm core concepts: topology, data model, spouts and bolts, groupings, parallelism
3. Operating Storm: architecture, hardware specs, deploying, monitoring
4. Developing Storm apps: Hello World, creating a bolt, creating a topology, running a topology, integrating Storm and Kafka, testing, data serialization in Storm, example apps, performance and scalability tuning
5. Playing with Storm using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/09/15/apache-storm-training-deck-and-tutorial/
Many thanks to the Twitter Engineering team (the creators of Storm) and the Apache Storm open source community!
Denodo DataFest 2017: Integrating Big Data and Streaming Data with Enterprise... (Denodo)
Watch live presentation here: https://goo.gl/UcZEHU
Big data projects are becoming mature and consistent. However, they remain siloed compared to the enterprise data. In addition, new streaming data now needs to be integrated as well.
Watch this Denodo DataFest 2017 session to discover:
• How big data projects can be combined with other enterprise data.
• How to integrate streaming data into the mix.
• Benefits of aggregating the data without having to move them into a centralized repository.
Apache Kafka is becoming the message bus for transferring huge volumes of data from various sources into Hadoop.
It's also enabling many real-time system frameworks and use cases.
Managing and building clients around Apache Kafka can be challenging. In this talk, we will go through the best practices for deploying Apache Kafka
in production: how to secure a Kafka cluster, how to pick topic partitions, how to upgrade to newer versions, and how to migrate to the new Kafka producer and consumer APIs.
We will also talk about the best practices involved in running producers and consumers.
In the Kafka 0.9 release, we've added SSL wire encryption, SASL/Kerberos for user authentication, and pluggable authorization. Kafka now allows authentication of users and access control over who can read and write to a Kafka topic. Apache Ranger also uses a pluggable authorization mechanism to centralize security for Kafka and other Hadoop ecosystem projects.
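As a rough sketch of what these security features look like from the client side (Scala; the broker host, truststore path and password are placeholders, and the Kerberos principal normally comes from a separate JAAS file):

import java.util.Properties

object SecureClientPropsExample extends App {
  // Client properties for a SASL/Kerberos + SSL secured cluster (Kafka 0.9+); hosts and paths are placeholders
  val props = new Properties()
  props.put("bootstrap.servers", "broker1:9093")                 // TLS listener
  props.put("security.protocol", "SASL_SSL")                     // encrypt the wire and authenticate the client
  props.put("sasl.kerberos.service.name", "kafka")
  props.put("ssl.truststore.location", "/etc/security/kafka.client.truststore.jks")
  props.put("ssl.truststore.password", "changeit")
  // The Kerberos principal and keytab are normally supplied via a JAAS file, e.g.
  //   -Djava.security.auth.login.config=/etc/kafka/kafka_client_jaas.conf
  println(props)
}

A producer or consumer created with such properties can then only access the topics its authenticated principal has been granted ACLs for.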
We will showcase an open-sourced Kafka REST API and an admin UI that help users create topics, reassign partitions, issue Kafka ACLs, and monitor consumer offsets.
Data Stream Processing - Concepts and Frameworks (Matthias Niehoff)
An overview of various concepts used in data stream processing. Most of them address problems around time, focusing on processing time compared to event time. The techniques shown include the Dataflow API as introduced by Google and the concept of stream-table duality. I will also bring up other problems, such as data lookup and the deployment of streaming applications, and various strategies for solving them.
In the end I will give a brief outline on the implementation status of those strategies in the popular streaming frameworks Apache Spark Streaming, Apache Flink and Kafka Streams.
Building a real-time pipeline from scratch that is able to handle billion+ transactions per day, store, analyze and visualize it all in real-time has never been easier. In this build-as-we-go talk, we’ll create a front-to-back architecture that does exactly that.
* we’ll start with a simple producer emitting a few messages and publishing them onto a Kafka queue
* on the consuming end of the queue, a Spark-based Streamliner process will pick them up and store them in MemSQL
* ZoomData will connect to MemSQL for real-time visualization where we’ll be able to ask various questions and see answers change as data is flowing through the system
* we’ll quickly make the entire pipeline more complex by increasing the amount of data as well as complexity of the data, until reaching 100K transactions per second
As we walk through this demo, we will touch on cross data-center Kafka and MemSQL set-ups, speed limitations if any as well as echo back to real-life use cases of a similar set-up used in Goldman’s Asset Management division for the purposes of Portfolio Management & Trading.
Spark as part of a Hybrid RDBMS Architecture - John Leach, Cofounder Splice Machine (Data Con LA)
In this talk, we will discuss how we use Spark as part of a hybrid RDBMS architecture that includes Hadoop and HBase. The optimizer evaluates each query and sends OLTP traffic (including CRUD queries) to HBase and OLAP traffic to Spark. We will focus on the challenges of handling the tradeoffs inherent in an integrated architecture that simultaneously handles real-time and batch traffic. Lessons learned include:
- Embedding Spark into an RDBMS
- Running Spark on YARN and isolating OLTP traffic from OLAP traffic
- Accelerating the generation of Spark RDDs from HBase
- Customizing the Spark UI
The lessons learned can also be applied to other hybrid systems, such as Lambda architectures.
Bio:
John Leach is the CTO and Co-Founder of Splice Machine. With over 15 years of software experience under his belt, John’s expertise in analytics and BI drives his role as Chief Technology Officer. Prior to Splice Machine, John founded Incite Retail in June 2008 and led the company’s strategy and development efforts. At Incite Retail, he built custom Big Data systems (leveraging HBase and Hadoop) for Fortune 500 companies. Prior to Incite Retail, he ran the business intelligence practice at Blue Martini Software and built strategic partnerships with integration partners. John was a key subject matter expert for Blue Martini Software in many strategic implementations across the world. His focus at Blue Martini was helping clients incorporate decision support knowledge into their current business processes utilizing advanced algorithms and machine learning. John received dual bachelor’s degrees in biomedical and mechanical engineering from Washington University in Saint Louis. Leach is the organizer emeritus for the Saint Louis Hadoop Users Group and is active in the Washington University Elliot Society.
Webinar - Highly recommended: how to use machine learning to turn data into... (Cloudera, Inc.)
Companies today are able to ingest and manage their data with relative ease. The challenge now is to recognize and understand the hidden patterns in that data in order to generate added value. Because of the sheer volume of data, traditional approaches usually fail at this. The result: organizations struggle to truly innovate and differentiate themselves.
Spark Streaming + Kafka 0.10: an integration story by Joan Viladrosa Riera at... (Big Data Spain)
Spark Streaming has supported Kafka since its inception, but a lot has changed since then, on both the Spark and Kafka sides, to make this integration more fault-tolerant and reliable.
https://www.bigdataspain.org/2017/talk/spark-streaming-kafka-0-10-an-integration-story
Big Data Spain 2017
16th - 17th Kinépolis Madrid
JConWorld_ Continuous SQL with Kafka and Flink (Timothy Spann)
JConWorld: Continuous SQL with Kafka and Flink
In this talk, I will walk through how someone can set up and run continuous SQL queries against Kafka topics utilizing Apache Flink. We will walk through creating Kafka topics, schemas, and publishing data.
We will then cover consuming Kafka data, joining Kafka topics, and inserting new events into Kafka topics as they arrive. This basic overview will show hands-on techniques, tips, and examples of how to do this.
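To give a feel for what such a continuous query looks like, here is a minimal sketch using Flink's Table API from Scala (recent Flink, 1.13+ style; the topic, broker address and schema are invented for illustration):

import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

object ContinuousKafkaSql extends App {
  val tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode())

  // Declare a Kafka topic as a dynamic table (placeholder topic, brokers and schema)
  tEnv.executeSql(
    """CREATE TABLE clicks (
      |  user_id STRING,
      |  url     STRING,
      |  ts      TIMESTAMP(3),
      |  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
      |) WITH (
      |  'connector' = 'kafka',
      |  'topic' = 'clicks',
      |  'properties.bootstrap.servers' = 'localhost:9092',
      |  'scan.startup.mode' = 'earliest-offset',
      |  'format' = 'json'
      |)""".stripMargin)

  // A continuous query: clicks per user per minute, updated as new events arrive
  tEnv.executeSql(
    """SELECT user_id,
      |       TUMBLE_START(ts, INTERVAL '1' MINUTE) AS window_start,
      |       COUNT(*) AS clicks
      |FROM clicks
      |GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)""".stripMargin)
    .print()
}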
Tim Spann is the Principal Developer Advocate for Data in Motion @ Cloudera where he works with Apache Kafka, Apache Flink, Apache NiFi, Apache Iceberg, TensorFlow, Apache Spark, big data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal Field Engineer at Cloudera, a Senior Solutions Architect at AirisData and a senior field engineer at Pivotal. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on big data, the IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as IoT Fusion, Strata, ApacheCon, Data Works Summit Berlin, DataWorks Summit Sydney, and Oracle Code NYC. He holds a BS and MS in computer science. https://www.datainmotion.dev/p/about-me.html https://dzone.com/users/297029/bunkertor.html
https://www.youtube.com/channel/UCDIDMDfje6jAvNE8DGkJ3_w?view_as=subscriber
In this presentation Guido Schmutz talks about Apache Kafka, Kafka Core, Kafka Connect, Kafka Streams, Kafka and "Big Data"/"Fast Data Ecosystems, Confluent Data Platform and Kafka in Architecture.
Kafka Connect & Kafka Streams/KSQL - the ecosystem around Kafka (Guido Schmutz)
After a quick overview and introduction of Apache Kafka, this session covers two components which extend the core of Apache Kafka: Kafka Connect and Kafka Streams/KSQL.
Kafka Connect's role is to access data from the outside world and make it available inside Kafka by publishing it into a Kafka topic. Kafka Connect is also responsible for transporting information from inside Kafka to the outside world, which could be a database or a file system. There are many existing connectors for different source and target systems available out of the box, either provided by the community or by Confluent or other vendors. You simply configure these connectors and off you go.
Kafka Streams is a lightweight component which extends Kafka with stream processing functionality. With it, Kafka can not only reliably and scalably transport events and messages through the Kafka broker, but also analyse and process these events in real time. Interestingly, Kafka Streams does not provide its own cluster infrastructure, and it is also not meant to run on a Kafka cluster. The idea is to run Kafka Streams where it makes sense, which can be inside a "normal" Java application, inside a web container, or on a more modern containerized (cloud) infrastructure such as Mesos, Kubernetes, or Docker. Kafka Streams has a lot of interesting features, such as reliable state handling, queryable state, and much more. KSQL is a streaming engine for Apache Kafka, providing a simple and completely interactive SQL interface for processing data in Kafka.
Apache Kafka - Scalable Message-Processing and more! (Guido Schmutz)
Independent of the source of data, the integration of event streams into an enterprise architecture gets more and more important in the world of sensors, social media streams, and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable messaging broker built for exchanging huge amounts of messages between a source and a target.
This session will start with an introduction to Apache Kafka, present its role in a modern data / information architecture, and cover the advantages it brings to the table. Additionally, the Kafka ecosystem will be covered, as well as the integration of Kafka into the Oracle stack, with products such as Golden Gate, Service Bus, and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.
Kafka is primarily used to build real-time streaming data pipelines and applications that adapt to the data streams. It combines messaging, storage, and stream processing to allow storage and analysis of both historical and real-time data.
OSSNA Building Modern Data Streaming Apps (Timothy Spann)
OSSNA
Building Modern Data Streaming Apps
https://ossna2023.sched.com/event/1Jt05/virtual-building-modern-data-streaming-apps-with-open-source-timothy-spann-streamnative
Timothy Spann
Cloudera
Principal Developer Advocate
Data in Motion
In my session, I will show you some best practices I have discovered over the last seven years in building data streaming applications, including IoT, CDC, Logs, and more. In my modern approach, we utilize several open-source frameworks to maximize all the best features. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Pulsar. From there, we build streaming ETL with Apache Spark and enhance events with Pulsar Functions for ML and enrichment. We make continuous queries against our topics with Flink SQL. We will stream data into various open-source data stores, including Apache Iceberg, Apache Pinot, and others. We use the best streaming tools for the current applications with the open source stack - FLiPN. https://www.flipn.app/ Updates: This will be in-person with live coding based on feedback from the crowd. This will also include new data stores, new sources, and data relevant to and from the Vancouver area. This will also include updates to the platforms and inclusion of Apache Iceberg, Apache Pinot and some other new tech.
https://github.com/tspannhw/SpeakerProfile Tim Spann is a Principal Developer Advocate for Cloudera. He works with Apache Kafka, Apache Flink, Flink SQL, Apache NiFi, MiniFi, Apache MXNet, TensorFlow, Apache Spark, Big Data, the IoT, machine learning, and deep learning. Tim has over a decade of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Timothy J Spann
Cloudera
Principal Developer Advocate
Hightstown, NJ
Website: https://datainmotion.dev/
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K... (Timothy Spann)
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and Kafka
Apache NiFi, Apache Flink, Apache Kafka
Timothy Spann
Principal Developer Advocate
Cloudera
Data in Motion
https://budapestdata.hu/2023/en/speakers/timothy-spann/
Timothy Spann
Principal Developer Advocate
Cloudera (US)
LinkedIn · GitHub · datainmotion.dev
June 8 · Online · English talk
Building Modern Data Streaming Apps with NiFi, Flink and Kafka
In my session, I will show you some best practices I have discovered over the last 7 years in building data streaming applications including IoT, CDC, Logs, and more.
In my modern approach, we utilize several open-source frameworks to maximize the best features of all. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Kafka. From there we build streaming ETL with Apache Flink SQL. We will stream data into Apache Iceberg.
We use the best streaming tools for the current applications with FLaNK. flankstack.dev
BIO
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Fast Streaming into Clickhouse with Apache Pulsar (Timothy Spann)
https://github.com/tspannhw/SpeakerProfile/tree/main/2022/talks
Fast Streaming into Clickhouse with Apache Pulsar
https://github.com/tspannhw/FLiPC-FastStreamingIntoClickhouseWithApachePulsar
https://www.meetup.com/San-Francisco-Bay-Area-ClickHouse-Meetup/events/285271332/
Fast Streaming into Clickhouse with Apache Pulsar - Meetup 2022
StreamNative - Apache Pulsar - Stream to Altinity Cloud - Clickhouse
May the 4th Be With You!
04-May-2022 ClickHouse Meetup
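-- Local MergeTree table on each ClickHouse node, holding the raw Jetson/IoT classification events streamed in from Pulsar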
CREATE TABLE iotjetsonjson_local
(
uuid String,
camera String,
ipaddress String,
networktime String,
top1pct String,
top1 String,
cputemp String,
gputemp String,
gputempf String,
cputempf String,
runtime String,
host String,
filename String,
host_name String,
macaddress String,
te String,
systemtime String,
cpu String,
diskusage String,
memory String,
imageinput String
)
ENGINE = MergeTree()
PARTITION BY uuid
ORDER BY (uuid);
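-- Cluster-wide distributed table that fans reads and writes out to the per-node local tables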
CREATE TABLE iotjetsonjson ON CLUSTER '{cluster}' AS iotjetsonjson_local
ENGINE = Distributed('{cluster}', default, iotjetsonjson_local, rand());
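-- Example queries over the ingested telemetry (classification confidence, CPU/GPU temperatures, etc.)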
select uuid, top1pct, top1, gputempf, cputempf
from iotjetsonjson
where toFloat32OrZero(top1pct) > 40
order by toFloat32OrZero(top1pct) desc, systemtime desc
select uuid, systemtime, networktime, te, top1pct, top1, cputempf, gputempf, cpu, diskusage, memory,filename
from iotjetsonjson
order by systemtime desc
select top1, max(toFloat32OrZero(top1pct)), max(gputempf), max(cputempf)
from iotjetsonjson
group by top1
select top1, max(toFloat32OrZero(top1pct)) as maxTop1, max(gputempf), max(cputempf)
from iotjetsonjson
group by top1
order by maxTop1
Tim Spann
Developer Advocate
StreamNative
Apache Kafka - Scalable Message Processing and more! (Guido Schmutz)
After a quick overview and introduction of Apache Kafka, this session covers two components which extend the core of Apache Kafka: Kafka Connect and Kafka Streams/KSQL.
Kafka Connect's role is to access data from the outside world and make it available inside Kafka by publishing it into a Kafka topic. Kafka Connect is also responsible for transporting information from inside Kafka to the outside world, which could be a database or a file system. There are many existing connectors for different source and target systems available out of the box, either provided by the community or by Confluent or other vendors. You simply configure these connectors and off you go.
Kafka Streams is a lightweight component which extends Kafka with stream processing functionality. With it, Kafka can not only reliably and scalably transport events and messages through the Kafka broker, but also analyse and process these events in real time. Interestingly, Kafka Streams does not provide its own cluster infrastructure, and it is also not meant to run on a Kafka cluster. The idea is to run Kafka Streams where it makes sense, which can be inside a "normal" Java application, inside a web container, or on a more modern containerized (cloud) infrastructure such as Mesos, Kubernetes, or Docker. Kafka Streams has a lot of interesting features, such as reliable state handling, queryable state, and much more. KSQL is a streaming engine for Apache Kafka, providing a simple and completely interactive SQL interface for processing data in Kafka.
Apache Spark 2.4 comes packed with a lot of new functionalities and improvements, including the new barrier execution mode, flexible streaming sink, the native AVRO data source, PySpark’s eager evaluation mode, Kubernetes support, higher-order functions, Scala 2.12 support, and more.
Observing Intraday Indicators Using Real-Time Tick Data on Apache Superset an... (DataWorks Summit)
The Central Bank of the Republic of Turkey is primarily responsible for steering the monetary and exchange rate policies in Turkey.
One of the major core functions of the Bank is market operations. In this context, analyzing and interpreting real-time tick data related to money market instruments has become not only a requirement but also a challenge.
For this use case, an API provided by one of the financial data vendors has been used to gather real-time tick data and data routing has been orchestrated by Apache NiFi.
Gathered data is being transferred to Kafka topics and then handed off to Druid for real-time indexing tasks.
Indicators such as effective cost, bid-ask spread, price impact measures, return reversal are calculated using Apache Storm and finally visualized by means of Apache Superset in order to provide decision-makers with a new set of tools.
10 Lessons Learned from using Kafka in 1000 microservices - ScalaUA (Natan Silnitsky)
Kafka is the bedrock of Wix’s distributed Mega Microservices system.
Over the years we have learned a lot about how to successfully scale our event-driven architecture to roughly 1400 mostly Scala microservices.
In this talk, you will learn about 10 key decisions and steps you can take in order to safely scale up your Kafka-based system.
These include:
* How to increase dev velocity of event-driven style code.
* How to optimize working with Kafka in a polyglot setting
* How to migrate from request-reply to event-driven
* How to tackle multiple DCs environment.
Apache Kafka - Scalable Message Processing and more! (Guido Schmutz)
In the world of sensors and social media streams, the integration and handling of high-volume event streams is more important than ever. Events have to be handled both efficiently and reliably, and often many consumers or systems are interested in all or part of the events. How do we make sure that all these events are accepted and forwarded in an efficient and reliable way? Apache Kafka, a distributed, highly scalable messaging broker built for exchanging huge amounts of messages between a source and a target, can be of great help in such a scenario.
This session introduces Apache Kafka and its place in a modern architecture, shows its integration with the Oracle stack, and presents the Oracle Event Hub cloud service, the managed Kafka service.
Apache Kafka - A modern Stream Processing Platform (Guido Schmutz)
After a quick overview and introduction of Apache Kafka, this session covers two components which extend the core of Apache Kafka: Kafka Connect and Kafka Streams/KSQL.
Kafka Connect's role is to access data from the outside world and make it available inside Kafka by publishing it into a Kafka topic. Kafka Connect is also responsible for transporting information from inside Kafka to the outside world, which could be a database or a file system. There are many existing connectors for different source and target systems available out of the box, either provided by the community or by Confluent or other vendors. You simply configure these connectors and off you go.
Kafka Streams is a lightweight component which extends Kafka with stream processing functionality. With it, Kafka can not only reliably and scalably transport events and messages through the Kafka broker, but also analyse and process these events in real time. Interestingly, Kafka Streams does not provide its own cluster infrastructure, and it is also not meant to run on a Kafka cluster. The idea is to run Kafka Streams where it makes sense, which can be inside a "normal" Java application, inside a web container, or on a more modern containerized (cloud) infrastructure such as Mesos, Kubernetes, or Docker. Kafka Streams has a lot of interesting features, such as reliable state handling, queryable state, and much more. KSQL is a streaming engine for Apache Kafka, providing a simple and completely interactive SQL interface for processing data in Kafka.
Connector Corner: Automate dynamic content and events by pushing a button (DianaGray10)
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an “infrastructure container Kubernetes guy”, how this fancy AI technology gets managed from an infrastructure and operations point of view. Is it possible to apply our lovely cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide you with a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply them to our own infrastructure and make them work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could be beneficial or limiting for your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply doing machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These gains will only materialize when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio, using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, as well as newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes real work: it takes vision, leadership and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
1. Joan Viladrosa, Billy Mobile
Apache Spark Streaming
+ Kafka 0.10: An Integration Story
#EUstr5
2. About me
Joan Viladrosa Riera
@joanvr
joanviladrosa
joan.viladrosa@billymob.com
2#EUstr5
Degree In Computer Science
Advanced Programming Techniques &
System Interfaces and Integration
Co-Founder, Educabits
Educational Big data solutions
using AWS cloud
Big Data Developer, Trovit
Hadoop and MapReduce Framework
SEM keywords optimization
Big Data Architect & Tech Lead
BillyMobile
Full architecture with Hadoop:
Kafka, Storm, Hive, HBase, Spark, Druid, …
5. What is
Apache
Kafka?
- Publish - Subscribe
Message System
- Fast
- Scalable
- Durable
- Fault-tolerant
What makes it great?
5#EUstr5
6. What is Apache Kafka?
As a central point
[Diagram: many producers publish into Kafka; many consumers read from it]
6#EUstr5
7. What is Apache Kafka?
A lot of different connectors
[Diagram: Apache Storm, Apache Spark, custom Java apps and loggers publish into Kafka; Apache Storm, Apache Spark, custom Java apps and monitoring tools consume from it]
7#EUstr5
8. Kafka
Terminology
Topic: A feed of messages
Producer: Processes that publish messages to a topic
Consumer: Processes that subscribe to topics and process the feed of published messages
Broker: Each server of a Kafka cluster that holds, receives and sends the actual data
8#EUstr5
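To make the terminology concrete, here is a minimal sketch of a producer publishing one message to a topic, using the standard Kafka Java client from Scala; the broker address, topic name, key and value are assumptions.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

val props = new Properties()
props.put("bootstrap.servers", "broker01:9092")                 // assumed broker address
props.put("key.serializer", classOf[StringSerializer].getName)
props.put("value.serializer", classOf[StringSerializer].getName)

// A producer publishes messages to a topic; brokers store them, consumers subscribe to them.
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("topicA", "key-1", "hello kafka"))
producer.close()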
13. Kafka Semantics
In short: consumer
delivery semantics are
up to you, not Kafka
- Kafka doesn’t store the
state of the consumers*
- It just sends you what
you ask for (topic,
partition, offset, length)
- You have to take care of
your state
13#EUstr5
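A minimal sketch of what "the state is yours" means in practice with the consumer API: you assign a partition and seek to the offset you tracked yourself. The broker address, topic, partition and offset below are placeholders.

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "broker01:9092")                     // assumed broker address
props.put("group.id", "my-group")
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)
props.put("enable.auto.commit", "false")                            // we keep the offsets ourselves

val consumer = new KafkaConsumer[String, String](props)
val partition = new TopicPartition("topicA", 0)
consumer.assign(Collections.singletonList(partition))
consumer.seek(partition, 2000L)                                     // start exactly where our own state says
consumer.poll(1000L).asScala.foreach(r => println(s"${r.offset}: ${r.value}"))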
16. - Process streams of data
- Micro-batching approach
What is
Apache
Spark
Streaming?
16#EUstr5
17. - Process streams of data
- Micro-batching approach
- Same API as Spark
- Same integrations as Spark
- Same guarantees &
semantics as Spark
What makes it great?
What is
Apache
Spark
Streaming?
17#EUstr5
18. What is Apache Spark Streaming?
Relying on the same Spark Engine: “same syntax” as batch jobs
https://spark.apache.org/docs/latest/streaming-programming-guide.html 18
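A minimal sketch of the "same syntax" point: the classic word count, written against a DStream instead of an RDD. The socket source and the 5-second batch interval are arbitrary choices for illustration.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-word-count")
val ssc = new StreamingContext(conf, Seconds(5))              // 5-second micro-batches

val lines = ssc.socketTextStream("localhost", 9999)           // any input DStream would do
lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                                         // the same transformations as a batch job
  .print()

ssc.start()
ssc.awaitTermination()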
19. How does it work?
- Discretized Streams
https://spark.apache.org/docs/latest/streaming-programming-guide.html 19
20. How does it work?
- Discretized Streams
https://spark.apache.org/docs/latest/streaming-programming-guide.html 20
21. How does it work?
21https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html
22. How does it work?
22https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html
23. Spark
Streaming
Semantics
As in Spark:
- No guarantee of exactly-once semantics for output actions
- Any side-effecting output
operations may be repeated
- Because of node failure, process
failure, etc.
So, be careful when outputting to
external sources
Side effects
23#EUstr5
25. Spark Streaming Kafka Integration Timeline
Timeline: sep-2014 → dec-2016 (Spark 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 2.0, 2.1)
- Receivers
- Fault Tolerant WAL + Python API
- Direct Streams + Python API
- Improved Streaming UI
- Metadata in UI (offsets) + Graduated Direct
- Native Kafka 0.10 (experimental)
25#EUstr5
26. Kafka Receiver (≤ Spark 1.1)
[Diagram: a Receiver running inside an Executor continuously receives data using the High Level API and updates offsets in ZooKeeper; the Driver launches jobs on the received data]
26#EUstr5
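For reference, a minimal sketch of this receiver-based integration (the spark-streaming-kafka artifact for Kafka 0.8); ssc is an existing StreamingContext, and the ZooKeeper quorum, group id and topic name are placeholders.

import org.apache.spark.streaming.kafka.KafkaUtils

// Uses the High Level API under the hood; offsets are tracked in ZooKeeper, as in the diagram above.
val receiverStream = KafkaUtils.createStream(
  ssc,                      // an existing StreamingContext
  "zk01:2181,zk02:2181",    // ZooKeeper quorum (placeholder)
  "my-consumer-group",      // consumer group id
  Map("topicA" -> 2)        // topics and number of receiver threads per topic
)
receiverStream.map { case (_, value) => value }.print()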
27. Kafka Receiver with WAL (Spark 1.2)
[Diagram: as above, but the Receiver also writes the received data to a Write Ahead Log on HDFS before the Driver launches jobs on it; offsets are still updated in ZooKeeper]
27#EUstr5
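Getting the behaviour on this slide needs two explicit settings: the receiver WAL flag and a checkpoint directory. A minimal sketch, with an assumed HDFS path:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-receiver-with-wal")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")   // write received blocks to the WAL

val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs:///user/streaming/checkpoints")              // assumed HDFS path for WAL and metadata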
29. Kafka Receiver with WAL (Spark 1.2)
[Diagram: after a failure, the restarted Driver recreates the Spark and Streaming Contexts, restarts the computation from checkpoint info and relaunches jobs; block metadata and block data are recovered from the log, the Receiver is restarted and unacked data is resent]
29#EUstr5
30. Kafka Receiver with WAL (Spark 1.2)
[Diagram: same as slide 27: the Receiver writes to the WAL on HDFS and updates offsets in ZooKeeper while the Driver launches jobs on the data]
30#EUstr5
32. Direct Kafka Integration w/o Receivers or WALs
(Spark 1.3)
[Diagram: 1. The Driver queries the latest offsets from Kafka and decides offset ranges for the next batch]
32#EUstr5
33. Direct Kafka Integration w/o Receivers or WALs
(Spark 1.3)
[Diagram: 1. The Driver queries the latest offsets and decides offset ranges for the batch (topic1, p1: 2000–2100; p2: 2010–2110; p3: 2002–2102). 2. The Driver launches jobs on the Executors using those offset ranges]
33#EUstr5
34. Direct Kafka Integration w/o Receivers or WALs
(Spark 1.3)
[Diagram: steps 1 and 2 as before; 3. the Executors read the data for their offset ranges directly from Kafka using the Simple Consumer API]
34#EUstr5
35. Direct Kafka Integration w/o Receivers or WALs
(Spark 1.3)
[Diagram: the complete picture: the Driver decides offset ranges (topic1, p1: 2000–2100; p2: 2010–2110; p3: 2002–2102), launches jobs using those ranges, and the Executors read the corresponding data with the Simple Consumer API]
35#EUstr5
36. Direct Kafka Integration w/o Receivers or WALs
(Spark 1.3)
[Diagram: same as the previous slide]
36#EUstr5
37. Direct Kafka Integration w/o Receivers or WALs
(Spark 1.3)
[Diagram: the three steps without the example ranges: query latest offsets and decide offset ranges, launch jobs using those ranges, read the data for those ranges with the Simple Consumer API]
37#EUstr5
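A minimal sketch of the 0.8 direct API just described (KafkaUtils.createDirectStream from spark-streaming-kafka); ssc is an existing StreamingContext, and the broker addresses and topic name are placeholders.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// The driver computes offset ranges per batch; executors read them with the Simple Consumer API.
val kafkaParams = Map("metadata.broker.list" -> "broker01:9092,broker02:9092")
val topics = Set("topicA")

val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
directStream.map(_._2).print()    // values only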
38. Direct Kafka
API benefits
- No WALs or Receivers
- Allows end-to-end
exactly-once semantics
pipelines *
* updates to downstream systems should be
idempotent or transactional
- More fault-tolerant
- More efficient
- Easier to use.
38#EUstr5
41. What about Spark 2.0+ and
new Kafka Integration?
This is why we are here, right?
41#EUstr5
42. Spark 2.0+ new Kafka Integration
Feature: spark-streaming-kafka-0-8 / spark-streaming-kafka-0-10
Broker Version: 0.8.2.1 or higher / 0.10.0 or higher
API Stability: Stable / Experimental
Language Support: Scala, Java, Python / Scala, Java
Receiver DStream: Yes / No
Direct DStream: Yes / Yes
SSL / TLS Support: No / Yes
Offset Commit API: No / Yes
Dynamic Topic Subscription: No / Yes
42#EUstr5
43. What’s really
New with this
New Kafka
Integration?
- New Consumer API
* Instead of Simple API
- Location Strategies
- Consumer Strategies
- SSL / TLS
- No Python API :(
43#EUstr5
44. Location Strategies
- The new consumer API pre-fetches messages into buffers
- So, cached consumers are kept on the executors
- It's better to schedule partitions on the hosts that already have the appropriate consumers
44#EUstr5
45. Location Strategies
- PreferConsistent
Distribute partitions evenly across available executors
- PreferBrokers
If your executors are on the same hosts as your Kafka brokers
- PreferFixed
Specify an explicit mapping of partitions to hosts
45#EUstr5
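A minimal sketch of picking a location strategy in the 0.10 integration; PreferConsistent is the usual choice, while PreferFixed pins partitions to specific executor hosts. The topic and hostnames below are placeholders.

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

val consistent = LocationStrategies.PreferConsistent            // spread partitions evenly

val fixed = LocationStrategies.PreferFixed(Map(
  new TopicPartition("topicA", 0) -> "worker01.example.com",    // placeholder hostnames
  new TopicPartition("topicA", 1) -> "worker02.example.com"
))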
46. Consumer Strategies
- New consumer API has a number of different
ways to specify topics, some of which require
considerable post-object-instantiation setup.
- ConsumerStrategies provides an abstraction
that allows Spark to obtain properly configured
consumers even after restart from checkpoint.
46#EUstr5
47. Consumer Strategies
- Subscribe subscribe to a fixed collection of topics
- SubscribePattern use a regex to specify topics of
interest
- Assign specify a fixed collection of partitions
● Overloaded constructors to specify the starting offset
for a particular partition.
● ConsumerStrategy is a public class that you can extend.
47#EUstr5
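A minimal sketch of the alternatives to Subscribe; kafkaParams is the same Map[String, Object] of consumer settings shown in the basic-usage example a few slides later, and the topic names, pattern and offsets are placeholders.

import java.util.regex.Pattern
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// SubscribePattern: follow every topic matching a regex
val byPattern = ConsumerStrategies.SubscribePattern[String, String](
  Pattern.compile("events-.*"), kafkaParams)

// Assign: a fixed set of partitions, each with an explicit starting offset
val partitions = List(new TopicPartition("topicA", 0), new TopicPartition("topicA", 1))
val fromOffsets = Map(
  new TopicPartition("topicA", 0) -> 2000L,    // e.g. recovered from your own offset store
  new TopicPartition("topicA", 1) -> 2010L
)
val byAssign = ConsumerStrategies.Assign[String, String](partitions, kafkaParams, fromOffsets)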
48. SSL/TLS encryption
- New consumer API supports SSL
- Only applies to communication between Spark
and Kafka brokers
- Still responsible for separately securing Spark
inter-node communication
48#EUstr5
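These are the standard Kafka client SSL settings, added to the consumer parameter map; a minimal sketch, with placeholder paths and passwords:

val sslParams = Map[String, Object](
  "security.protocol"       -> "SSL",
  "ssl.truststore.location" -> "/etc/kafka/ssl/client.truststore.jks",   // placeholder path
  "ssl.truststore.password" -> "changeit",                               // placeholder password
  "ssl.keystore.location"   -> "/etc/kafka/ssl/client.keystore.jks",
  "ssl.keystore.password"   -> "changeit"
)
// merge these into the regular kafkaParams map before building the ConsumerStrategy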
49. How to use
New Kafka
Integration on
Spark 2.0+
Scala Example Code
Basic usage
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "broker01:9092,broker02:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "stream_group_id",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("topicA", "topicB")
val stream = KafkaUtils.createDirectStream[String, String](
streamingContext,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
stream.map(record => (record.key, record.value))
49#EUstr5
50. How to use
New Kafka
Integration on
Spark 2.0+
Scala Example Code
Getting metadata
stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges]
.offsetRanges
rdd.foreachPartition { iter =>
val osr: OffsetRange = offsetRanges(
TaskContext.get.partitionId)
// get any needed data from the offset range
val topic = osr.topic
val kafkaPartitionId = osr.partition
val begin = osr.fromOffset
val end = osr.untilOffset
}
}
50#EUstr5
53. How to use
New Kafka
Integration on
Spark 2.0+
Scala Example Code
Getting metadata
stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges]
.offsetRanges
rdd.foreachPartition { iter =>
val osr: OffsetRange = offsetRanges(
TaskContext.get.partitionId)
// get any needed data from the offset range
val topic = osr.topic
val kafkaPartitionId = osr.partition
val begin = osr.fromOffset
val end = osr.untilOffset
}
}
53#EUstr5
54. How to use
New Kafka
Integration on
Spark 2.0+
Scala Example Code
Store offsets in Kafka itself:
Commit API
stream.foreachRDD { rdd =>
val offsetRanges = rdd.asInstanceOf[HasOffsetRanges]
.offsetRanges
// DO YOUR STUFF with DATA
stream.asInstanceOf[CanCommitOffsets]
.commitAsync(offsetRanges)
}
54#EUstr5
55. Kafka + Spark Semantics
- At most once
- At least once
- Exactly once
55#EUstr5
56. Kafka + Spark
Semantics
- We don’t want duplicates
- Not worth the hassle of ensuring that
messages don’t get lost
- Example: Sending statistics over UDP
1. Set spark.task.maxFailures to 1
2. Make sure spark.speculation is false
(the default)
3. Set Kafka param auto.offset.reset
to “latest” (“largest” with the old 0.8 consumer)
4. Set Kafka param enable.auto.commit
to true
At most once
56#EUstr5
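Putting the at-most-once recipe above into code; a minimal sketch, reusing the kafkaParams map from the basic-usage example. Note that the 0.10 consumer spells the reset value "latest" rather than the old "largest".

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("at-most-once-stream")
  .set("spark.task.maxFailures", "1")      // fail fast instead of retrying tasks
  .set("spark.speculation", "false")       // the default, stated explicitly

val atMostOnceParams = kafkaParams ++ Map[String, Object](
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (true: java.lang.Boolean)   // offsets advance even if processing later fails
)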
57. Kafka + Spark
Semantics
- This will mean you lose messages on
restart
- At least they shouldn’t get replayed.
- Test this carefully if it’s actually
important to you that a message never
gets repeated, because it’s not a
common use case.
At most once
57#EUstr5
58. Kafka + Spark
Semantics
- We don’t want to lose any record
- We don’t care about duplicates
- Example: Sending internal alerts on
relatively rare occurrences in the stream
1. Set spark.task.maxFailures > 1000
2. Set Kafka param auto.offset.reset
to “earliest” (“smallest” with the old 0.8 consumer)
3. Set Kafka param enable.auto.commit
to false
At least once
58#EUstr5
59. Kafka + Spark
Semantics
- Don’t be silly! Do NOT replay your whole
log on every restart…
- Manually commit the offsets when you
are 100% sure records are processed
- If this is “too hard”, you’d better have a
relatively short retention log
- Or be REALLY ok with duplicates. For
example, you are outputting to an
external system that handles duplicates
for you (HBase)
At least once
59#EUstr5
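A minimal sketch of the "commit only when you are sure" pattern with the 0.10 commit API; writeToExternalSystem is a hypothetical sink standing in for whatever processes and stores your records.

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { records =>
    writeToExternalSystem(records)   // hypothetical; may be re-run on failure, so duplicates are possible
  }
  // only after the batch's output has succeeded do we move the committed offsets forward
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}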
60. Kafka + Spark
Semantics
- We don’t want to lose any record
- We don’t want duplicates either
- Example: Storing stream in data
warehouse
1. We need some kind of idempotent writes,
or whole-or-nothing writes (transactions)
2. Only store offsets EXACTLY after writing
data
3. Same parameters as at least once
Exactly once
60#EUstr5
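A minimal sketch of the ordering this slide describes; writeBatchIdempotently and saveOffsets are hypothetical helpers standing in for an idempotent (or all-or-nothing) warehouse write and your own offset store.

import org.apache.spark.streaming.kafka010.HasOffsetRanges

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  writeBatchIdempotently(rdd)    // 1. idempotent or all-or-nothing write of the whole batch

  saveOffsets(offsetRanges)      // 2. only then record how far we got (ZK, HDFS, RDBMS, ...)
}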
61. Kafka + Spark
Semantics
- Probably the hardest to achieve right
- Still some small chance of failure if your
app fails just between writing data and
committing offsets… (but REALLY small)
Exactly once
61#EUstr5
62. Apache Kafka
Apache Spark
at Billy Mobile
62
15B records monthly
35TB weekly retention log
6K events/second
x4 growth/year
63. Our use
cases
- Input events from Kafka
- Enrich events with some
external data sources
- Finally store it to Hive
We do NOT want duplicates
We do NOT want to lose events
ETL to Data
Warehouse
63
64. Our use
cases
- Hive is not transactional
- Nor does it support idempotent writes
- Writing files to HDFS is “atomic”
(whole or nothing)
- A 1:1 relation from each
partition-batch to a file in HDFS
- Store to ZK the current state of
the batch
- Store to ZK offsets of last
finished batch
ETL to Data
Warehouse
64
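A rough sketch of the pattern described on this slide (not Billy Mobile's actual code); writePartitionFile, markBatchState and saveOffsetsToZk are hypothetical helpers for the HDFS file writes and the ZooKeeper state.

import org.apache.spark.streaming.kafka010.HasOffsetRanges

stream.foreachRDD { (rdd, batchTime) =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  markBatchState(batchTime, "STARTED")               // current state of the batch, kept in ZK
  rdd.foreachPartition { records =>
    writePartitionFile(batchTime, records)           // one whole-or-nothing file per partition-batch
  }
  markBatchState(batchTime, "FINISHED")
  saveOffsetsToZk(offsetRanges)                      // offsets of the last finished batch
}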
65. Our use
cases
- Input events from Kafka
- Periodically load
batch-computed model
- Detect when an offer stops
converting (or converts too much)
- We do not care about losing
some events (on restart)
- We always need to process the
“real-time” stream
Anomalies detector
65
66. Our use
cases
- It’s useless to detect anomalies
on a lagged stream!
- Actually it could be very bad
- Always restart stream on latest
offsets
- Restart with “fresh” state
Anomalies detector
66
67. Our use
cases
- Input events from Kafka
- Almost no processing
- Store it to HBase
- (has idempotent writes)
- We do not care about duplicates
- We can NOT lose a single event
Store to
Entity Cache
67
68. Our use
cases
- Since HBase has idempotent writes,
we can write events multiple times
without hassle
- But, we do NOT start with earliest
offsets…
- That would be 7 days of redundant writes…!!!
- We store offsets of last finished batch
- But obviously we might re-write some
events on restart or failure
Store to
Entity Cache
68
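A minimal sketch of why these writes are idempotent: the row key is derived from the event itself, so writing the same event twice just overwrites the same cell. The table and column family names are placeholders, and events are assumed to carry a non-null key.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = connection.getTable(TableName.valueOf("entity_cache"))   // placeholder table name
    records.foreach { record =>
      val put = new Put(Bytes.toBytes(record.key))                       // deterministic row key
      put.addColumn(Bytes.toBytes("e"), Bytes.toBytes("value"), Bytes.toBytes(record.value))
      table.put(put)                                                     // re-running overwrites, not duplicates
    }
    table.close()
    connection.close()
  }
}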
69. Lessons
Learned
- Do NOT use checkpointing
- Not recoverable across code upgrades
- Do your own checkpointing
- Track offsets yourself
- In general, more reliable:
HDFS, ZK, RDBMS...
- Memory usually is an issue
- You don’t want to waste it
- Adjust batchDuration
- Adjust maxRatePerPartition
69
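A minimal sketch of the last two points; the batch duration and the rate cap are placeholder numbers that have to be tuned against your own processing time and memory budget.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("tuned-kafka-stream")
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")   // max records per second per partition

val ssc = new StreamingContext(conf, Seconds(30))              // batchDuration the cluster can keep up with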