Flink, Spark, and Storm are three popular streaming platforms compared here on performance. A benchmark simulated an advertising analytics pipeline with events streamed through Kafka. Flink and Storm showed similar, roughly linear latency growth with throughput. Spark had higher latency due to micro-batching but sustained higher throughput. At very high throughput, Storm performed best with acknowledgments disabled, while Flink provided low latency while retaining processing guarantees. Overall, the platforms demonstrated tradeoffs among latency, throughput, and exactly-once processing.
Introducing Apache Kafka - a visual overview. Presented at the Canberra Big Data Meetup 7 February 2019. We build a Kafka "postal service" to explain the main Kafka concepts, and explain how consumers receive different messages depending on whether there's a key or not.
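The key-vs-no-key routing described above can be sketched in a few lines. This is a toy model, not Kafka's actual partitioner (the real default hashes keys with murmur2); the hash function and partition count here are illustrative assumptions:

```python
# Toy model of Kafka partition routing: records with a key always land in the
# same partition; keyless records are spread round-robin across partitions.
import zlib
from itertools import count

NUM_PARTITIONS = 3
_round_robin = count()

def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Keyed records map deterministically; keyless records rotate."""
    if key is None:
        return next(_round_robin) % num_partitions
    # Real Kafka uses murmur2; crc32 stands in for any stable hash here.
    return zlib.crc32(key.encode()) % num_partitions

# Same key -> same partition, so one consumer sees all of that key's messages.
assert choose_partition("letterbox-42") == choose_partition("letterbox-42")
```

This is why, in the "postal service" analogy, a keyed letter always reaches the same letterbox, while unkeyed letters are delivered to whichever box is next in rotation.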
The document discusses routed networks in OpenStack Neutron. It describes how routed networks implement layer 3 connectivity while allowing scalability by associating subnets to network segments. Key points include new Neutron APIs for segments and ports in routed networks, integration with the Nova scheduler, and options for implementing distributed virtual routing with features like floating IPs, multiple availability zones, and BGP routing.
The document discusses the history and growth of Jenkins, an open source automation server. It began in 2004 as a personal project by Kohsuke Kawaguchi to automate builds. Over time it grew popular and now has over 470 plugins to support various tasks. The number of plugins and releases has increased dramatically each year as more developers contribute to and use Jenkins.
The engineering teams within Splunk have used several technologies (Kinesis, SQS, RabbitMQ, and Apache Kafka) for enterprise-wide messaging over the past few years, but recently decided to pivot toward Apache Pulsar, migrating existing use cases and embedding it into new cloud-native service offerings such as the Splunk Data Stream Processor (DSP).
No data loss pipeline with Apache Kafka (Jiangjie Qin)
The document discusses how to configure Apache Kafka to prevent data loss and message reordering in a data pipeline. It recommends settings such as blocking when the producer buffer is full, using acks=all for synchronous message acknowledgment, limiting in-flight requests, and committing offsets only after messages are processed. It also suggests replicating topics across at least 3 brokers and setting a minimum in-sync replica count of 2. Mirror makers can further ensure no data loss or reordering by consuming from one cluster and producing to another in order while committing offsets. Custom consumer listeners and message handlers allow for mirroring optimizations.
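The settings listed above map to concrete Kafka configuration keys. The sketch below uses the 0.9-era key names (block.on.buffer.full was later replaced by max.block.ms, and replication factor is actually set at topic creation, not in a config file); treat it as a summary of the talk's recommendations, not a drop-in config:

```python
# No-data-loss settings from the talk, as plain Kafka config keys.
producer_config = {
    "acks": "all",                 # wait for all in-sync replicas to ack
    "retries": 2147483647,         # retry transient errors instead of dropping
    "max.in.flight.requests.per.connection": 1,  # avoid reordering on retry
    "block.on.buffer.full": True,  # back-pressure instead of discarding
}
topic_config = {
    "replication.factor": 3,       # each partition on at least 3 brokers
    "min.insync.replicas": 2,      # acks=all requires >= 2 live replicas
}
consumer_config = {
    "enable.auto.commit": False,   # commit offsets only after processing
}
```

The combination matters: acks=all without min.insync.replicas >= 2 still acknowledges writes that only one broker holds, and retries without capping in-flight requests can reorder messages.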
The document provides an overview of the Agile methodology, including its history, principles, characteristics, and popular methods like Scrum and Extreme Programming (XP). It describes how Agile evolved in the 1990s as an alternative to heavyweight methods like the Waterfall model. Key aspects of Agile include iterative development, frequent delivery of working software, collaboration between self-organizing cross-functional teams, and responding to change over following a plan.
Software Engineering Management Framework - Building an Awesome Software Engi... (Jonathan Fulton)
A framework I codified to help me manage and scale our Software Engineering team at VideoBlocks. If you're in engineering management, whether at a small startup or larger company, you'll likely find some useful tidbits that will help you on your journey.
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t... (Spark Summit)
The document discusses securing Spark notebooks for data science by integrating Kerberos authentication. It begins with an overview of Spark notebooks and the current authentication approach. It then covers the requirements for Kerberos integration, how Kerberos works in HDFS and Yarn clusters, and a proposed design to integrate Kerberos into JupyterHub, SparkMagic and Livy to authenticate users and allow secured access to HDFS and Spark from notebooks. Key aspects of the design include custom JupyterHub authenticators and spawners, obtaining service tickets from the KDC, and propagating user identities through the system.
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2.... (Databricks)
Richard Garris presented on ways to productionize machine learning models built with Apache Spark MLlib. He discussed serializing models using MLlib 2.X to save models for production use without reimplementation. This allows data scientists to build models in Python/R and deploy them directly for scoring. He also reviewed model scoring architectures and highlighted Databricks' private beta solution for deploying serialized Spark MLlib models for low latency scoring outside of Spark.
This document provides definitions and explanations of key terms and artifacts used in Scrum project management. It describes the product backlog, sprint backlog, daily scrum, sprint planning meeting, sprint review, and sprint retrospective. It also outlines the roles of the product owner, scrum master, and scrum team, and includes a glossary of additional Scrum terms.
This document discusses best practices for migrating from an existing Rundeck installation to a newer version. It outlines key questions to consider regarding project structure and settings storage. Two migration approaches are described - an in-place upgrade or new server installation. Preparation steps like backups and shared resource configuration are covered. The document provides guidance on project import, database settings, and other post-migration configuration topics.
Openstack Neutron, interconnections with BGP/MPLS VPNs (Thomas Morin)
This document discusses the OpenStack Neutron networking-bgpvpn project, which provides a Neutron API and service plugin that allow tenants to interconnect their OpenStack networks and routers with BGP/MPLS VPNs. The API exposes constructs such as BGPVPNs, network associations, and router associations. It works with drivers for Neutron/OVS, OpenDaylight, OpenContrail, and others. The goal is to provide a common, controller-agnostic way for tenants to control interconnections. The project is part of OpenStack and OPNFV, and provides a model for integrating telco functionality into OpenStack.
This SlideShare will help users understand the agile software development methodology and how it works. It also defines the whole process of implementing the Scrum methodology.
Patterns (et anti-patterns) d’architecture ou comment mieux concevoir ses app... (Microsoft)
Developing enterprise and line-of-business applications is becoming increasingly difficult. Applications are ever more complex, and functional specifications are unstable, changing at the whim of clients. To manage these difficulties, proven good practices exist, and new architecture and design trends are emerging. Whether you are a developer or an architect, this technically oriented session will interest you: it presents various patterns (and anti-patterns to avoid) for better designing the architecture of your business applications and thus better absorbing these difficulties. It covers several recurring problems and how to solve them effectively, illustrated throughout with examples of possible implementations. It also introduces emerging approaches such as Domain-Driven Design (DDD) and architectural styles such as Command and Query Responsibility Segregation (CQRS), and what these concepts can bring.
Scrum is an iterative and incremental agile software development framework for managing product development. Diceus follows this methodology in a variety of projects, which gives us and our clients an invaluable advantage during the development life cycle. The result of this approach is a stable and successful product.
You can find more information about the Scrum methodology and Business Intelligence in our blog:
http://blog.diceus.com/
This document provides an introduction to agile project management. It begins by contrasting traditional project management, which relies on upfront planning, with agile project management, which uses iterative development cycles. The key principles of agile project management are then outlined, including a focus on customer value, iterative and incremental delivery, experimentation and adaptation, self-organization, and continuous improvement. Popular agile methods like Scrum, Extreme Programming, and others are briefly described. The remainder of the document focuses on how the Scrum methodology works in practice and some of the challenges of applying agile principles to large projects.
Apache Kafka 0.8 basic training - Verisign (Michael Noll)
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
Kafka is a distributed messaging system that allows for publishing and subscribing to streams of records, known as topics. Producers write data to topics and consumers read from topics. The data is partitioned and replicated across clusters of machines called brokers for reliability and scalability. A common data format like Avro can be used to serialize the data.
Kafka and Avro with Confluent Schema Registry (Jean-Paul Azar)
The document discusses Confluent Schema Registry, which stores and manages Avro schemas for Kafka clients. It allows producers and consumers to serialize and deserialize Kafka records to and from Avro format. The Schema Registry performs compatibility checks between the schema used by producers and consumers, and handles schema evolution if needed to allow schemas to change over time in a backwards compatible manner. It provides APIs for registering, retrieving, and checking compatibility of schemas.
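The compatibility checking described above can be illustrated with a toy version of the backward-compatibility rule the Schema Registry enforces for Avro: a new schema version may add a field only if that field has a default that old records can fill in. This sketch is an illustration of the rule, not the registry's implementation; the field names are invented:

```python
# Toy backward-compatibility check: can records written with the old schema
# still be read with the new one?
def is_backward_compatible(old_fields, new_fields):
    """Fields are dicts of name -> default value (None means no default)."""
    for name, default in new_fields.items():
        if name not in old_fields and default is None:
            return False  # new required field: old records can't supply it
    return True

v1 = {"id": None, "name": None}
v2 = {"id": None, "name": None, "email": ""}    # added with a default: OK
v3 = {"id": None, "name": None, "ssn": None}    # added, no default: breaks

assert is_backward_compatible(v1, v2)
assert not is_backward_compatible(v1, v3)
```

The real registry supports several modes (backward, forward, full) and applies the full Avro resolution rules, but the principle is the same: evolution is allowed only when existing readers or writers are not broken.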
HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce (Cloudera, Inc.)
Most developers are familiar with the topic of “database design”. In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.
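One of the recurring row-key patterns in HBase schema-design talks is a composite key combining a salt (to spread hot writers across regions) with an inverted timestamp (so the newest rows sort first). The sketch below is a hedged illustration of that pattern under assumed field names, not a schema from the talk:

```python
# Composite HBase-style row key: salt | entity id | inverted timestamp.
import struct
import zlib

MAX_LONG = 2**63 - 1

def row_key(user_id: str, event_ts_ms: int) -> bytes:
    inverted = MAX_LONG - event_ts_ms             # newest-first scan order
    salt = zlib.crc32(user_id.encode()) % 16      # spread hot writers
    return f"{salt:02d}|{user_id}|".encode() + struct.pack(">q", inverted)

# Within one user's bucket, a newer event's key sorts before an older one,
# so a scan from the row-key prefix returns newest events first.
k_new = row_key("u1", 2_000)
k_old = row_key("u1", 1_000)
assert k_new < k_old
```

The design trade-off is that salting sacrifices a single global scan order for write parallelism: a full time-ordered read must merge the 16 salt buckets.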
Developing Real-Time Data Pipelines with Apache Kafka (Joe Stein)
Apache Kafka is a distributed streaming platform that allows for building real-time data pipelines and streaming apps. It provides a publish-subscribe messaging system with persistence that allows for building real-time streaming applications. Producers publish data to topics which are divided into partitions. Consumers subscribe to topics and process the streaming data. The system handles scaling and data distribution to allow for high throughput and fault tolerance.
Hadoop Summit Europe 2014: Apache Storm Architecture (P. Taylor Goetz)
Storm is an open-source distributed real-time computation system. It uses a distributed messaging system to reliably process streams of data. The core abstractions in Storm are spouts, which are sources of streams, and bolts, which are basic processing elements. Spouts and bolts are organized into topologies that represent the flow of data. Storm provides fault tolerance through message acknowledgments, which give at-least-once processing guarantees. Trident, a high-level abstraction built on Storm, supports operations like aggregations, joins, and state management through its micro-batch oriented, stream-based API, and can provide exactly-once semantics.
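The spout/bolt/topology abstraction can be shown with a deliberately minimal toy model (real Storm runs spouts and bolts as distributed, acked tasks; here they are just chained Python generators, and the word-count example is a stand-in):

```python
# Toy Storm-style topology: a spout emits tuples, bolts transform them,
# and the "topology" is simply how they are wired together.
def sentence_spout():
    """Spout: the source of the stream."""
    yield from ["storm processes streams", "spouts feed bolts"]

def split_bolt(stream):
    """Bolt: split each sentence tuple into word tuples."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Terminal bolt: aggregate word counts."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the topology: spout -> split -> count.
counts = count_bolt(split_bolt(sentence_spout()))
assert counts["storm"] == 1 and counts["bolts"] == 1
```

What the toy omits is exactly what Storm adds: parallel task instances per bolt, stream groupings that route tuples between them, and the ack tree that replays a tuple if any downstream bolt fails to process it.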
Kafka Tutorial - basics of the Kafka streaming platform (Jean-Paul Azar)
Introduction to the Kafka streaming platform. Covers Kafka architecture with some small examples from the command line, then expands on this with a multi-server example. Lastly, it adds simple Java client examples for a Kafka producer and a Kafka consumer. The Java examples have been expanded to correlate with the design discussion of Kafka, and the Kafka design section has been expanded with added references.
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
Real time Analytics with Apache Kafka and Apache Spark (Rahul Jain)
A combined presentation and workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing streaming applications to be written quickly and easily in both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper, and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a user clicks on while web browsing.
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv (Amazon Web Services)
Low-latency analytics is becoming a very popular scenario. In this session we will discuss several architectural options for doing analytics on moving data using Amazon Kinesis and EMR/Spark Streaming, and share some best practices and real-world examples.
Headaches and Breakthroughs in Building Continuous Applications (Databricks)
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Spark Streaming allows processing live data streams using small batch sizes to provide low latency results. It provides a simple API to implement complex stream processing algorithms across hundreds of nodes. Spark SQL allows querying structured data using SQL or the Hive query language and integrates with Spark's batch and interactive processing. MLlib provides machine learning algorithms and pipelines to easily apply ML to large datasets. GraphX extends Spark with an API for graph-parallel computation on property graphs.
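The micro-batch model behind Spark Streaming is simple to state: chop the stream into small batches and process each with ordinary batch logic, trading a little latency for throughput. A toy illustration (not Spark's implementation, which batches by time interval rather than by count):

```python
# Micro-batching: group a stream of records into small fixed-size batches,
# then apply a batch computation to each.
def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

events = [1, 2, 3, 4, 5, 6, 7]
# Per-batch aggregation, as a stand-in for a Spark Streaming job.
results = [sum(b) for b in micro_batches(events, 3)]
assert results == [6, 15, 7]
```

The batch size (interval, in Spark's case) is the latency knob: smaller batches mean fresher results but more per-batch scheduling overhead, which is exactly the trade-off the Flink/Spark/Storm benchmark at the top of this page observed.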
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap... (Landon Robinson)
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
Presented by Landon Robinson and Jack Chapa
Extending Spark Streaming to Support Complex Event Processing (Oh Chan Kwon)
In this talk, we introduce the extensions of Spark Streaming to support (1) SQL-based query processing and (2) elastic-seamless resource allocation. First, we explain the methods of supporting window queries and query chains. As we know, last year, Grace Huang and Jerry Shao introduced the concept of “StreamSQL” that can process streaming data with SQL-like queries by adapting SparkSQL to Spark Streaming. However, we made advances in supporting complex event processing (CEP) based on their efforts. In detail, we implemented the sliding window concept to support a time-based streaming data processing at the SQL level. Here, to reduce the aggregation time of large windows, we generate an efficient query plan that computes the partial results by evaluating only the data entering or leaving the window and then gets the current result by merging the previous one and the partial ones. Next, to support query chains, we made the result of a query over streaming data be a table by adding the “insert into” query. That is, it allows us to apply stream queries to the results of other ones. Second, we explain the methods of allocating resources to streaming applications dynamically, which enable the applications to meet a given deadline. As the rate of incoming events varies over time, resources allocated to applications need to be adjusted for high resource utilization. However, the current Spark's resource allocation features are not suitable for streaming applications. That is, the resources allocated will not be freed when new data are arriving continuously to the streaming applications even though the quantity of the new ones is very small. In order to resolve the problem, we consider their resource utilization. If the utilization is low, we choose victim nodes to be killed. Then, we do not feed new data into the victims to prevent a useless recovery issuing when they are killed. Accordingly, we can scale-in/-out the resources seamlessly.
This document summarizes a presentation on extending Spark Streaming to support complex event processing. It discusses:
1) Motivations for supporting CEP in Spark Streaming, as current Spark is not enough to support continuous query languages or auto-scaling of resources.
2) Proposed solutions including extending Intel's Streaming SQL package, improving windowed aggregation performance, supporting "Insert Into" queries to enable query chains, and implementing elastic resource allocation through auto-scaling in/out of resources.
3) Evaluation of the Streaming SQL extensions showing low processing delays despite heavy loads or large windows, though more memory optimization is needed.
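The windowed-aggregation optimization described above (evaluate only the data entering or leaving the window, then merge with the previous result) can be sketched for the simplest case, a sliding sum. This is an illustration of the incremental idea, not the talk's query-plan implementation:

```python
# Incremental sliding-window sum: instead of re-aggregating the whole
# window on each slide, adjust the previous result by the one record
# entering and (once the window is full) the one record leaving.
from collections import deque

class SlidingSum:
    def __init__(self, window_size):
        self.window = deque()
        self.size = window_size
        self.total = 0

    def add(self, value):
        self.total += value                      # entering record
        self.window.append(value)
        if len(self.window) > self.size:
            self.total -= self.window.popleft()  # leaving record
        return self.total

w = SlidingSum(window_size=3)
assert [w.add(v) for v in [1, 2, 3, 4]] == [1, 3, 6, 9]
```

Each slide costs O(1) regardless of window length, which is why the approach keeps aggregation time low even for large windows; the same merge trick extends to any aggregate with an inverse (count, sum, mean), though not to ones like max without extra bookkeeping.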
WSO2Con USA 2017: Scalable Real-time Complex Event Processing at Uber (WSO2)
The Marketplace data team at Uber has built a scalable complex event processing platform to solve many challenging real-time data needs for various Uber products. This platform has been in production for more than a year and supports over 100 real-time data use cases with a team of 3. In this talk, we will share the details of the design and our experience, and how we employ Siddhi, Kafka, and Samza at scale.
Data Stream Processing with Apache Flink (Fabian Hueske)
This talk is an introduction to stream processing with Apache Flink. I gave this talk at the Madrid Apache Flink Meetup on February 25th, 2016.
The talk discusses Flink's features, shows its DataStream API, and explains the benefits of event-time stream processing. It gives an outlook on some features that will be added after the 1.0 release.
The need for gleaning answers from unbounded data streams is moving from a nicety to a necessity. Netflix is a data-driven company that needs to process over 1 trillion events a day, amounting to 3 PB of data, to derive business insights.
To ease extracting insight, we are building a self-serve, scalable, fault-tolerant, multi-tenant "Stream Processing as a Service" platform so the user can focus on data analysis. I'll share our experience using Flink to help build the platform.
Flexible and Real-Time Stream Processing with Apache Flink (DataWorks Summit)
This document provides an overview of stream processing with Apache Flink. It discusses the rise of stream processing and how it enables low-latency applications and real-time analysis. It then describes Flink's stream processing capabilities, including pipelining of data, fault tolerance through checkpointing and recovery, and integration with batch processing. The document also summarizes Flink's programming model, state management, and roadmap for further development.
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ... (Landon Robinson)
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
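The Listener interface itself is a Scala/Java callback API (classes like `SparkListener` with hooks such as `onStageCompleted`). As a language-neutral illustration only, the callback-registry pattern behind it can be sketched in Python; the class and event names below are simplified stand-ins, not the real Spark API:

```python
# Illustrative sketch of the callback pattern behind Spark's Listener
# interface; class and event names are simplified stand-ins, not the
# real Spark API.
class StageCompleted:
    def __init__(self, stage_id, duration_ms):
        self.stage_id = stage_id
        self.duration_ms = duration_ms

class MetricsListener:
    """Collects per-stage durations, as a monitoring listener might."""
    def __init__(self):
        self.durations = {}
    def on_stage_completed(self, event):
        self.durations[event.stage_id] = event.duration_ms

class ListenerBus:
    """Dispatches each posted event to every registered listener."""
    def __init__(self):
        self._listeners = []
    def add_listener(self, listener):
        self._listeners.append(listener)
    def post(self, event):
        for listener in self._listeners:
            listener.on_stage_completed(event)

bus = ListenerBus()
metrics = MetricsListener()
bus.add_listener(metrics)
bus.post(StageCompleted(stage_id=1, duration_ms=420))
bus.post(StageCompleted(stage_id=2, duration_ms=135))
print(metrics.durations)  # {1: 420, 2: 135}
```

In real Spark you would subclass `SparkListener` and register it on the `SparkContext`; the point here is only that monitoring amounts to registering a small callback object on an event bus.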
Spark Streaming & Kafka - The Future of Stream Processing by Hari Shreedharan of... (Data Con LA)
Abstract:-
With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, and Kinesis, Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming, how it integrates natively with Kafka with no data loss, and how to achieve exactly-once processing!
Bio:-
Hari Shreedharan is a PMC member and committer on the Apache Flume Project. As a PMC member, he is involved in making decisions on the direction of the project. Author of the O’Reilly book Using Flume, Hari is also a software engineer at Cloudera, where he works on Apache Flume, Apache Spark, and Apache Sqoop. He also ensures that customers can successfully deploy and manage Flume, Spark, and Sqoop on their clusters, by helping them resolve any issues they are facing.
Spark Streaming & Kafka - The Future of Stream Processing (Jack Gudenkauf)
Hari Shreedharan/Cloudera @Playtika. With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, and Kinesis, Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming, how it integrates natively with Kafka with no data loss, and how to achieve exactly-once processing!
At Improve Digital we collect and store large volumes of machine-generated and behavioural data from our fleet of ad servers. For some time we have performed mostly batch processing through a data warehouse that combines traditional RDBMSs (MySQL), columnar stores (Infobright, Impala+Parquet) and Hadoop.
We wish to share our experiences in enhancing this capability with systems and techniques that process the data as streams in near-realtime. In particular we will cover:
• The architectural need for an approach to data collection and distribution as a first-class capability
• The different needs of the ingest pipeline required by streamed realtime data, the challenges faced in building these pipelines and how they forced us to start thinking about the concept of production-ready data.
• The tools we used, in particular Apache Kafka as the message broker, Apache Samza for stream processing and Apache Avro to allow schema evolution; an essential element to handle data whose formats will change over time.
• The unexpected capabilities enabled by this approach, including the value in using realtime alerting as a strong adjunct to data validation and testing.
• What this has meant for our approach to analytics and how we are moving to online learning and realtime simulation.
This is still a work in progress at Improve Digital with differing levels of production-deployed capability across the topics above. We feel our experiences can help inform others embarking on a similar journey and hopefully allow them to learn from our initiative in this space.
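On the Avro point above: schema evolution works because the reader's schema can supply defaults for fields the writer never knew about. A minimal pure-Python sketch of that resolution rule (illustrative only; the real Avro library does this during decoding, and `resolve`, the field names, and the schema dicts here are invented for the example):

```python
# Sketch of reader-side schema resolution with defaults, in the spirit
# of Avro schema evolution; illustrative pure Python, not the Avro API.
def resolve(record, reader_schema):
    """Project a written record onto the reader's schema:
    drop unknown fields, fill missing ones from defaults."""
    out = {}
    for field in reader_schema["fields"]:
        name, default = field["name"], field.get("default")
        if name in record:
            out[name] = record[name]
        elif default is not None:
            out[name] = default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out

# The v2 schema adds a 'region' field with a default, so v1 records
# written before the change still decode cleanly.
reader_v2 = {"fields": [
    {"name": "user_id"},
    {"name": "clicks"},
    {"name": "region", "default": "unknown"},
]}
old_record = {"user_id": 7, "clicks": 3}  # written with the v1 schema
print(resolve(old_record, reader_v2))
# {'user_id': 7, 'clicks': 3, 'region': 'unknown'}
```

This is why adding fields with defaults is the canonical backward-compatible change: old producers keep working while new consumers see a complete record.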
http://www.oreilly.com/pub/e/3764
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. Monal Daxini details how they used Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. He'll also share plans on offering Stream Processing as a Service for all of Netflix.
Netflix keystone streaming data pipeline @scale in the cloud-dbtb-2016 (Monal Daxini)
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. We will explore in detail how we leverage Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. We will also share our plans on offering Stream Processing as a Service for all of Netflix.
What no one tells you about writing a streaming app (hadooparchbook)
This document discusses 5 things that are often not addressed when writing streaming applications:
1. Managing and monitoring long-running streaming jobs can be challenging as frameworks were not originally designed for streaming workloads. Options include using cluster mode to ensure jobs continue if clients disconnect and leveraging monitoring tools to track metrics.
2. Preventing data loss requires different approaches depending on the data source. File and receiver-based sources benefit from checkpointing while Kafka's commit log ensures data is not lost.
3. Spark Streaming is well-suited for tasks involving windowing, aggregations, and machine learning but may not be needed for all streaming use cases.
4. Achieving exactly-once semantics requires techniques
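Point 2 above rests on replayability: with a log-based source like Kafka, recovery means committing positions only after processing, then resuming from the last committed position. A toy sketch of that commit-then-resume contract (all names are illustrative; this is not the Kafka client API):

```python
# Sketch of offset-based recovery with an at-least-once contract:
# commit the offset only AFTER the batch is processed, so a crash
# replays (rather than loses) in-flight records. Names are illustrative.
log = ["e0", "e1", "e2", "e3", "e4"]  # stand-in for a Kafka partition
committed = 0                          # durable "checkpoint" position
processed = []

def run(from_offset, crash_after=None):
    global committed
    n = 0
    for offset in range(from_offset, len(log)):
        if crash_after is not None and n == crash_after:
            return                     # simulate a crash mid-stream
        processed.append(log[offset])  # 1. process the record
        committed = offset + 1         # 2. only then advance the commit
        n += 1

run(committed, crash_after=2)          # crashes after e0, e1
run(committed)                         # restart resumes from the commit
print(processed)  # ['e0', 'e1', 'e2', 'e3', 'e4'] with no loss
```

Committing before processing would invert the failure mode: a crash would then skip records (at-most-once) instead of replaying them.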
What No One Tells You About Writing a Streaming App: Spark Summit East talk b... (Spark Summit)
So you know you want to write a streaming app, but any non-trivial streaming app developer has to think through these questions:
How do I manage offsets?
How do I manage state?
How do I make my Spark Streaming job resilient to failures? Can I avoid some failures?
How do I gracefully shut down my streaming job?
How do I monitor and manage (e.g., retry logic) my streaming job?
How can I better manage the DAG in my streaming job?
When should I use checkpointing, and for what? When should I not use it?
Do I need a WAL when using a streaming data source? Why? When don't I need one?
In this talk, we’ll share practices that no one talks about when you start writing your streaming app, but you’ll inevitably need to learn along the way.
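Several of these questions (offsets, state, failure recovery) converge on one well-known pattern: at-least-once delivery combined with an idempotent sink yields effectively-once results. A minimal sketch, with invented names, of why redelivery is harmless under keyed upserts:

```python
# At-least-once delivery + idempotent writes = effectively-once results.
# Keyed upserts make redelivered records harmless. Illustrative only.
class IdempotentSink:
    def __init__(self):
        self.store = {}
    def upsert(self, key, value):
        self.store[key] = value        # redelivery just overwrites

sink = IdempotentSink()
batch = [("user:7", 3), ("user:9", 1)]
for key, value in batch:
    sink.upsert(key, value)
# A failure before the offset commit redelivers the whole batch...
for key, value in batch:
    sink.upsert(key, value)
# ...but the stored result is unchanged:
print(sink.store)  # {'user:7': 3, 'user:9': 1}
```

The same reasoning is why append-only sinks are the hard case: without a natural key to upsert on, exactly-once needs transactions or deduplication instead.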
Intro to Apache Apex - Next Gen Platform for Ingest and Transform (Apache Apex)
Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch alerts, real-time actions, threat detection, etc.
Bio:
Pramod Immaneni is Apache Apex PMC member and senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platform and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs.
Similar to Performance Comparison of Streaming Big Data Platforms (20)
This document discusses running Apache Spark and Apache Zeppelin in production. It begins by introducing the author and their background. It then covers security best practices for Spark deployments, including authentication using Kerberos, authorization using Ranger/Sentry, encryption, and audit logging. Different Spark deployment modes like Spark on YARN are explained. The document also discusses optimizing Spark performance by tuning executor size and multi-tenancy. Finally, it covers security features for Apache Zeppelin like authentication, authorization, and credential management.
This document discusses Spark security and provides an overview of authentication, authorization, encryption, and auditing in Spark. It describes how Spark leverages Kerberos for authentication and uses services like Ranger and Sentry for authorization. It also outlines how communication channels in Spark are encrypted and some common issues to watch out for related to Spark security.
The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include:
- The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies.
- Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer.
- Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance.
- An initial minimum viable product will be built this year with the goal of an open, collaborative ecosystem around shared
This document discusses using a data science platform to enable digital diagnostics in healthcare. It provides an overview of healthcare data sources and Yale/YNHH's data science platform. It then describes the data science journey process using a clinical laboratory use case as an example. The goal is to use big data and machine learning to improve diagnostic reproducibility, throughput, turnaround time, and accuracy for laboratory testing by developing a machine learning algorithm and real-time data processing pipeline.
This document discusses using Apache Spark and MLlib for text mining on big data. It outlines common text mining applications, describes how Spark and MLlib enable scalable machine learning on large datasets, and provides examples of text mining workflows and pipelines that can be built with Spark MLlib algorithms and components like tokenization, feature extraction, and modeling. It also discusses customizing ML pipelines and the Zeppelin notebook platform for collaborative data science work.
This document compares the performance of Hive and Spark when running the BigBench benchmark. It outlines the structure and use cases of the BigBench benchmark, which aims to cover common Big Data analytical properties. It then describes sequential performance tests of Hive+Tez and Spark on queries from the benchmark using a HDInsight PaaS cluster, finding variations in performance between the systems. Concurrency tests are also run by executing multiple query streams in parallel to analyze throughput.
The document discusses modern data applications and architectures. It introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop provides massive scalability and easy data access for applications. The document outlines the key components of Hadoop, including its distributed storage, processing framework, and ecosystem of tools for data access, management, analytics and more. It argues that Hadoop enables organizations to innovate with all types and sources of data at lower costs.
This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.
This document provides an overview of Apache Spark, including its capabilities and components. Spark is an open-source cluster computing framework that allows distributed processing of large datasets across clusters of machines. It supports various data processing workloads including streaming, SQL, machine learning and graph analytics. The document discusses Spark's APIs like DataFrames and its libraries like Spark SQL, Spark Streaming, MLlib and GraphX. It also provides examples of using Spark for tasks like linear regression modeling.
This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.
Many organizations currently process various types of data in different formats. Most often this data will be in free form. As the consumers of this data grow, it's imperative that this free-flowing data adhere to a schema. It helps data consumers form an expectation about the type of data they are getting, and it shields them from immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the data pipeline a really easy way to integrate with and support various systems that use different data formats.
SchemaRegistry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume it without any impact if the schema changes. Users can tag different schemas and versions, register for notifications of schema changes by version, etc.
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.
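At its core, the registry described above is a versioned map with a compatibility gate at registration time. A toy in-memory sketch (the real Schema Registry is a networked service with a richer API; the class, method names, and the "new fields need defaults" rule here are a simplified stand-in for backward-compatibility checking):

```python
class SchemaRegistry:
    """Toy versioned schema store with a backward-compatibility check:
    a new version may add fields only if they carry defaults. This is a
    simplified stand-in, not the real Schema Registry API."""
    def __init__(self):
        self.versions = {}  # subject -> list of schema dicts

    def register(self, subject, schema):
        history = self.versions.setdefault(subject, [])
        if history:
            prev_fields = {f["name"] for f in history[-1]["fields"]}
            for f in schema["fields"]:
                if f["name"] not in prev_fields and "default" not in f:
                    raise ValueError(
                        f"incompatible: new field {f['name']!r} has no default")
        history.append(schema)
        return len(history)  # 1-based version number

reg = SchemaRegistry()
v1 = reg.register("clicks", {"fields": [{"name": "user_id"}]})
v2 = reg.register("clicks", {"fields": [{"name": "user_id"},
                                        {"name": "region", "default": "unknown"}]})
print(v1, v2)  # 1 2
```

Rejecting the incompatible schema at registration time is the whole point: producers fail fast at deploy, instead of consumers failing at read.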
There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted the model should take into consideration users who watch a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams are commonly believed to be more restricted and hence less accurate than batch-trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
Deep learning is not just hype - it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used for detecting anomalies on IoT sensor data streams at high speed, using DeepLearning4J on top of different big data engines like Apache Spark and Apache Flink. Key in this talk is the absence of any large training corpus, since we are using unsupervised machine learning - a domain that current DL research treats step-motherly. As we can see in this demo, LSTM networks can learn very complex system behavior - in this case, data coming from a physical model simulating bearing vibration data. One drawback of deep learning is that normally a very large labeled training data set is required. This is particularly interesting since we can show how unsupervised machine learning can be used in conjunction with deep learning - no labeled data set is necessary. We are able to detect anomalies and predict breaking bearings with ten-fold confidence. All examples and all code will be made publicly available and open-sourced. Only open source components are used.
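The talk's detector is an LSTM network; as a deliberately simpler stand-in for the same unsupervised idea (model normal behavior, flag large deviations, no labels needed), here is a rolling-mean residual check. This is not the method the talk uses, just an illustration of the principle:

```python
from collections import deque

def anomalies(stream, window=5, threshold=3.0):
    """Flag points deviating from the rolling mean by more than
    `threshold` times the rolling std. A deliberately simple stand-in
    for the unsupervised 'learn normal, flag deviation' idea; the talk
    itself uses LSTM networks, not this."""
    buf = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(stream):
        if len(buf) == window:
            mean = sum(buf) / window
            var = sum((v - mean) ** 2 for v in buf) / window
            std = var ** 0.5
            if std > 0 and abs(x - mean) > threshold * std:
                flagged.append(i)
        buf.append(x)
    return flagged

# Steady vibration-like signal with one spike at index 6:
signal = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 9.0, 1.0, 1.1]
print(anomalies(signal))  # [6]
```

An LSTM plays the same role as the rolling mean here, but learns a far richer model of "normal", so it can flag subtle temporal patterns a simple statistic misses.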
QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end users with business outcomes. This means that QE automation scenarios need to be detailed around actual use cases and cross-cutting components. The system tests potentially generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, false positives outnumber actual defects, and chasing them is generally wasteful.
At Hortonworks, we’ve designed and implemented an Automated Log Analysis System - Mool - using statistical data science and ML. The current work in progress has a batch data pipeline followed by an ensemble ML pipeline that feeds into the recommendation engine. The system identifies the root cause of test failures by correlating the failing test cases with current and historical error records across multiple components. The system works in unsupervised mode, with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file or reopen past tickets, and compares run profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for Big Data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
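The "intelligent key design" point usually comes down to avoiding region hotspots: monotonically increasing row keys (timestamps, sequence IDs) all land in the newest region, so a common remedy is a salt prefix. A sketch of modulo salting (the bucket count and key format are illustrative, not an HBase API):

```python
# Salting row keys so that monotonically increasing keys (e.g.
# timestamps) spread across regions instead of hammering one; the
# bucket count and key format here are illustrative, not an HBase API.
N_BUCKETS = 4

def salted_key(ts):
    bucket = ts % N_BUCKETS      # simple modulo salt; a hash also works
    return f"{bucket:02d}-{ts}"  # scans then need one pass per bucket

keys = [salted_key(ts) for ts in range(1000, 1008)]
buckets = {k.split("-")[0] for k in keys}
print(sorted(buckets))  # ['00', '01', '02', '03'], load spread over 4 buckets
```

The trade-off is exactly the one the abstract hints at: salting spreads write load evenly, but a time-range scan must now fan out across all buckets and merge results.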
There has been an explosion of data digitising our physical world – from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. Which means that, now, companies have new ways to transform their businesses – both operationally, and through their products and services – by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is “no” in most cases.
In this session, we’ll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and for the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity Scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently only supports MySQL Cluster as a backend. Hops opens up new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
While you might be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces you to the overarching issues and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly, and what is needed to implement and guarantee continuous operation of Hadoop cluster-based solutions.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Performance Comparison of Streaming Big Data Platforms
1. Performance Comparison of Streaming Big Data Platforms
Reza Farivar
Capital One Inc.
Kyle Knusbaum
Yahoo Inc.
2. Streaming Computation engines
• Designed to process a continuous stream of data.
• Designed to process data with low latency – data (ideally) doesn’t buffer up before
being processed. Contrasts with batch processing - MapReduce.
• Designed to handle big data. The systems are distributed by design.
3. • Apache Storm has the TopologyBuilder API to create a directed graph (topology) through
which streams of data flow.
• “Spouts” are the entry point to the graph, and “bolts” perform the processing.
• Data flows through the system as individual tuples.
• Graphs are not necessarily acyclic (although that is often the case)
(Diagram: a topology reading from a Kafka spout and writing to a database)
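The spout/bolt flow can be illustrated in plain Python (a conceptual stand-in, not the Storm Java `TopologyBuilder` API; the sample data and field names are illustrative):

```python
# Conceptual sketch of a Storm-style dataflow: a "spout" turns raw
# messages into tuples, and a "bolt" processes each tuple in turn.

def kafka_spout(raw_messages):
    """Spout: entry point to the graph; emits tuples downstream."""
    for msg in raw_messages:
        yield (msg["ad_id"], msg["ad_type"])

def filter_bolt(tuples, wanted="click"):
    """Bolt: processing node; keeps only tuples of the wanted type."""
    for ad_id, ad_type in tuples:
        if ad_type == wanted:
            yield ad_id

raw = [{"ad_id": "a1", "ad_type": "click"},
       {"ad_id": "a2", "ad_type": "view"},
       {"ad_id": "a1", "ad_type": "click"}]
clicks = list(filter_bolt(kafka_spout(raw)))  # tuples flow spout -> bolt
```

In real Storm the wiring between spouts and bolts is declared up front as a topology; here the generator chaining plays that role.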
4. • Apache Flink has the DataStream API to perform operations on streams of data. (map,
filter, reduce, join, etc.)
• These operations are turned into a graph at job submission time by Flink.
• Underlying graph works similarly to Storm’s model.
• Also supports a Storm-compatible API
(Diagram: Flink DataStream pipeline writing to a database)
5. • Apache Spark has the DStream API to perform operations on streams of data. (map,
filter, reduce, join, etc.) Based on Spark’s RDD (Resilient Distributed Dataset)
abstraction.
• Similar to Flink’s API.
• Streaming accomplished through micro-batches.
• Spark streaming job consists of one small batch after another.
(Diagram: Spark Streaming as a sequence of small RDD batches flowing to a database)
6. Benchmark
• We would like to compare the platforms, but which benchmark?
– How to compare the relative effectiveness of these systems?
• Throughput (events per second)
• End-to-end latency (How long for an event to get through the system)
• Completeness (Is the computation correct?)
– Current benchmarks did not test with workloads similar to a real world use
case
• Speed of light tests only reveal so much information
• So we created a new benchmark (on github)
– A simple advertisement counting application
– Mimic some common ETL operations on data streams
7. Our Streaming benchmark
• Goal is to correlate latency with throughput.
• Simulation of an advertisement analytics pipeline.
• Must be implemented and run in all three engines.
• Initial data:
– Some number of advertising campaigns.
– Some number of ads per campaign.
• Initial data stored in Redis.
• Our producers read the initial data, and start generating various events. (view, click, purchase)
• Events are then sent to a Kafka cluster.
(Diagram: benchmark event producer feeding events into Kafka)
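The producer side can be sketched roughly as follows. This is a hedged stand-in: the real producers read campaign data from Redis and send to a Kafka cluster, both of which are stubbed out here; only the field names (`ad_id`, `ad_type`, `event_time`) follow the benchmark's event schema.

```python
import json
import random
import time

# Sketch of an event producer: pick a random ad, pick one of the three
# event types (view, click, purchase), and serialize as JSON, as the
# events would be written into Kafka.

def make_event(ad_ids, now_ms):
    return json.dumps({
        "ad_id": random.choice(ad_ids),
        "ad_type": random.choice(["view", "click", "purchase"]),
        "event_time": now_ms,
    })

ad_ids = [f"ad-{i}" for i in range(10)]  # e.g. 10 ads per campaign
event = json.loads(make_event(ad_ids, int(time.time() * 1000)))
```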
9. Measuring Latency
– Windows periodically stored into Redis along with a timestamp of when the window
was written into Redis.
• Application given an SLA (Service-Level Agreement) as part of the simulation,
demanding that tuples be processed in under 1 second.
• The period of writes was chosen to meet the SLA. Writes to Redis were performed once per second; Spark was the exception, writing windows out once per batch.
10. Measuring Latency
• Ten second window
• First event generated
• 10 seconds of events – 10’s of thousands of events
per second
• Last event generated near end of window
• At some point later, the window is written into Redis.
• We know the time of the end of the window,
and the time the window was written.
• This time gives us a data point of latency – length of
time between event generation and being written in
database.
• Events processed late will cause their windows to be
written at a later time, and will be reflected in the
data.
(Timeline diagram: a 10 s window, from the 1st event in the window to the last event in the window; the window data is written into Redis some time later. The gap between the window's end and the Redis write is the latency data point, ideally less than the SLA.)
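The latency measurement described above amounts to one subtraction per window; a minimal sketch:

```python
# The latency data point for a window is the gap between the end of the
# window (when the last event was generated) and the time the window's
# counts were written into Redis.

WINDOW_MS = 10_000  # 10-second windows, as in the benchmark

def latency_ms(window_start_ms, write_time_ms):
    window_end = window_start_ms + WINDOW_MS
    return write_time_ms - window_end

# A window starting at t=0 whose data lands in Redis at t=10.8s yields
# an 800 ms latency point (under the 1 s SLA); late-processed events
# push the write time, and thus the latency, higher.
assert latency_ms(0, 10_800) == 800
```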
11. Our methodology
• Generate a particular throughput of events, then measure the latency.
– Throughputs measured varied between 50,000 events/s and 170,000 events/s
• 100 advertising campaigns
• 10 ads per campaign
• SLA set at 1 second
• 10 second windows
• 5 Kafka nodes with 5 topic partitions
• 1 Redis node
• 3 ZooKeeper nodes (cluster-coordination software)
• 10 worker nodes (doing computation)
• Handful of nodes used by the systems as masters, other non-compute servers.
12. Our methodology
1. Totally clear Kafka of data
2. Populate Redis with initial data
3. Launch the advertising analytics application on Spark, Flink, or Storm
4. Wait a bit for all workers to finish launching
5. Start up producers with instructions to produce tuples at a given rate – this rate determines the throughput.
– Ex: 5 producers writing 10,000 events per second generates a throughput of 50,000 events/s.
6. Let the system run for 30 minutes after starting the producers, then shut the producers down.
7. Run data gathering tool on the Redis database to generate latency points from the windows.
13. Hardware Setup
• Homogeneous nodes, each with two Intel E5530 @2.4GHz, 16 hyperthreading cores per
node
• 24GiB of memory
• Machines on the same rack
• Gigabit Ethernet switch
• The cluster has 40 nodes, 20-25 used in benchmark
• Multiple instances of Kafka producers to create load
– individual producers fall behind at around 17,000 events per second
• The use of 10 workers for a topology is near the average number we see being used by
topologies internal to Yahoo
– The Storm clusters are larger, but multi-tenant & run many topologies
14. About the implementations
• Apache Flink
– Tested 0.10.1-SNAPSHOT (commit hash 7364ce1).
– Application written in Java using the DataStream API.
– Checkpointing – a feature that guarantees at-least-once processing – was disabled.
• Apache Spark
– Tested version 1.5
– Application written in Scala using the DStreams API.
– At-least-once processing not implemented.
• Apache Storm
– Tested both versions 0.10 and 0.11-SNAPSHOT (commit hash a8d253a).
– Application written using the Java API.
– Acking provides at-least-once processing – turned off for high throughputs in 0.11-SNAPSHOT
15. Flink
• Most tuples finished within the 1-second SLA.
• Sharp curve indicates
there was a very small
number of straggling
tuples that were written
into Redis late.
• Red dots mark the 1st, 10th, 25th, 50th, 75th, 90th, 99th, and 100th percentiles.
16. Flink
Late Tuples
• Of late tuples, most were
written within a few
milliseconds of the SLA’s
deadline.
• This emphasizes only a
very small number were
significantly late.
• Beyond about 170,000 events/s, Flink was unable to handle the throughput, and tuples backed up.
17. Spark Streaming
• Benchmark written in Scala, using DStreams (a.k.a streaming RDDs) and direct
Kafka Consumer
• Micro-batching
– different than the pure streaming nature of Storm and Flink
– To meet 1 sec SLA, the batch duration was set to 1 second
• Forced to increase the batch duration for larger throughputs
• Transformations (e.g. maps and filters) applied on the Dstreams
• Joining data with Redis is a special case
– Don't create a separate connection to Redis for each record; instead, use a mapPartitions operation that gives our code control of a whole RDD partition
• Create one connection to Redis and use that single connection to query information from Redis for all the events in that RDD partition.
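The one-connection-per-partition idea can be sketched without Spark. `FakeRedis`, the ad-to-campaign table, and the sample partition are stand-ins; `mapPartitions` semantics are mimicked by handing the function a whole partition at once.

```python
# Sketch of the mapPartitions pattern: because the function receives the
# entire partition (not one record), a single connection can be opened
# once and reused for every record in that partition.

class FakeRedis:
    opened = 0  # counts how many connections were created

    def __init__(self, table):
        FakeRedis.opened += 1
        self.table = table

    def get(self, key):
        return self.table[key]

AD_TO_CAMPAIGN = {"ad-1": "c-1", "ad-2": "c-1", "ad-3": "c-2"}

def join_partition(partition):
    conn = FakeRedis(AD_TO_CAMPAIGN)  # one connection per partition...
    for ad_id, event_time in partition:
        # ...reused for every record's campaign lookup
        yield (conn.get(ad_id), ad_id, event_time)

part = [("ad-1", 100), ("ad-3", 101), ("ad-2", 102)]
joined = list(join_partition(part))
```

With a per-record connection, `opened` would equal the record count; here it stays at 1 no matter how many events the partition holds.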
18. Spark 2-dimensional Parameter Adjustment
• Micro-batch duration
– This is a control dimension that is not present in a pure streaming system like Storm
– Increasing the duration increases latency while reducing overhead and therefore increasing
maximum throughput
– Finding optimal batch duration that minimizes latency while allowing spark to handle the
throughput is a time consuming process
• Set a batch duration, run the benchmark for 30 minutes, check the results decrease/increase the
duration
• Parallelism
– increasing parallelism is easier said than done in Spark
– In a true streaming system like Storm, one bolt instance can send its results to any number of
subsequent bolt instances
– In a micro-batch system like Spark, increasing parallelism requires a reshuffle operation
• similar to how intermediate data in a Hadoop MapReduce program are shuffled and merged across the
cluster.
• But the reshuffling itself introduces considerable overhead.
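The trial-and-error batch-duration tuning described above can be sketched as a search loop. `run_benchmark` is a toy stand-in for a real 30-minute run, with an invented rule ("falls behind when batches are shorter than some threshold, latency grows with batch size"); the real process was manual.

```python
# Sketch of the tuning loop: set a batch duration, run the benchmark,
# and increase the duration until Spark keeps up with the throughput.

def run_benchmark(batch_s, min_stable_s=3):
    """Toy model of one 30-minute run at a given batch duration."""
    falls_behind = batch_s < min_stable_s   # too-short batches back up
    latency_s = batch_s * 1.5               # longer batches -> higher latency
    return falls_behind, latency_s

def tune_batch_duration(start_s=1, max_s=10):
    batch_s = start_s
    while batch_s <= max_s:
        falls_behind, latency = run_benchmark(batch_s)
        if not falls_behind:
            return batch_s, latency         # smallest duration that keeps up
        batch_s += 1                        # increase duration, try again
    raise RuntimeError("no stable batch duration found")

best, latency = tune_batch_duration()
```

The tension the slide describes is visible in the model: the smallest stable duration minimizes latency, but anything below it makes latency unbounded as tuples back up.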
19. Spark
• Spark had more
interesting results than
Flink.
• Due to the micro-batch
design, it was unable to
process events at low
latencies
• The overhead of
scheduling and
launching a task per
batch is very high
• Batch size had to be
increased – this
overcame the launch
overhead.
20. Spark
• If we reduce the batch
duration sufficiently, we
get into a region where
the incoming events are
processed within 3 or 4
subsequent batches.
• The system on the verge
of falling behind, but is
still manageable, and
results in better latency.
21. Spark
Falling behind
• Without increasing the
batch size, Spark was
unable to keep up with
the throughput, tuples
backed up, and latencies
continuously increased
until the job was shut
down.
• After increasing the
batch size, Spark handled
larger throughputs than
either Storm or Flink.
22. Spark
• Tuning the batch size was time-consuming, since it had to be done manually – this was one of the largest
problems we faced in testing Spark’s Streaming capabilities.
• If the batch size was set too high, latency numbers would be bad. If it was set too low, Spark would fall behind,
tuples would back up, and latency numbers would be worse.
• Spark had a new feature at the time called ‘backpressure’ which was supposed to help address this, but we were
unable to make it work properly. In fact, enabling backpressure hindered our numbers in all cases.
23. Storm Results
• Benchmark uses Java API, One worker process per host, each worker has 16 tasks to run in 16
executors - one for each core.
• In 0.11.0, Storm added a simple back pressure controller to avoid the overhead of acking
– In 0.10.0 benchmark topology, acking was used for flow control but not for processing guarantees.
• With acking disabled, Storm even beat Flink for latency at high throughput.
– But no tuple failure handling
(Graphs: Storm 0.10.0 vs. Storm 0.11.0)
24. Storm
• Storm behaved very
similarly to Flink.
• However, Storm was
unable to handle more
than 130,000 events/s
with its acking system
enabled.
• Acking keeps track of
successfully processed
events within Storm.
• With acking disabled,
Storm achieved numbers
similar to Flink at
throughputs up to
170,000 events/s.
25. Storm
Late Tuples
• Similar to Flink’s late
tuple graph.
• Tuples that were late
were slightly less late
than Flink’s.
26. Three-way Comparison
• Flink and Storm have
similar linear
performance profiles
– These two systems
process an incoming
event as it becomes
available
• Spark Streaming has
much higher latency,
but is expected to
handle higher
throughputs
– The system behaves as a step function, a direct result of its micro-batching nature
27. (Graph: 99th-percentile latency vs. throughput for Flink, Spark, and Storm)
• Comparisons of 99-th
percentile latencies are
revealing.
• Storm 0.11 consistently
lower latency than Flink
and Spark.
• Flink’s latency comparable
to Storm 0.10, but
handled higher
throughput with at-least-
once guarantees.
• Spark had the highest
latency, but was able to
handle higher throughput
than either Storm or Flink
28. Future work
• Many variables involved – many we didn’t adjust.
• Applications were not optimized – all were written in a fairly plain manner and configuration
settings were not tweaked
• SLA deadline of 1 second is very low. We did this to test the limits of the low-latency streaming
systems. Higher SLA deadlines are reasonable, and testing those would be worthwhile – likely
showing Spark being highly competitive with the others.
• The throughputs we tested at were incredibly high.
– 170,000 events/s comes to 14,688,000,000 events per day – about 1.4×10^10 events per day
• Didn’t test with exactly-once semantics.
• Ran small tests and checked for correctness of computations, but didn’t check correctness at
large scale.
• There are many more tests that can be run.
• Other streaming engines can be added.
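The per-day figure quoted for the highest throughput is easy to sanity-check:

```python
# 170,000 events/s sustained for a full day.
events_per_sec = 170_000
seconds_per_day = 24 * 60 * 60           # 86,400
events_per_day = events_per_sec * seconds_per_day
assert events_per_day == 14_688_000_000  # about 1.4e10 events per day
```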
29. Conclusions
• The competition between near real time streaming systems is
heating up, and there is no clear winner at this point
• Each of the platforms studied here have their advantages and
disadvantages
• Other important factors:
– Security or integration with tools and libraries
• Active communities for these and other big data processing
projects continue to innovate and benefit from each other’s
advancements
Editor's Notes
Streaming computation engines – what are they.
They are systems designed to process a continuous stream of data.
They are designed to have very low latency. What this means is that – ideally – data gets processed as soon as it reaches the system; it doesn’t buffer up.
This is in contrast to something like Hadoop’s MapReduce, where incoming data goes into a file somewhere, and every couple hours or so a job runs that processes it all in one big batch.
These are so-called “big-data” systems. They’re designed to be distributed and handle massive quantities of data.
We have three of them here that we’re going to look at today.
The first one we’re going to look at is Apache Storm.
Storm’s API gives users tools to create a directed graph, called a topology in Storm, through which data flows. Each node of this graph is a piece of user code that does some processing.
Nodes are either spouts or bolts. Spouts are the entry point to the graph, and bolts perform the processing.
The data moves through the system as individual tuples. It’s the job of the spout to take incoming data and turn it into tuples to pass on to the bolts.
Storm’s graphs are not necessarily acyclic – which is interesting. Most use cases we’ve seen seem to involve acyclic data flows, but it is possible to have cycles.
Flink!
Flink has its DataStream API to perform operations on streams of data, operations like map, filter, reduce, join and so on.
Instead of having the users construct a graph, users just describe what they want to happen to the data, and Flink builds a graph for them.
The underlying graph works very similarly to Storm’s
So similar, that Flink actually built a Storm-compatible API, and they claim you can run unmodified storm applications on Flink.
Spark Streaming!
Spark Streaming has the DStream API to perform operations on streams of data. It is based on Spark’s RDDs, or Resilient Distributed Datasets
The API is super similar to Flink’s
The underlying model, however, is very different than both Storm’s and Flink’s.
Spark’s streaming capabilities are accomplished through something called micro-batching.
Micro-batching is basically just running very small batch jobs in quick succession.
So each one of these RDDs down here would be a tiny batch of data in a spark streaming job.
We used our benchmark to correlate latency and throughput in the systems.
We simulated an advertisement analytics pipeline, which counts clicks in ad-campaigns.
The application needed to be implemented and run in all three engines.
We started out with some initial data, which were some number of advertising campaigns, and some number of ads in each campaign. We made these numbers adjustable.
-
The initial data we stored in a Redis instance.
-
We had some producer processes then read the initial data out of Redis, and begin generating various events for advertisements like views, clicks, and purchases.
-
These events it then sent into Kafka - Kafka is a distributed pub/sub system. Events go into Kafka from publishers and go out of Kafka to subscribers.
The application itself performs operations on each event, and they go like this:
First: deserialize the JSON string and turn it into a native data structure.
Second: Filter the events. We’re only counting clicks in this application, so we drop all events that don’t have an ad_type of “click”.
Third: We take what’s called a projection of the events – That just means we drop all of the fields in the tuple that we aren’t interested in. We’re left with just ad_id and event_time.
If you remember earlier I highlighted three fields that were important. We’re down to two important fields now because we already used ad_type and we’re done with it.
All of our events have the same ad_type now, so we can drop it.
Fourth: Go and pull the campaign_id associated with the ad_id out of Redis. This is part of the initial data that we put into Redis. Join this field into the tuple.
Fifth: Take a windowed count of events per campaign – so we keep track of how many clicks each campaign has gotten in each time window.
Last: Periodically write these windows into Redis – This will be the data we use to calculate latencies.
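The six operations above can be sketched end-to-end in a single process. This is a hedged stand-in: a dict replaces Redis, the window length and field names follow the benchmark, and everything else is illustrative.

```python
import json
from collections import Counter

# Per-event pipeline: deserialize -> filter to clicks -> project to
# (ad_id, event_time) -> join campaign_id -> windowed count per campaign.

WINDOW_MS = 10_000                         # 10-second windows
AD_TO_CAMPAIGN = {"ad-1": "c-1", "ad-2": "c-2"}  # stand-in for Redis

def process(raw_events):
    counts = Counter()
    for raw in raw_events:
        event = json.loads(raw)                          # 1. deserialize JSON
        if event["ad_type"] != "click":                  # 2. filter non-clicks
            continue
        ad_id, t = event["ad_id"], event["event_time"]   # 3. projection
        campaign = AD_TO_CAMPAIGN[ad_id]                 # 4. join campaign_id
        window = t - (t % WINDOW_MS)                     # truncate into a window
        counts[(campaign, window)] += 1                  # 5. windowed count
    return counts           # 6. these windows would be written into Redis

raw = [json.dumps({"ad_id": "ad-1", "ad_type": "click", "event_time": 1_000}),
       json.dumps({"ad_id": "ad-1", "ad_type": "view",  "event_time": 2_000}),
       json.dumps({"ad_id": "ad-2", "ad_type": "click", "event_time": 12_000})]
counts = process(raw)
```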
The system needs to be able to take late events into account – This is just a constraint we put on the application since it’s one we see often in the real world.
As I mentioned, the windows are periodically written into Redis along with a timestamp of when the window was written into Redis.
This last part is important. Each window has a timestamp like this, and it represents when that window was last written into Redis.
The application is given an SLA or Service-Level Agreement as part of the simulation, which says that tuples must be processed completely end-to-end in under 1 second.
This is just another constraint that we put on our application as part of simulating a real-world use case. The 1-second SLA is basically just a target end-to-end latency; it’s what the systems are trying to achieve.
To this end, we had the applications write their windows out once per second. Spark is the exception here. Its computation model doesn’t allow us to write windows out once per second. Instead, we write the windows out once per batch.
Now we actually get to look at how we acquire the latency data.
For our experiment we ran with 10-second windows.
-
In every window the first event is generated basically right when the window begins
-
After that, it’s 10 seconds of events – 10’s of thousands of events per second.
-
The last event is generated very near the end of the window – within microseconds before it.
The last event goes off to be processed…
-
Some time later, the window is written into Redis by the application.
-
Now, we know the time of the end of the window – where the last event was written, and we know the time when the window was written to Redis.
-
This gives us a latency data point. This chunk of time here is the amount of time that passed between the last event’s generation and when it was written into Redis. – This is the end-to-end latency of the application.
-
You can see how events that are processed late will cause their windows to be written at a later time, and will be reflected as higher end-to-end latency in the data.
So that’s how we measure latency. Next
Our methodology for testing was pretty simple.
We have our producers generate a certain event throughput, and then we measure the latency of tuples going through the system.
Throughputs measured varied between 50,000 events per second and 170,000 events per second.
We had…
Steps were:
Now we’re going to look at the benchmark results from each system.
-
First is Flink:
The version we tested was a 0.10.1-SNAPSHOT
We wrote the application using the Java DataStream API.
Checkpointing was disabled – so there were no processing guarantees.
-
Spark:
The version we tested was 1.5
We wrote this one in Scala using the DStreams API.
In addition, we did not implement at-least-once semantics.
-
Storm:
For storm, we tested both versions 0.10 and a 0.11-SNAPSHOT
Application written using the java TopologyBuilder API.
Storm’s acking provides at-least-once processing and flow control, but a new feature allowed us to turn that off for high throughputs in 0.11
Some things we noticed about flink:
Most of the tuples were processed within the 1-second SLA we specified.
The graph here shows percentiles - so the red dots in the middle there are the 50th-percentile mark; 50% of the tuples were in at about 0.75 seconds.
The sharp curve at the end is interesting – shows that a small number were quite late.
Here is a graph of the latency for late tuples in Flink.
Late tuples are the ones that finished processing after the 1 second SLA.
This graph emphasizes that most tuples were on time or very nearly on time. Only a small percentage were late by any significant amount.
Initially, we thought our operations were CPU-bound, and so the benefits of reshuffling to a higher number of partitions would outweigh the cost of reshuffling. Instead, we found the bottleneck to be scheduling, and so reshuffling only added overhead. We suspect that at higher throughput rates or with operations that are CPU-bound, the reverse would be true.
Spark was more difficult to get results out of, but the results were more interesting.
The micro-batching prevented Spark from being able to meet the 1 second SLA for anything but very low throughputs.
This was due to the large overhead of scheduling and launching a task for each micro-batch.
Once we increased the batch size, Spark was able to keep up with the various throughputs.
This graph shows a spark streaming job that’s keeping up with the throughput.
If we didn’t increase the batch size enough, Spark wasn’t able to keep up with the throughput, tuples got backed up and buffered in kafka, and the latency figures increased until the job was killed.
This is a graph of a spark streaming job that’s falling behind in its processing duties, and latencies have grown to almost 70 seconds.
However, after increasing the batch size enough, Spark was able to handle more throughput than either Storm or Flink.
So… Tuning the batch size was very time consuming and frustrating. It was a manual trial-and-error process and was a big obstacle while we were testing Spark.
If the batch size was too high, latency would be high, if batch size was set too low, Spark wouldn’t keep up with the throughput, tuples would back up, and latency would be even higher.
We were trying to get fair numbers out of Spark, so we didn’t just want to turn the batch size way up. We wanted to find the lowest latency we could get for a particular throughput.
When we benchmarked Spark, there was a new feature called “backpressure” which was supposed to help address this difficulty. We tried this, but unfortunately we were unable to get it to improve our latency or prevent Spark falling behind. Instead, Spark’s backpressure actually made our numbers worse whenever we enabled it.
Storm –
Storm had results very similar to Flink. The graphs look almost identical.
The problem we found with Storm was that beyond 130,000 events per second, Storm couldn't keep up with the throughput, tuples backed up, and latencies grew, just like in Spark.
This was caused by the acking system, which keeps track of successfully-processed events within storm, and performs flow control.
A new feature in 0.11 allowed us to disable acking, and it got numbers similar to Flink at throughputs up to 170,000 events per second.
Storm’s late tuple graph is, again, almost identical to Flink’s. There aren’t really any surprises here.
This is a graph comparing the 99-th percentile latencies of the various engines at different throughputs.
We can see Storm 0.11 has consistently lower latency than Flink and Spark.
Flink’s latency is comparable to Storm 0.10’s, but Flink was able to handle more throughput without falling over.
Spark had the highest latency by far, but was able to handle higher throughput than either Storm or Flink.
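For readers unfamiliar with the metric, a 99th-percentile latency like the ones plotted can be computed from raw latency samples with a nearest-rank calculation. This is a minimal sketch, not the benchmark’s own code; the sample values are made up for illustration:

```python
import math

def percentile(samples, p):
    # Nearest-rank percentile: the smallest sample such that at least
    # p% of all samples are less than or equal to it.
    xs = sorted(samples)
    rank = math.ceil(p / 100 * len(xs))
    return xs[rank - 1]

# Ten illustrative latency samples in milliseconds; one bad outlier
# is enough to dominate the p99.
latencies_ms = [120, 95, 300, 110, 105, 98, 2500, 115, 101, 99]
print(percentile(latencies_ms, 99))  # -> 2500
```

This is why p99 is a harsher yardstick than the mean: a single slow window drags it up, which is exactly the behavior the benchmark wants to surface.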
Future work!
So, there are a lot of variables involved and many of them we didn’t adjust.
We didn’t optimize any of the applications. They were written plainly and we didn’t really mess with the configs.
The SLA is important. An SLA of 1 second is extremely low. We did this to try to test the low-latency limits of the low-latency systems.
Many real SLAs are on the order of minutes, and it would be worthwhile to test with those SLAs.
We expect that Spark would be more competitive in these time frames.
The throughputs we tested were incredibly high. Our highest throughput of 170,000 events per second is equivalent to about 1.4 × 10¹⁰ events per day. Most workloads are many orders of magnitude less than that. Writing a benchmark that performs heavier computation on a smaller throughput might better reflect real workloads.
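The daily-rate figure follows from simple arithmetic, since there are 86,400 seconds in a day:

```python
events_per_sec = 170_000
seconds_per_day = 24 * 60 * 60      # 86,400 seconds in a day
events_per_day = events_per_sec * seconds_per_day
print(f"{events_per_day:.2e}")      # -> 1.47e+10, i.e. ~1.4 x 10^10 events/day
```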
We didn’t test exactly-once semantics. This is an important feature, and something that can add a lot of overhead. Testing competing implementations could yield interesting results.
Correctness. We ran some small tests for each of the systems to ensure they were processing data correctly, but we didn’t check correctness when running the benchmarks at full scale.
The project is open source, so you can go run your own tests; there are many, many more possible configurations.
That also means you can add an implementation for your favorite streaming engine. There are a few other popular ones out there.
How do we actually measure the latency?
We start by having the producers write into each event an integer timestamp representing the time of the event’s creation. This becomes the field event_time.
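As a sketch of what that stamping looks like on the producer side (the field name event_time comes from the talk; the other fields and the JSON layout here are illustrative assumptions, not the benchmark’s actual schema):

```python
import json
import time

def make_event(ad_id, event_type):
    # Stamp the event with its creation time. Downstream stages use
    # event_time to assign the event to a window and to compute
    # end-to-end latency.
    return json.dumps({
        "ad_id": ad_id,
        "event_type": event_type,
        "event_time": int(time.time()),  # integer seconds since the epoch
    })

event = json.loads(make_event("ad-42", "view"))
print(sorted(event))  # -> ['ad_id', 'event_time', 'event_type']
```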
We next need to understand how the windowing scheme works.
The window an event belongs in is determined by truncating the event_time of incoming tuples.
If these are timestamps representing seconds, what we have then are 10-second windows of events. So in our example window here, all events with timestamps in the range of 12340 – 12349 seconds will belong to the same window.
Window sizes can be adjusted by truncating more or fewer digits from the timestamps. If you cut off two digits, you end up with 100-second windows. If you don’t cut off any, you end up with 1-second windows.
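The truncation scheme above can be sketched in a few lines. This is a minimal illustration of the idea, not the benchmark’s code; the digits parameter (how many trailing digits to cut) is a name introduced here for clarity:

```python
def window_start(event_time, digits=1):
    # Truncate the last `digits` digits of a seconds timestamp:
    # one digit cut  -> 10-second windows,
    # two digits cut -> 100-second windows,
    # zero digits    -> 1-second windows.
    size = 10 ** digits
    return (event_time // size) * size

# All events with timestamps 12340-12349 land in the same 10-second window.
print(window_start(12345))      # -> 12340
print(window_start(12349))      # -> 12340
print(window_start(12345, 2))   # -> 12300 (100-second window)
```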