Apache Apex is an open source stream processing platform, built for large scale, high-throughput, low-latency, high availability and operability. With a unified architecture it can be used for real-time and batch processing. Apex is Java based and runs natively on Apache Hadoop YARN and HDFS.
We will discuss the key features of Apache Apex and architectural differences from similar platforms and how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, low latency SLA, high throughput and large scale ingestion.
Apex APIs and libraries of operators and examples focus on developer productivity. We will present the programming model with examples and how custom business logic can be easily integrated based on the Apex operator API.
We will cover integration with connectors to sources/destinations (including Kafka, JMS, SQL, NoSQL, files etc.), scalability with advanced partitioning, fault tolerance and processing guarantees, computation and scheduling model, state management, windowing and dynamic changes. Attendees will also learn how these features affect time to market and total cost of ownership and how they are important in existing Apex production deployments.
https://www.bigdataspain.org/
Ingestion and Dimensions Compute and Enrich using Apache Apex Apache Apex
Presenter: Devendra Tagare - DataTorrent Engineer, Contributor to Apex, Data Architect experienced in building high scalability big data platforms.
This talk will be a deep dive into ingesting unbounded file data and streaming data from Kafka into Hadoop. We will also cover data enrichment and dimensional compute, along with a customer use case and reference architecture.
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming Apache Apex
Presenter: Devendra Tagare - DataTorrent Engineer, Contributor to Apex, Data Architect experienced in building high scalability big data platforms.
Apache Apex is a next generation native Hadoop big data platform. This talk will cover details about how it can be used as a powerful and versatile platform for big data.
Apache Apex is a native Hadoop data-in-motion platform. We will discuss architectural differences between Apache Apex and Spark Streaming, and how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large scale ingestion.
We will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. We will also discuss how these features affect time to market and total cost of ownership.
This document provides an overview of building an Apache Apex application, including key concepts like DAGs, operators, and ports. It also includes an example "word count" application and demonstrates how to define the application and operators, and build Apache Apex from source code. The document outlines the sample application workflow and includes information on resources for learning more about Apache Apex.
Introduction to Apache Apex and writing a big data streaming application Apache Apex
Introduction to Apache Apex - The next generation native Hadoop platform, and writing a native Hadoop big data Apache Apex streaming application.
This talk will cover details about how Apex can be used as a powerful and versatile platform for big data. Apache Apex is being used in production by customers for both streaming and batch use cases. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch, alerts, real-time actions, threat detection, etc.
Presenter: Pramod Immaneni, Apache Apex PPMC member and senior architect at DataTorrent Inc, where he works on Apex and specializes in big data applications. Prior to DataTorrent he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs. Before that he was a technical co-founder of a mobile startup where he was an architect of a dynamic content rendering engine for mobile devices.
This is a video of the webcast of an Apache Apex meetup event organized by Guru Virtues at 267 Boston Rd no. 9, North Billerica, MA, on May 7th 2016 and broadcast from San Jose, CA. If you are interested in helping organize (i.e., hosting, presenting, or community leadership) for the Apache Apex community, please email apex-meetup@datatorrent.com
Big Data Berlin v8.0 Stream Processing with Apache Apex Apache Apex
This document discusses Apache Apex, an open source stream processing framework. It provides an overview of stream data processing and common use cases. It then describes key Apache Apex capabilities like in-memory distributed processing, scalability, fault tolerance, and state management. The document also highlights several customer use cases from companies like PubMatic, GE, and Silver Spring Networks that use Apache Apex for real-time analytics on data from sources like IoT sensors, ad networks, and smart grids.
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac Apache Apex
Apache Apex is a platform and runtime engine that enables development of scalable and fault-tolerant distributed applications on Hadoop in a native fashion. It processes streaming or batch big data with high throughput and low latency. Applications are built from operators that run distributed across a cluster and can scale up or down dynamically. Apex provides automatic recovery from failures without reprocessing and preserves state. It includes a library of common operators to simplify application development.
Architectural Comparison of Apache Apex and Spark Streaming Apache Apex
This presentation discusses architectural differences between Apache Apex and Spark Streaming. It discusses how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large scale ingestion.
Also, it will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. Further, it will discuss how these features affect time to market and total cost of ownership.
Presenter - Siyuan Hua, Apache Apex PMC Member & DataTorrent Engineer
Apache Apex provides a DAG construction API that gives developers full control over the logical plan. Some use cases don't require all of that flexibility, or at least so it may appear initially. Also, a large part of the audience may be more familiar with an API that has a more functional programming flavor, such as the Java 8 Stream interfaces and the Apache Flink and Spark Streaming APIs. Thus, to help Apex beginners get a simple first application running with a familiar API, we are now providing the Stream API on top of the existing DAG API. The Stream API is designed to be easy to use yet flexible to extend, and compatible with the native Apex API. This means developers can construct their application in a way similar to Flink or Spark but also have the power to fine-tune the DAG at will. Per our roadmap, the Stream API will closely follow the Apache Beam (aka Google Dataflow) model. In the future, you should be able to either easily run Beam applications with the Apex engine or express an existing application in a more declarative style.
DataTorrent Presentation @ Big Data Application Meetup Thomas Weise
The document introduces Apache Apex, an open source unified streaming and batch processing framework. It discusses how Apex integrates with native Hadoop components like YARN and HDFS. It then describes Apex's programming model using directed acyclic graphs of operators and streams to process data. The document outlines Apex's support for scaling applications through partitioning, windowing, fault tolerance, and guarantees on processing semantics. It provides an example of building an application pipeline and shows the logical and physical plans. In closing, it directs the reader to Apache Apex community resources for more information.
Smart Partitioning with Apache Apex (Webinar) Apache Apex
Processing big data often requires running the same computations in parallel in multiple processes or threads, called partitions, with each partition handling a subset of the data. This becomes all the more necessary when processing live data streams where maintaining SLA is paramount. Furthermore, multiple different computations make up an application and each of them may have different partitioning needs. Partitioning also needs to adapt to changing data rates, input sources and other application requirements like SLA.
In this talk, we will introduce how Apache Apex, a distributed stream processing platform on Hadoop, handles partitioning. We will look at different partitioning schemes provided by Apex, some of which are unique in this space. We will also look, with examples, at how Apex does dynamic partitioning, a feature unique to and pioneered by Apex, to handle varying data needs. We will also talk about the different utilities and libraries that Apex provides so that users can implement their own custom partitioning.
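To make the idea of partitions concrete, here is a minimal sketch in plain Java of key-based (hash) partitioning, one common scheme. It does not use the Apex API; the class and method names (`PartitionSketch`, `partitionFor`, `route`) are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hand-rolled hash partitioner, not the Apex partitioning API.
class PartitionSketch {
    /** Picks a partition for a key; floorMod keeps the index non-negative. */
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /** Splits tuples into partitionCount buckets, one bucket per partition. */
    static List<List<String>> route(List<String> keys, int partitionCount) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key, partitionCount)).add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("ad-1", "ad-2", "ad-1", "ad-3", "ad-2");
        // Routing again with a different count mimics repartitioning after a load change.
        System.out.println(route(keys, 2));
        System.out.println(route(keys, 4));
    }
}
```

Because the partition is derived from the key alone, all tuples with the same key land in the same partition, which is what lets each partition keep its own state. Changing the partition count moves keys between partitions, so a real system also has to redistribute that state.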
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex Apache Apex
This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.
Intro to Apache Apex - Next Gen Platform for Ingest and Transform Apache Apex
Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch, alerts, real-time actions, threat detection, etc.
Bio:
Pramod Immaneni is an Apache Apex PMC member and senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platforms and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs.
Apache Apex: Stream Processing Architecture and Applications Thomas Weise
Slides from http://www.meetup.com/Hadoop-User-Group-Munich/events/230313355/
This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.
From Batch to Streaming with Apache Apex Dataworks Summit 2017 Apache Apex
This document discusses transitioning from batch to streaming data processing using Apache Apex. It provides an overview of Apex and how it can be used to build real-time streaming applications. Examples are given of how to build an application that processes Twitter data streams and visualizes results. The document also outlines Apex's capabilities for scalable stream processing, queryable state, and its growing library of connectors and transformations.
- Apache Apex is a platform and framework for building highly scalable and fault-tolerant distributed applications on Hadoop.
- It allows developers to build any custom logic as distributed applications and ensures fault tolerance, scalability and data flow. Applications can process streaming or batch data with high throughput and low latency.
- Apex applications are composed of operators that perform processing on streams of data tuples. Operators can run in a distributed fashion across a cluster and automatically recover from failures without reprocessing data from the beginning.
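The recovery behavior described in the last bullet can be sketched as follows. This is a simplified plain-Java illustration of checkpoint-and-replay, assuming state is snapshotted at a window boundary and only the windows after the snapshot are replayed. The names (`CheckpointSketch`, `State`, `process`) are hypothetical, and the sketch is not a description of how the Apex engine is implemented.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: checkpoint at a window boundary, then replay from there after a failure.
class CheckpointSketch {
    /** Operator state: a running sum plus the id of the last fully processed window. */
    static final class State {
        final long sum;
        final int lastWindow;

        State(long sum, int lastWindow) {
            this.sum = sum;
            this.lastWindow = lastWindow;
        }
    }

    /** Processes the windows after state.lastWindow, up to and including upTo. */
    static State process(State state, List<int[]> windows, int upTo) {
        long sum = state.sum;
        for (int w = state.lastWindow + 1; w <= upTo; w++) {
            for (int value : windows.get(w)) {
                sum += value;
            }
        }
        return new State(sum, upTo);
    }

    public static void main(String[] args) {
        List<int[]> windows = Arrays.asList(
                new int[] {1, 2}, new int[] {3}, new int[] {4, 5}, new int[] {6});
        // Snapshot taken after window 1 has been fully processed.
        State checkpoint = process(new State(0, -1), windows, 1);
        // Simulated failure during window 2: restore the snapshot and replay windows 2..3 only.
        State recovered = process(checkpoint, windows, 3);
        System.out.println(recovered.sum); // prints 21
    }
}
```

The recovered sum matches what an uninterrupted run would produce, yet windows 0 and 1 are never read again after the failure.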
Low Latency Polyglot Model Scoring using Apache Apex Apache Apex
This document discusses challenges in building low-latency machine learning applications and how Apache Apex can help address them. It introduces Apache Apex as a distributed streaming engine and describes how it allows embedding models from frameworks like R, Python, H2O through custom operators. It provides various data and model scoring patterns in Apex like dynamic resource allocation, checkpointing, exactly-once processing to meet SLAs. The document also demonstrates techniques like canary deployment, dormant models, model ensembles through logical overlays on the Apex DAG.
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex Apache Apex
Stream data processing is becoming increasingly important to support business needs for faster time to insight and action with growing volume of information from more sources. Apache Apex (http://apex.apache.org/) is a unified big data in motion processing platform for the Apache Hadoop ecosystem. Apex supports demanding use cases with:
* Architecture for high throughput, low latency and exactly-once processing semantics.
* Comprehensive library of building blocks including connectors for Kafka, Files, Cassandra, HBase and many more
* Java based with unobtrusive API to build real-time and batch applications and implement custom business logic.
* Advanced engine features for auto-scaling, dynamic changes, compute locality.
Apex has been in development since 2012 and is used in production in various industries like online advertising, Internet of Things (IoT) and financial services.
Capital One's Next Generation Decision in less than 2 ms Apache Apex
This document discusses using Apache Apex for real-time decision making within 2 milliseconds. It provides performance benchmarks for Apex, showing average latency of 0.25ms for over 54 million events with 600GB of RAM. It compares Apex favorably to other streaming technologies like Storm and Flink, noting Apex's self-healing capabilities, independence of operators, and ability to meet latency and throughput requirements even during failures. The document recommends Apex for its maturity, fault tolerance, and ability to meet the goals of latency under 16ms, 99.999% availability, and scalability.
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare Apache Apex
The presentation covers how Apache Apex is used to deliver actionable insights in real-time for Ad-tech. It includes a reference architecture to provide dimensional aggregates on TB scale for billions of events per day. The reference architecture covers concepts around Apache Apex, with Kafka as source and dimensional compute. Slides from Devendra Tagare at Apache Big Data North America in Miami 2017.
David Yan offers an overview of Apache Apex, a stream processing engine used in production by several large companies for real-time data analytics.
Apache Apex uses a programming paradigm based on a directed acyclic graph (DAG). Each node in the DAG represents an operator, which can be data input, data output, or data transformation. Each directed edge in the DAG represents a stream, which is the flow of data from one operator to another.
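As a rough illustration of the operator-and-stream model, the sketch below wires a two-operator word count in plain Java. It deliberately avoids the real Apex classes: `DagSketch`, `Operator`, `addStream` and `wordCount` are hypothetical names, and a real Apex application would declare ports and be launched on a cluster rather than run in-process.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: a tiny in-process DAG, not the Apex DAG API.
class DagSketch {
    /** A DAG node: turns one input tuple into zero or more output tuples. */
    static final class Operator {
        final String name;
        final Function<String, List<String>> logic;
        final List<Operator> downstream = new ArrayList<>(); // outgoing streams

        Operator(String name, Function<String, List<String>> logic) {
            this.name = name;
            this.logic = logic;
        }

        /** Processes a tuple and forwards each result along every outgoing stream. */
        void process(String tuple) {
            for (String out : logic.apply(tuple)) {
                for (Operator next : downstream) {
                    next.process(out);
                }
            }
        }
    }

    /** Adds a directed edge (a stream) from one operator to another. */
    static void addStream(Operator from, Operator to) {
        from.downstream.add(to);
    }

    /** Builds splitter -> counter and pushes the given lines through it. */
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Operator splitter = new Operator("splitter",
                line -> Arrays.asList(line.trim().split("\\s+")));
        Operator counter = new Operator("counter", word -> {
            counts.merge(word, 1, Integer::sum);
            return Collections.<String>emptyList(); // sink: emits nothing
        });
        addStream(splitter, counter);
        for (String line : lines) {
            splitter.process(line);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("apex streams data", "apex scales")));
    }
}
```

Here the list of lines plays the role of the input operator and the returned map plays the role of the output. In a distributed engine, each edge would be a stream between processes rather than a direct method call.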
As part of Apex, the Malhar library provides a suite of connector operators so that Apex applications can read from or write to various data sources. It also includes utility operators that are commonly used in streaming applications, such as parsers, deduplicators and join, and generic building blocks that facilitate scalable state management and checkpointing.
In addition to processing based on ingression time and processing time, Apex supports event-time windows and session windows. It also supports windowing, watermarks, allowed lateness, accumulation mode, triggering, and retraction as detailed by Apache Beam. Further, it supports feedback loops in the DAG for iterative processing, and both at-least-once and "end-to-end" exactly-once processing guarantees. Apex provides various ways to fine-tune applications, such as operator partitioning, locality, and affinity.
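The interplay of event-time windows, watermarks and allowed lateness can be sketched in plain Java as below. This is a simplified model under stated assumptions: windows are fixed-size (tumbling), the watermark is simply the largest event time seen so far, and an event is dropped once its window end plus the allowed lateness is at or behind the watermark. Triggers, accumulation modes and retraction are not modeled, and the names (`EventTimeWindowSketch`, `countPerWindow`) are hypothetical rather than taken from Apex or Beam.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: tumbling event-time windows with a naive watermark.
class EventTimeWindowSketch {
    /** Start of the fixed-size window that contains the given event time. */
    static long windowStart(long eventTime, long size) {
        return eventTime - Math.floorMod(eventTime, size);
    }

    /** Counts events per window (window start -> count), in arrival order. */
    static Map<Long, Integer> countPerWindow(long[] eventTimes, long size, long allowedLateness) {
        Map<Long, Integer> counts = new TreeMap<>();
        long watermark = Long.MIN_VALUE; // assumption: largest event time seen so far
        for (long ts : eventTimes) {
            watermark = Math.max(watermark, ts);
            long start = windowStart(ts, size);
            // The window closes for good once the watermark passes its end plus the lateness.
            boolean tooLate = start + size + allowedLateness <= watermark;
            if (!tooLate) {
                counts.merge(start, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] arrivals = {1, 12, 3, 25, 8}; // event times, listed in arrival order
        System.out.println(countPerWindow(arrivals, 10, 5)); // prints {0=2, 10=1, 20=1}
    }
}
```

The event with time 3 arrives after the watermark has reached 12, but it is still counted because window [0, 10) stays open until the watermark reaches 15. The event with time 8 arrives when the watermark is 25 and is dropped.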
Apex is integrated with several open source projects, including Apache Beam, Apache Samoa (distributed machine learning), and Apache Calcite (SQL-based application specification). Users can choose Apex as the backend engine when running their application model based on these projects.
David explains how to develop fault-tolerant streaming applications with low latency and high throughput using Apex, presenting the programming model with examples and demonstrating how custom business logic can be integrated using both the declarative high-level API and the compositional DAG-level API.
Apache Apex is a stream processing framework that provides high performance, scalability, and fault tolerance. It uses YARN for resource management, can achieve single digit millisecond latency, and automatically recovers from failures without data loss through checkpointing. Apex applications are modeled as directed acyclic graphs of operators and can be partitioned for scalability. It has a large community of committers and is in the process of becoming a top-level Apache project.
A deep dive into how operators read from and write to files in an idempotent manner. This will cover the file input operator, file splitter and block reader on the input side and the file output operator on the output side. We will present how these operators are made scalable and fault tolerant with the hooks provided by the Apache Apex platform.
Will it Scale? The Secrets behind Scaling Stream Processing Applications Navina Ramesh
This talk was presented at the Apache Big Data North America 2016 conference, held in Vancouver, Canada (http://events.linuxfoundation.org/events/archive/2016/apache-big-data-north-america/program/schedule)
Extending The Yahoo Streaming Benchmark to Apache Apex Apache Apex
Extending Yahoo Streaming computation Benchmark to Apache Apex
- Application topology
- Comparison of results between Storm, Flink and Apex
- Variation of the Apex Benchmarking App with event time and 'results query' support
BigDataSpain 2016: Stream Processing Applications with Apache Apex Thomas Weise
Stream processing applications built on Apache Apex run on Hadoop clusters and typically power analytics use cases where availability, flexible scaling, high throughput, low latency and correctness are essential. These applications consume data from a variety of sources, including streaming sources like Apache Kafka, Kinesis or JMS, file based sources or databases. Processing results often need to be stored in external systems (sinks) for downstream consumers (pub-sub messaging, real-time visualization, Hive and other SQL databases etc.). Apex has the Malhar library with a wide range of connectors and other operators that are readily available to build applications. We will cover key characteristics like partitioning and processing guarantees, generic building blocks for new operators (write-ahead-log, incremental state saving, windowing etc.) and APIs for application specification.
Presenter - Siyuan Hua, Apache Apex PMC Member & DataTorrent Engineer
Apache Apex provides a DAG construction API that gives the developers full control over the logical plan. Some use cases don't require all of that flexibility, at least so it may appear initially. Also a large part of the audience may be more familiar with an API that exhibits more functional programming flavor, such as the new Java 8 Stream interfaces and the Apache Flink and Spark-Streaming API. Thus, to make Apex beginners to get simple first app running with familiar API, we are now providing the Stream API on top of the existing DAG API. The Stream API is designed to be easy to use yet flexible to extend and compatible with the native Apex API. This means, developers can construct their application in a way similar to Flink, Spark but also have the power to fine tune the DAG at will. Per our roadmap, the Stream API will closely follow Apache Beam (aka Google Data Flow) model. In the future, you should be able to either easily run Beam applications with the Apex Engine or express an existing application in a more declarative style.
DataTorrent Presentation @ Big Data Application MeetupThomas Weise
The document introduces Apache Apex, an open source unified streaming and batch processing framework. It discusses how Apex integrates with native Hadoop components like YARN and HDFS. It then describes Apex's programming model using directed acyclic graphs of operators and streams to process data. The document outlines Apex's support for scaling applications through partitioning, windowing, fault tolerance, and guarantees on processing semantics. It provides an example of building an application pipeline and shows the logical and physical plans. In closing, it directs the reader to Apache Apex community resources for more information.
Smart Partitioning with Apache Apex (Webinar)Apache Apex
Processing big data often requires running the same computations parallelly in multiple processes or threads, called partitions, with each partition handling a subset of the data. This becomes all the more necessary when processing live data streams where maintaining SLA is paramount. Furthermore, multiple different computations make up an application and each of them may have different partitioning needs. Partitioning also needs to adapt to changing data rates, input sources and other application requirements like SLA.
In this talk, we will introduce how Apache Apex, a distributed stream processing platform on Hadoop, handles partitioning. We will look at different partitioning schemes provided by Apex some of which are unique in this space. We will also look at how Apex does dynamic partitioning, a feature unique to and pioneered by Apex to handle varying data needs with examples. We will also talk about the different utilities and libraries that Apex provides for users to be able to affect their own custom partitioning.
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.
Intro to Apache Apex - Next Gen Platform for Ingest and TransformApache Apex
Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch alerts, real-time actions, threat detection, etc.
Bio:
Pramod Immaneni is Apache Apex PMC member and senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platform and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs.
Apache Apex: Stream Processing Architecture and ApplicationsThomas Weise
Slides from http://www.meetup.com/Hadoop-User-Group-Munich/events/230313355/
This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.
From Batch to Streaming with Apache Apex Dataworks Summit 2017 (Apache Apex)
This document discusses transitioning from batch to streaming data processing using Apache Apex. It provides an overview of Apex and how it can be used to build real-time streaming applications. Examples are given of how to build an application that processes Twitter data streams and visualizes results. The document also outlines Apex's capabilities for scalable stream processing, queryable state, and its growing library of connectors and transformations.
- Apache Apex is a platform and framework for building highly scalable and fault-tolerant distributed applications on Hadoop.
- It allows developers to build any custom logic as distributed applications and ensures fault tolerance, scalability and data flow. Applications can process streaming or batch data with high throughput and low latency.
- Apex applications are composed of operators that perform processing on streams of data tuples. Operators can run in a distributed fashion across a cluster and automatically recover from failures without reprocessing data from the beginning.
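The recovery behaviour described in the last bullet can be sketched in plain Java. This is a minimal illustration, not the Apex API: the class names, the `interval` parameter and the in-memory list of windows are all hypothetical, and the sketch assumes the input windows can be replayed after a failure.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of checkpoint-based recovery (hypothetical names, not the Apex API). */
class CheckpointSketch {

    /** Stateful operator: counts occurrences of each word. */
    static class WordCounter {
        final Map<String, Integer> counts;
        WordCounter(Map<String, Integer> initialState) { this.counts = new HashMap<>(initialState); }
        void process(String word) { counts.merge(word, 1, Integer::sum); }
    }

    /** Snapshot of operator state taken after a given window was fully processed. */
    static class Checkpoint {
        final int windowId;
        final Map<String, Integer> state;
        Checkpoint(int windowId, Map<String, Integer> state) {
            this.windowId = windowId;
            this.state = new HashMap<>(state); // copy, so later processing cannot alter the snapshot
        }
    }

    /**
     * Processes windows in order and checkpoints every `interval` windows (interval must be > 0).
     * If failAfterWindow >= 0, the operator is discarded once after that window, restored from
     * the last checkpoint, and only the windows after the checkpoint are replayed.
     */
    static Map<String, Integer> run(List<List<String>> windows, int interval, int failAfterWindow) {
        WordCounter op = new WordCounter(new HashMap<>());
        Checkpoint last = new Checkpoint(-1, op.counts); // empty state before the first window
        boolean failed = false;
        int w = 0;
        while (w < windows.size()) {
            for (String word : windows.get(w)) op.process(word);
            if ((w + 1) % interval == 0) last = new Checkpoint(w, op.counts);
            if (!failed && w == failAfterWindow) {
                failed = true;
                op = new WordCounter(last.state); // restore state from the snapshot
                w = last.windowId + 1;            // replay from the window after the checkpoint
                continue;
            }
            w++;
        }
        return op.counts;
    }

    public static void main(String[] args) {
        List<List<String>> windows = List.of(List.of("a", "b"), List.of("a"), List.of("b", "b"), List.of("a"));
        System.out.println(run(windows, 2, -1)); // no failure
        System.out.println(run(windows, 2, 2));  // failure after window 2, same final counts
    }
}
```

Both runs end with the same counts: the restored operator resumes from the snapshot taken after window 1 and replays only window 2 onward, rather than starting again from window 0.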
Low Latency Polyglot Model Scoring using Apache Apex (Apache Apex)
This document discusses challenges in building low-latency machine learning applications and how Apache Apex can help address them. It introduces Apache Apex as a distributed streaming engine and describes how it allows embedding models from frameworks like R, Python, H2O through custom operators. It provides various data and model scoring patterns in Apex like dynamic resource allocation, checkpointing, exactly-once processing to meet SLAs. The document also demonstrates techniques like canary deployment, dormant models, model ensembles through logical overlays on the Apex DAG.
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex (Apache Apex)
Stream data processing is becoming increasingly important to support business needs for faster time to insight and action with growing volume of information from more sources. Apache Apex (http://apex.apache.org/) is a unified big data in motion processing platform for the Apache Hadoop ecosystem. Apex supports demanding use cases with:
* Architecture for high throughput, low latency and exactly-once processing semantics.
* Comprehensive library of building blocks including connectors for Kafka, Files, Cassandra, HBase and many more
* Java based with unobtrusive API to build real-time and batch applications and implement custom business logic.
* Advanced engine features for auto-scaling, dynamic changes, compute locality.
Apex has been in development since 2012 and is used in production in various industries like online advertising, Internet of Things (IoT) and financial services.
Capital One's Next Generation Decision in less than 2 ms (Apache Apex)
This document discusses using Apache Apex for real-time decision making within 2 milliseconds. It provides performance benchmarks for Apex, showing average latency of 0.25ms for over 54 million events with 600GB of RAM. It compares Apex favorably to other streaming technologies like Storm and Flink, noting Apex's self-healing capabilities, independence of operators, and ability to meet latency and throughput requirements even during failures. The document recommends Apex for its maturity, fault tolerance, and ability to meet the goals of latency under 16ms, 99.999% availability, and scalability.
Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare (Apache Apex)
The presentation covers how Apache Apex is used to deliver actionable insights in real-time for Ad-tech. It includes a reference architecture to provide dimensional aggregates on TB scale for billions of events per day. The reference architecture covers concepts around Apache Apex, with Kafka as source and dimensional compute. Slides from Devendra Tagare at Apache Big Data North America in Miami 2017.
David Yan offers an overview of Apache Apex, a stream processing engine used in production by several large companies for real-time data analytics.
Apache Apex uses a programming paradigm based on a directed acyclic graph (DAG). Each node in the DAG represents an operator, which can be data input, data output, or data transformation. Each directed edge in the DAG represents a stream, which is the flow of data from one operator to another.
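The operator-and-stream model can be illustrated with a small, self-contained Java sketch. It deliberately does not use the Apex API: `Operator`, `connect` and `run` are hypothetical names chosen for this illustration, and tuples are passed by direct method calls within a single thread rather than across a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

/** Toy model of a DAG of operators connected by streams (not the Apex API). */
class DagSketch {

    /** A node in the DAG: receives a tuple and emits zero or more tuples downstream. */
    static class Operator<I, O> implements Consumer<I> {
        private final Function<I, List<O>> logic;
        private final List<Consumer<O>> downstream = new ArrayList<>();

        Operator(Function<I, List<O>> logic) { this.logic = logic; }

        /** Adds a directed edge (a stream) from this operator to the next one. */
        <X> Operator<O, X> connect(Operator<O, X> next) {
            downstream.add(next);
            return next;
        }

        @Override
        public void accept(I tuple) {
            for (O out : logic.apply(tuple)) {
                for (Consumer<O> d : downstream) d.accept(out);
            }
        }
    }

    /** Wires split -> upper-case -> output and returns what the output operator collected. */
    static List<String> run(List<String> lines) {
        List<String> sink = new ArrayList<>();
        Operator<String, String> split = new Operator<String, String>(line -> List.of(line.split(" ")));
        Operator<String, String> upper = new Operator<String, String>(word -> List.of(word.toUpperCase()));
        Operator<String, Void> output = new Operator<String, Void>(word -> { sink.add(word); return List.of(); });
        split.connect(upper).connect(output);
        lines.forEach(split); // the caller plays the role of the input operator
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hello apex", "streams as edges")));
        // prints [HELLO, APEX, STREAMS, AS, EDGES]
    }
}
```

Here the transformation operators are `split` and `upper`, the data output operator is `output`, and each `connect` call adds one directed edge.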
As part of Apex, the Malhar library provides a suite of connector operators so that Apex applications can read from or write to various data sources. It also includes utility operators that are commonly used in streaming applications, such as parsers, deduplicators and join, and generic building blocks that facilitate scalable state management and checkpointing.
In addition to processing based on ingression time and processing time, Apex supports event-time windows and session windows. It also supports windowing, watermarks, allowed lateness, accumulation mode, triggering, and retraction detailed by Apache Beam as well as feedback loops in the DAG for iterative processing and at-least-once and “end-to-end” exactly-once processing guarantees. Apex provides various ways to fine-tune applications, such as operator partitioning, locality, and affinity.
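As a rough illustration of event-time windows, watermarks and allowed lateness, the following plain-Java sketch counts events in fixed-size event-time windows. It is a simplification under stated assumptions and not the Apex or Beam API: the class and method names are hypothetical, each window is emitted exactly once (when the watermark passes the window end plus the allowed lateness), and triggering, accumulation modes and retraction are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy event-time tumbling-window counter (hypothetical names, not the Apex API). */
class EventTimeWindowSketch {
    private final long windowMillis;
    private final long allowedLatenessMillis;
    private final TreeMap<Long, Long> openWindows = new TreeMap<>(); // window start -> event count
    private final List<String> emitted = new ArrayList<>();
    private long watermark = Long.MIN_VALUE;

    EventTimeWindowSketch(long windowMillis, long allowedLatenessMillis) {
        this.windowMillis = windowMillis;
        this.allowedLatenessMillis = allowedLatenessMillis;
    }

    /** Assigns the event to its window by event time; returns false if it arrived too late. */
    boolean onEvent(long eventTime) {
        long start = Math.floorDiv(eventTime, windowMillis) * windowMillis;
        long end = start + windowMillis;
        if (end + allowedLatenessMillis <= watermark) return false; // window already closed
        openWindows.merge(start, 1L, Long::sum);
        return true;
    }

    /** Advances the watermark and emits every window that can no longer accept data. */
    void onWatermark(long newWatermark) {
        watermark = Math.max(watermark, newWatermark);
        while (!openWindows.isEmpty()) {
            Map.Entry<Long, Long> first = openWindows.firstEntry();
            long end = first.getKey() + windowMillis;
            if (end + allowedLatenessMillis > watermark) break; // still open
            emitted.add("[" + first.getKey() + "," + end + ")=" + first.getValue());
            openWindows.pollFirstEntry();
        }
    }

    List<String> emitted() { return emitted; }

    public static void main(String[] args) {
        EventTimeWindowSketch w = new EventTimeWindowSketch(10, 5);
        w.onEvent(1); w.onEvent(4); w.onEvent(12);
        w.onWatermark(12);
        w.onEvent(3);     // late, but within the allowed lateness of window [0,10)
        w.onWatermark(15);
        w.onEvent(2);     // too late: window [0,10) has been emitted
        w.onWatermark(30);
        System.out.println(w.emitted()); // prints [[0,10)=3, [10,20)=1]
    }
}
```

The event at time 3 arrives after the watermark has passed 10 but is still counted because the window stays open until the watermark reaches 15. The event at time 2 arrives after that point and is dropped.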
Apex is integrated with several open source projects, including Apache Beam, Apache Samoa (distributed machine learning), and Apache Calcite (SQL-based application specification). Users can choose Apex as the backend engine when running their application model based on these projects.
David explains how to develop fault-tolerant streaming applications with low latency and high throughput using Apex, presenting the programming model with examples and demonstrating how custom business logic can be integrated using both the declarative high-level API and the compositional DAG-level API.
Apache Apex is a stream processing framework that provides high performance, scalability, and fault tolerance. It uses YARN for resource management, can achieve single digit millisecond latency, and automatically recovers from failures without data loss through checkpointing. Apex applications are modeled as directed acyclic graphs of operators and can be partitioned for scalability. It has a large community of committers and is in the process of becoming a top-level Apache project.
A deep dive into how operators read from and write to files in an idempotent manner. This will cover the file input operator, file splitter and block reader on the input side and the file output operator on the output side. We will present how these operators are made scalable and fault tolerant with the hooks provided by the Apache Apex platform.
Will it Scale? The Secrets behind Scaling Stream Processing Applications (Navina Ramesh)
This talk was presented at the Apache Big Data North America 2016 conference, held in Vancouver, Canada (http://events.linuxfoundation.org/events/archive/2016/apache-big-data-north-america/program/schedule)
Extending The Yahoo Streaming Benchmark to Apache Apex (Apache Apex)
Extending Yahoo Streaming computation Benchmark to Apache Apex
- Application topology
- Comparison of results between Storm, Flink and Apex
- Variation of the Apex Benchmarking App with event time and 'results query' support
BigDataSpain 2016: Stream Processing Applications with Apache Apex (Thomas Weise)
Stream processing applications built on Apache Apex run on Hadoop clusters and typically power analytics use cases where availability, flexible scaling, high throughput, low latency and correctness are essential. These applications consume data from a variety of sources, including streaming sources like Apache Kafka, Kinesis or JMS, file based sources or databases. Processing results often need to be stored in external systems (sinks) for downstream consumers (pub-sub messaging, real-time visualization, Hive and other SQL databases etc.). Apex has the Malhar library with a wide range of connectors and other operators that are readily available to build applications. We will cover key characteristics like partitioning and processing guarantees, generic building blocks for new operators (write-ahead-log, incremental state saving, windowing etc.) and APIs for application specification.
Apache Apex: Stream Processing Architecture and Applications (Comsysto Reply GmbH)
• Architecture highlights: high throughput, low-latency, operability with stateful fault tolerance, strong processing guarantees, auto-scaling etc
• Application development model, unified approach for real-time and batch use cases
• Tools for ease of use, ease of operability and ease of management
• How customers use Apache Apex in production
Stream data from Apache Kafka for processing with Apache Apex (Apache Apex)
Meetup presentation: How Apache Apex consumes from Kafka topics for real-time processing and analytics. Learn about features of the Apex Kafka Connector, which is one of the most popular operators in the Apex Malhar operator library and powers several production use cases. We explain the advanced features this operator provides for high throughput, low latency ingest and how it enables fault tolerant topologies with exactly-once processing semantics.
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex (Apache Apex)
Apache Apex is a next gen big data analytics platform. Originally developed at DataTorrent it comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn about the Apex architecture, including its unique features for scalability, fault tolerance and processing guarantees, programming model and use cases.
http://apachebigdata2016.sched.org/event/6M0L/next-gen-big-data-analytics-with-apache-apex-thomas-weise-datatorrent
Next Gen Big Data Analytics with Apache Apex discusses Apache Apex, an open source stream processing framework. It provides an overview of Apache Apex's capabilities for processing continuous, real-time data streams at scale. Specifically, it describes how Apache Apex allows for in-memory, distributed stream processing using a programming model of operators in a directed acyclic graph. It also covers Apache Apex's features for fault tolerance, dynamic scaling, and integration with Hadoop and YARN.
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ... (Dataconomy Media)
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder of DataTorrent presented "Streaming Analytics with Apache Apex" as part of the Big Data, Berlin v 8.0 meetup organised on the 14th of July 2016 at the WeWork headquarters.
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform (Apache Apex)
Internet of Things (IoT) devices are becoming more ubiquitous in consumer, business and industrial landscapes. They are being widely used in applications ranging from home automation to the industrial internet. They pose a unique challenge in terms of the volume of data they produce, and the velocity with which they produce it, and the variety of sources they need to handle. The challenge is to ingest and process this data at the speed at which it is being produced in a real-time and fault tolerant fashion. Apache Apex is an industrial grade, scalable and fault tolerant big data processing platform that runs natively on Hadoop. In this deck, you will see how Apex is being used in IoT applications and also see how the enterprise features such as dimensional analytics, real-time dashboards and monitoring play a key role.
Presented by Pramod Immaneni, Principal Architect at DataTorrent and Apache Apex PPMC member, in a BrightTALK webinar on April 6, 2016.
Real Time Insights for Advertising Tech (Apache Apex)
A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world. In batch data processing, data is collected at different geographic locations and processed at regular intervals. This approach introduces a delay of at least one hour before an event is accounted for.
The goal of real-time streaming was to give publishers, Demand Side Platforms (DSPs) and agencies actionable insights within a few minutes of event generation.
This Ad Tech company uses DataTorrent RTS powered by Apex for:
• Real time reporting
• Resource monitoring
• Real time learning
• Allocation engine
Tushar Gosavi from DataTorrent will take the audience through the architecture, custom operators developed, use cases for real time and the challenges involved in implementing streaming systems at scale where multiple data centers are in play.
Tushar is a Senior Engineer at DataTorrent and has worked in distributed systems and storage domains.
SnappyData is a new open source project started by Pivotal GemFire founders to build a unified cluster capable of OLTP, OLAP, and streaming analytics using Spark. SnappyData fuses an elastic, highly available in-memory store for OLTP with Spark's memory manager and query engine to provide a single system for mixed workloads with fast ingestion, high concurrency and the ability to work with live, mutable data.
Apache Apex Fault Tolerance and Processing Semantics (Apache Apex)
Components of an Apex application running on YARN, how they are made fault tolerant, how checkpointing works, recovery from failures, incremental recovery, processing guarantees.
Low latency high throughput streaming using Apache Apex and Apache Kudu (DataWorks Summit)
True streaming is fast becoming a necessity for many business use cases. At the same time, data set sizes and volumes are growing exponentially, compounding the complexity of data processing pipelines. There is a need for true low-latency streaming coupled with very high-throughput data processing. Apache Apex, as a low-latency, high-throughput data processing framework, and Apache Kudu, as a high-throughput store, form a combination that addresses this pattern efficiently.
This session will walk through a use case that involves writing a high-throughput stream using Apache Kafka, Apache Apex and Apache Kudu. The session will start with a general overview of Apache Apex and the capabilities that form the foundation for a low-latency, high-throughput engine, with Apache Kafka as an example input source of streams. We then walk through the Kudu integration with Apex using patterns such as end-to-end exactly-once processing, selective column writes and timestamp propagation for out-of-band data. The session will also cover additional patterns that this integration supports for enterprise-level data processing pipelines.
The session will conclude with latency and throughput metrics for the use case presented.
Speaker
Ananth Gundabattula, Senior Architect, Commonwealth Bank of Australia
SnappyData, the Spark Database. A unified cluster for streaming, transactions... (SnappyData)
Apache Spark 2.0 offers many enhancements that make continuous analytics quite simple. In this talk, we will discuss many other things that you can do with your Apache Spark cluster. We explain how a deep integration of Apache Spark 2.0 and in-memory databases can bring you the best of both worlds. In particular, we discuss how to manage mutable data in Apache Spark, run consistent transactions at the same speed as state-of-the-art in-memory grids, build and use indexes for point lookups, and run 100x more analytics queries at in-memory speeds. There is no need to bridge multiple products or to manage and tune multiple clusters. We explain how one can take regular Apache Spark SQL OLAP workloads and speed them up by up to 20x using optimizations in SnappyData.
We then walk through several use-case examples, including IoT scenarios, where one has to ingest streams from many sources, cleanse them, manage the deluge by pre-aggregating and tracking metrics per minute, store all recent data in an in-memory store along with history in a data lake, and permit interactive analytic queries on this constantly growing data. Rather than stitching together multiple clusters as proposed in Lambda, we walk through a design where everything is achieved in a single, horizontally scalable Apache Spark 2.0 cluster. The design is simpler, a lot more efficient, and lets you do everything from Machine Learning and Data Science to Transactions and Visual Analytics in one single cluster.
Why does big data always have to go through a pipeline, with multiple data copies and slow, complex and stale analytics? We present a unified analytics platform that brings streaming, transactions and ad hoc OLAP-style interactive analytics together in a single in-memory cluster based on Spark.
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance... (ScyllaDB)
Discover how to avoid common pitfalls when shifting to an event-driven architecture (EDA) in order to boost system recovery and scalability. We cover Kafka Schema Registry, in-broker transformations, event sourcing, and more.
MySQL Cluster Carrier Grade Edition is a real-time database designed for the telecom industry that provides the flexibility of a relational database with the cost savings of open source. It is suited for large carriers and operators and uses a distributed, synchronous storage architecture with automated failover capability. It offers high performance, scalability and availability across geographies through asynchronous data replication between clusters.
Azure Event Hubs - Behind the Scenes With Kasun Indrasiri | Current 2022 (HostedbyConfluent)
Azure Event Hubs is a hyperscale PaaS event stream broker with protocol support for HTTP, AMQP, and Apache Kafka RPC that accepts and forwards several trillion (!) events per day and is available in all global Azure regions. This session is a look behind the curtain where we dive deep into the architecture of Event Hubs and look at the Event Hubs cluster model, resource isolation, and storage strategies and also review some performance figures.
Lessons learned from embedding Cassandra in xPatterns (Claudiu Barbura)
The document discusses lessons learned from embedding Cassandra in the xPatterns big data analytics platform. It provides an agenda that includes discussing Cassandra usage in xPatterns, the necessary developments like data modeling optimizations, robust REST APIs, geo-replication, and a demo of exporting to NoSQL APIs. Key lessons learned since Cassandra versions 0.6 to 2.0.6 are also summarized, such as the need for consistent clocks, reducing column families, and monitoring.
The document discusses Cassandra and the xPatterns architecture. It describes exporting data from HDFS/Hive/Shark to Cassandra using a custom Spark job and generating REST APIs. A demo of a referral provider network dashboard is shown, which was built using exported Cassandra data to analyze medical records and provider relationships. Lessons learned from optimizing Cassandra performance from versions 0.6 to 2.0.6 are also discussed.
Learn how Aerospike's Hybrid Memory Architecture brings transactions and analytics together to power real-time Systems of Engagement (SOEs) for companies across AdTech, financial services, telecommunications, and eCommerce. We take a deep dive into the architecture including use cases, topology, Smart Clients, XDR and more. Aerospike delivers predictable performance, high uptime and availability at the lowest total cost of ownership (TCO).
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxSitimaJohn
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
How to Get CNIC Information System with Paksim Ga.pptx
BigDataSpain 2016: Introduction to Apache Apex
1. Introduction to Apache Apex
Thomas Weise <thw@apache.org> @thweise
PMC Chair Apache Apex, Architect DataTorrent
Big Data Spain, Madrid, Nov 17th 2016
2. Stream Data Processing
[Diagram: data sources (events, logs, sensor data, social, databases, CDC) feed a chain of operators (Oper1, Oper2, Oper3) that covers data delivery and transform / analytics, with output to real-time visualization and other destinations. Layers shown: SQL, declarative API, DAG API, operator library, SAMOA, Beam (roadmap).]
3. Industries & Use Cases
• Financial Services
ᵒ Fraud and risk monitoring
ᵒ Credit risk assessment
ᵒ Improve turnaround time of trade settlement processes
• Ad-Tech
ᵒ Real-time customer facing dashboards on key performance indicators
ᵒ Click fraud detection
ᵒ Billing optimization
• Telecom
ᵒ Call detail record (CDR) & extended data record (XDR) analysis
ᵒ Understanding customer behavior AND context
ᵒ Packaging and selling anonymous customer data
• Manufacturing
ᵒ Supply chain planning & optimization
ᵒ Preventive maintenance
ᵒ Product quality & defect tracking
• Energy
ᵒ Smart meter analytics
ᵒ Reduce outages & improve resource utilization
ᵒ Asset & workforce management
• IoT
ᵒ Data ingestion and processing
ᵒ Predictive analytics
ᵒ Data governance
• Horizontal
ᵒ Large scale ingest and distribution
ᵒ Real-time ELTA (Extract Load Transform Analyze)
ᵒ Dimensional computation & aggregation
ᵒ Enforcing data quality and data governance requirements
ᵒ Real-time data enrichment with reference data
ᵒ Real-time machine learning model scoring
4. Apache Apex
• In-memory, distributed, parallel stream processing
ᵒ Application logic broken into components (operators) that execute distributed in a cluster
ᵒ Unobtrusive Java API to express (custom) logic
ᵒ Maintain state and metrics in member variables
ᵒ Windowing, event-time processing
• Scalable, high throughput, low latency
ᵒ Operators can be scaled up or down at runtime according to the load and SLA
ᵒ Dynamic scaling (elasticity), compute locality
• Fault tolerance & correctness
ᵒ Automatically recover from node outages without having to reprocess from beginning
ᵒ State is preserved, checkpointing, incremental recovery
ᵒ End-to-end exactly-once
• Operability
ᵒ System and application metrics, record/visualize data
ᵒ Dynamic changes and resource allocation, elasticity
6. Application Development Model
A stream is a sequence of data tuples.
A typical operator takes one or more input streams, performs computations, and emits one or more output streams.
• Each operator is your custom business logic in Java, or a built-in operator from the open source library
• An operator can have many instances that run in parallel; each instance is single-threaded
A Directed Acyclic Graph (DAG) is made up of operators and streams.
[Diagram: Directed Acyclic Graph (DAG) of operators connected by streams of tuples; the labeled streams include an output stream, filtered streams, and enriched streams.]
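The operator model described on this slide can be sketched in plain Java. This is only an illustration under stated assumptions, not the Apex operator API: `SketchOperator`, `FilterOperator`, `EnrichOperator`, and `DagSketch` are hypothetical names, the "@site-1" suffix is made-up reference data, and the two-operator DAG runs in a single thread instead of being distributed across a cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for an operator: one input stream, one output stream.
interface SketchOperator<I, O> {
    void process(I tuple, Consumer<O> output);
}

// Drops empty tuples; keeps a metric in a member variable.
class FilterOperator implements SketchOperator<String, String> {
    long dropped;

    @Override
    public void process(String tuple, Consumer<String> output) {
        if (tuple.isEmpty()) {
            dropped++;
        } else {
            output.accept(tuple);
        }
    }
}

// Appends a fixed piece of (made-up) reference data to every tuple.
class EnrichOperator implements SketchOperator<String, String> {
    private final String suffix;

    EnrichOperator(String suffix) {
        this.suffix = suffix;
    }

    @Override
    public void process(String tuple, Consumer<String> output) {
        output.accept(tuple + suffix);
    }
}

class DagSketch {
    // Wires input -> filter -> enrich -> sink and pushes every tuple through.
    static List<String> run(List<String> input) {
        FilterOperator filter = new FilterOperator();
        EnrichOperator enrich = new EnrichOperator("@site-1");
        List<String> sink = new ArrayList<>();
        for (String tuple : input) {
            filter.process(tuple, filtered -> enrich.process(filtered, sink::add));
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a", "", "b")));
    }
}
```

Each operator only sees its own input tuples and an output callback, which is what lets a platform place the operators in different processes and connect them with streams.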
12. Windowing - Apache Beam Model
ApexStream<String> stream = StreamFactory
    .fromFolder(localFolder)
    .flatMap(new Split())
    // Global window with an early trigger that fires every 1000 ms and accumulates fired panes
    .window(new WindowOption.GlobalWindow(),
        new TriggerOption().withEarlyFiringsAtEvery(Duration.millis(1000)).accumulatingFiredPanes())
    // Count per key and print the results
    .countByKey(new ConvertToKeyVal())
    .print();
Event-time
Session windows
Watermarks
Accumulation
Triggers
Keyed or Not Keyed
Allowed Lateness
Accumulation Mode
Merging streams
13. Fault Tolerance
• Operator state is checkpointed to persistent store
ᵒ Automatically performed by engine, no additional coding needed
ᵒ Asynchronous and distributed
ᵒ In case of failure operators are restarted from checkpoint state
• Automatic detection and recovery of failed containers
ᵒ Heartbeat mechanism
ᵒ YARN process status notification
• Buffering to enable replay of data from recovered point
ᵒ Fast, incremental recovery, spike handling
• Application master state checkpointed
ᵒ Snapshot of physical (and logical) plan
ᵒ Execution layer change log
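The checkpoint-and-replay idea above can be sketched in plain Java. This is a simplified illustration under stated assumptions, not how the Apex engine is implemented: `CheckpointSketch` and its methods are hypothetical names, the checkpoint is an in-memory copy instead of a persistent store, and the input list stands in for the upstream buffer that allows windows to be replayed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CheckpointSketch {
    private Map<String, Long> counts = new HashMap<>();     // live operator state
    private Map<String, Long> savedState = new HashMap<>(); // last checkpoint (in memory here)
    private int savedWindow = -1;                           // last window covered by the checkpoint

    void processWindow(List<String> window) {
        for (String key : window) {
            counts.merge(key, 1L, Long::sum);
        }
    }

    void checkpoint(int windowId) {
        savedState = new HashMap<>(counts);
        savedWindow = windowId;
    }

    // Restores the checkpointed state and returns the window it covers.
    int restore() {
        counts = new HashMap<>(savedState);
        return savedWindow;
    }

    // Processes all windows, checkpointing every checkpointInterval windows.
    // If failAfterWindow >= 0, one failure is simulated right after that window:
    // state is restored and the windows after the checkpoint are replayed.
    static Map<String, Long> run(List<List<String>> windows, int checkpointInterval, int failAfterWindow) {
        CheckpointSketch op = new CheckpointSketch();
        boolean failed = false;
        int w = 0;
        while (w < windows.size()) {
            op.processWindow(windows.get(w));
            if ((w + 1) % checkpointInterval == 0) {
                op.checkpoint(w);
            }
            if (!failed && w == failAfterWindow) {
                failed = true;
                w = op.restore() + 1; // replay from the first window after the checkpoint
                continue;
            }
            w++;
        }
        return op.counts;
    }

    public static void main(String[] args) {
        List<List<String>> windows = Arrays.asList(
            Arrays.asList("a", "b"), Arrays.asList("a"), Arrays.asList("b", "b"), Arrays.asList("a"));
        System.out.println(run(windows, 2, -1)); // no failure
        System.out.println(run(windows, 2, 2));  // failure after window 2
    }
}
```

Because the state is rolled back to the checkpoint before the replay starts, the operator state after recovery matches a run without failure, and only the windows after the checkpoint are reprocessed.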
17. Processing Guarantees
At-least-once
• On recovery data will be replayed from a previous checkpoint
ᵒ No messages lost
ᵒ Default, suitable for most applications
• Can be used to ensure data is written once to store
ᵒ Transactions with meta information, rewinding output, feedback from external entity, idempotent operations
At-most-once
• On recovery the latest data is made available to operator
ᵒ Useful in use cases where some data loss is acceptable and latest data is sufficient
Exactly-once
ᵒ At-least-once processing + idempotency + transactional mechanisms (operator logic) to achieve end-to-end exactly-once behavior
18. End-to-End Exactly Once
• Important when writing to external systems
• Data should not be duplicated or lost in the external system in case of application failures
• Common external systems
ᵒ Databases
ᵒ Files
ᵒ Message queues
• Exactly-once results = at-least-once + idempotency + consistent state
• Data duplication must be avoided when data is replayed from checkpoint
ᵒ Operators implement the logic dependent on the external system
ᵒ Platform provides checkpointing and repeatable windowing
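The formula "at-least-once + idempotency + consistent state" can be illustrated with a small plain-Java sketch. It is not taken from an Apex output operator: `ExternalStoreSketch` and `ExactlyOnceSketch` are hypothetical names, the list of rows stands in for a database table, and the sketch assumes repeatable windowing, meaning a replayed window carries the same id and the same tuples as the first time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical external system (for example a database table). Not an Apex class.
class ExternalStoreSketch {
    final List<String> rows = new ArrayList<>();
    long committedWindow = -1; // meta information stored together with the data

    // Simulates one transaction: the rows and the window id are written together.
    void commitWindow(long windowId, List<String> batch) {
        if (windowId <= committedWindow) {
            return; // replayed window that was already written: skip it
        }
        rows.addAll(batch);
        committedWindow = windowId;
    }
}

class ExactlyOnceSketch {
    // Writes windows 0..failAfter, then simulates recovery by replaying
    // from window replayFrom (the checkpointed position) to the end.
    static List<String> run(List<List<String>> windows, int failAfter, int replayFrom) {
        ExternalStoreSketch store = new ExternalStoreSketch();
        for (int w = 0; w <= failAfter; w++) {
            store.commitWindow(w, windows.get(w));
        }
        for (int w = replayFrom; w < windows.size(); w++) {
            store.commitWindow(w, windows.get(w)); // at-least-once replay
        }
        return store.rows;
    }

    public static void main(String[] args) {
        List<List<String>> windows = Arrays.asList(
            Arrays.asList("r1", "r2"), Arrays.asList("r3"), Arrays.asList("r4"));
        System.out.println(run(windows, 1, 0));
    }
}
```

The replay delivers windows 0 and 1 a second time, but the store ignores them because their ids are not newer than the committed window id, so each row ends up in the store once.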
19. Scalability
[Diagrams: a logical DAG (operators 0, 1, 2, 3) and physical plans with NxM partitions and unifiers. Captions: "Physical Diagram with operator 1 with 3 partitions"; "Physical DAG with (1a, 1b, 1c) and (2a, 2b): No bottleneck"; "Physical DAG with (1a, 1b, 1c) and (2a, 2b): Bottleneck on intermediate Unifier".]
20. Advanced Partitioning
[Diagrams: a logical DAG (operators 0, 1, 2, 3, 4) with two physical variants. Parallel Partition: captions "Physical DAG" and "Physical DAG with Parallel Partition", with partitions 1a, 2a, 3a and 1b, 2b, 3b ahead of a unifier. Cascading Unifiers: captions "Logical Plan", "Execution Plan, for N = 4; M = 1", and "Execution Plan, for N = 4; M = 1, K = 2 with cascading unifiers", showing upstream operators (uopr1 to uopr4), unifiers, and a downstream operator (dopr) placed in containers with NICs.]
21. Dynamic Partitioning
• Partitioning change while application is running
ᵒ Change number of partitions at runtime based on stats
ᵒ Determine initial number of partitions dynamically
• Kafka operators scale according to number of Kafka partitions
ᵒ Supports re-distribution of state when the number of partitions changes
ᵒ API for custom scaler or partitioner
[Diagrams: example physical plans with different partition counts for operators 1, 2, and 3 (partitions such as 1a, 1b, 2a to 2d, 3a, 3b). Unifiers not shown.]
22. How dynamic partitioning works
• Partitioning decision (yes/no) by trigger (StatsListener)
ᵒ Pluggable component, can use any system or custom metric
ᵒ Externally driven partitioning example: KafkaInputOperator
• Stateful!
ᵒ Uses checkpointed state
ᵒ Ability to transfer state from old to new partitions (partitioner, customizable)
ᵒ Steps:
• Call partitioner
• Modify physical plan, rewrite checkpoints as needed
• Undeploy old partitions from execution layer
• Release/request container resources
• Deploy new partitions (from rewritten checkpoint)
ᵒ No loss of data (buffered)
ᵒ Incremental operation, partitions that don’t change continue processing
• API: Partitioner interface
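The state transfer step ("rewrite checkpoints as needed") can be illustrated with a plain-Java sketch of key-based redistribution. It does not use the Apex Partitioner interface: `RepartitionSketch`, `partitionFor`, and `repartition` are hypothetical names, and the hash-modulo assignment of keys to partitions is an assumption chosen for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RepartitionSketch {
    // Assumed routing rule: a key belongs to partition hash(key) mod partitionCount.
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    // Takes the checkpointed per-key state of the old partitions and
    // redistributes it over newCount partitions using the same routing rule.
    static List<Map<String, Long>> repartition(List<Map<String, Long>> oldPartitions, int newCount) {
        List<Map<String, Long>> result = new ArrayList<>();
        for (int i = 0; i < newCount; i++) {
            result.add(new HashMap<>());
        }
        for (Map<String, Long> old : oldPartitions) {
            for (Map.Entry<String, Long> entry : old.entrySet()) {
                int target = partitionFor(entry.getKey(), newCount);
                // merge() also covers the case where two old partitions held the same key
                result.get(target).merge(entry.getKey(), entry.getValue(), Long::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> p0 = new HashMap<>();
        p0.put("a", 2L);
        p0.put("b", 1L);
        Map<String, Long> p1 = new HashMap<>();
        p1.put("c", 5L);
        System.out.println(repartition(Arrays.asList(p0, p1), 3));
    }
}
```

After the redistribution every key lives in the partition that will receive its tuples under the new partition count, so the new partitions can resume from the rewritten state without losing counts.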
23. Compute Locality
• Host Locality
ᵒ Operators can be deployed on specific hosts
• (Anti-)Affinity
ᵒ Ability to express relative deployment without specifying a host
Locality levels: Default (serialization + IPC), HOST (serialization, loopback), CONTAINER (in-process queue), THREAD (callstack)
• By default operators are distributed on different nodes in the cluster
• Can be collocated on machine, container or thread basis for efficiency
25. Performance: Throughput vs. Latency?
https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
26. High-Throughput and Low-Latency
Apex, Flink w/ 4 Kafka brokers: 2.7 million events/second, Kafka latency limit
Apex w/o Kafka and Redis: 43 million events/second, with more than 90 percent of events processed with a latency of less than 0.5 seconds
https://www.datatorrent.com/blog/throughput-latency-and-yahoo/
27. Recent Additions & Roadmap
• Declarative Java API
• Windowing Semantics following Beam model
• Scalable state management
• SQL support using Apache Calcite
• Apache Beam Runner, SAMOA integration
• Enhanced support for Batch Processing
• Support for Mesos
• Encrypted Streams
• Python support for operator logic and API
• Replacing operator code at runtime
• Dynamic attribute changes
• Named checkpoints
31. Who is using Apex?
• Powered by Apex
ᵒ http://apex.apache.org/powered-by-apex.html
ᵒ Also using Apex? Let us know to be added: users@apex.apache.org or @ApacheApex
• PubMatic
ᵒ https://www.youtube.com/watch?v=JSXpgfQFcU8
• GE
ᵒ https://www.youtube.com/watch?v=hmaSkXhHNu0
ᵒ http://www.slideshare.net/ApacheApex/ge-iot-predix-time-series-data-ingestion-service-using-apache-apex-hadoop
• Silver Spring Networks
ᵒ https://www.youtube.com/watch?v=8VORISKeSjI
ᵒ http://www.slideshare.net/ApacheApex/iot-big-data-ingestion-and-processing-in-hadoop-by-silver-spring-networks
32. Maximize Revenue w/ real-time insights
PubMatic is the leading marketing automation software company for publishers. Through real-time analytics, yield management, and workflow automation, PubMatic enables publishers to make smarter inventory decisions and improve revenue performance.
Business Need
• Ingest and analyze high volume clicks & views in real-time to help customers improve revenue
ᵒ 200K events/second data flow
• Report critical metrics for campaign monetization from auction and client logs
ᵒ 22 TB/day data generated
• Handle ever increasing traffic with efficient resource utilization
• Always-on ad network, feedback loop for ad server
Apex based Solution
• DataTorrent Enterprise platform, powered by Apache Apex
• In-memory stream processing
• Comprehensive library of pre-built operators including connectors
• Built-in fault tolerance
• Dynamically scalable
• Real-time query from in-memory state
• Management UI & Data Visualization console
Client Outcome
• Helps PubMatic deliver ad performance insights to publishers and advertisers in real-time instead of 5+ hours
• Helps publishers visualize campaign performance and adjust ad inventory in real-time to maximize their revenue
• Enables PubMatic to reduce OPEX with efficient compute resource utilization
• Built-in fault tolerance ensures customers can always access the ad network
33. Industrial IoT applications
GE is dedicated to providing advanced IoT analytics solutions to thousands of customers who are using their devices and sensors across different verticals. GE has built a sophisticated analytics platform, Predix, to help its customers develop and execute Industrial IoT applications and gain real-time insights as well as actions.
Business Need
• Ingest and analyze high-volume, high-speed data from thousands of devices and sensors per customer in real-time without data loss
• Predictive analytics to reduce costly maintenance and improve customer service
• Unified monitoring of all connected sensors and devices to minimize disruptions
• Fast application development cycle
• High scalability to meet changing business and application workloads
Apex based Solution
• Ingestion application using DataTorrent Enterprise platform
• Powered by Apache Apex
• In-memory stream processing
• Built-in fault tolerance
• Dynamic scalability
• Comprehensive library of pre-built operators
• Management UI console
Client Outcome
• Helps GE improve performance and lower cost by enabling real-time Big Data analytics
• Helps GE detect possible failures and minimize unplanned downtimes with centralized management & monitoring of devices
• Enables faster innovation with short application development cycle
• No data loss and 24x7 availability of applications
• Helps GE adjust to scalability needs with auto-scaling
34. Smart energy applications
Silver Spring Networks helps global utilities and cities connect, optimize, and manage smart energy and smart city infrastructure. Silver Spring Networks receives data from over 22 million connected devices and conducts 2 million remote operations per year.
Business Need
• Ingest high-volume, high-speed data from millions of devices & sensors in real-time without data loss
• Make data accessible to applications without delay to improve customer service
• Capture & analyze historical data to understand & improve grid operations
• Reduce the cost, time, and pain of integrating with 3rd party apps
• Centralized management of software & operations
Apex based Solution
• DataTorrent Enterprise platform, powered by Apache Apex
• In-memory stream processing
• Pre-built operators/connectors
• Built-in fault tolerance
• Dynamically scalable
• Management UI console
Client Outcome
• Helps Silver Spring Networks ingest & analyze data in real-time for effective load management & customer service
• Helps Silver Spring Networks detect possible failures and reduce outages with centralized management & monitoring of devices
• Enables fast application development for faster time to market
• Helps Silver Spring Networks scale with easy-to-partition operators
• Automatic recovery from failures