Apache Apex Meetup at Cask

Apex as YARN application

Chinmay Kolhatkar

Smart Partitioning with Apache Apex (Webinar)

Processing big data often requires running the same computations in parallel in multiple processes or threads, called partitions, with each partition handling a subset of the data. This becomes all the more necessary when processing live data streams, where maintaining SLAs is paramount. Furthermore, an application is made up of multiple different computations, each of which may have different partitioning needs. Partitioning also needs to adapt to changing data rates, input sources and other application requirements such as SLAs. In this talk, we will introduce how Apache Apex, a distributed stream processing platform on Hadoop, handles partitioning. We will look at the different partitioning schemes provided by Apex, some of which are unique in this space. We will also look, with examples, at how Apex does dynamic partitioning, a feature unique to and pioneered by Apex, to handle varying data needs. We will also talk about the different utilities and libraries that Apex provides so that users can implement their own custom partitioning.
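As a rough illustration of the idea of partitions each handling a subset of the data, the sketch below routes keys to partitions by hash. It is plain Java written for this page, not the Apex partitioning API; the class name `KeyPartitioner` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: plain Java, not the Apex partitioning API.
class KeyPartitioner {
    private final int partitionCount;

    KeyPartitioner(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        this.partitionCount = partitionCount;
    }

    // Map a key to a partition index in [0, partitionCount).
    // floorMod keeps the index non-negative even for negative hash codes.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    // Split a batch of keys into one list per partition.
    List<List<String>> split(List<String> keys) {
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key)).add(key);
        }
        return partitions;
    }
}
```

Because the same key always lands in the same partition, per-key state can stay local to one partition. Changing `partitionCount` remaps keys, which hints at why repartitioning a running, stateful computation is harder than this static sketch suggests.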

David Yan offers an overview of Apache Apex, a stream processing engine used in production by several large companies for real-time data analytics. Apache Apex uses a programming paradigm based on a directed acyclic graph (DAG). Each node in the DAG represents an operator, which can be data input, data output, or data transformation. Each directed edge in the DAG represents a stream, which is the flow of data from one operator to another. As part of Apex, the Malhar library provides a suite of connector operators so that Apex applications can read from or write to various data sources. It also includes utility operators that are commonly used in streaming applications, such as parsers, deduplicators and join, and generic building blocks that facilitate scalable state management and checkpointing. In addition to processing based on ingression time and processing time, Apex supports event-time windows and session windows. It also supports windowing, watermarks, allowed lateness, accumulation mode, triggering, and retraction detailed by Apache Beam as well as feedback loops in the DAG for iterative processing and at-least-once and “end-to-end” exactly-once processing guarantees. Apex provides various ways to fine-tune applications, such as operator partitioning, locality, and affinity. Apex is integrated with several open source projects, including Apache Beam, Apache Samoa (distributed machine learning), and Apache Calcite (SQL-based application specification). Users can choose Apex as the backend engine when running their application model based on these projects. David explains how to develop fault-tolerant streaming applications with low latency and high throughput using Apex, presenting the programming model with examples and demonstrating how custom business logic can be integrated using both the declarative high-level API and the compositional DAG-level API.
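The DAG model described above (operators as nodes, streams as directed edges) can be pictured with a small stand-alone sketch. This is plain Java, not the Apex DAG API; `MiniDag` and its method names are hypothetical, and feedback loops are deliberately left out of the model.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a toy operator graph, not the Apex DAG API.
class MiniDag {
    // Operator name -> names of downstream operators (each edge is a stream).
    private final Map<String, List<String>> streams = new LinkedHashMap<>();

    void addOperator(String name) {
        streams.putIfAbsent(name, new ArrayList<>());
    }

    void addStream(String from, String to) {
        addOperator(from);
        addOperator(to);
        streams.get(from).add(to);
    }

    // Kahn's algorithm: order operators so that every stream flows forward.
    List<String> topologicalOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String op : streams.keySet()) {
            inDegree.put(op, 0);
        }
        for (List<String> targets : streams.values()) {
            for (String target : targets) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : inDegree.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String op = ready.poll();
            order.add(op);
            for (String target : streams.get(op)) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        if (order.size() != streams.size()) {
            // This toy model does not support cycles (feedback loops).
            throw new IllegalStateException("graph has a cycle");
        }
        return order;
    }
}
```

For example, adding the streams input to parser, parser to dedup, and dedup to output yields the order input, parser, dedup, output.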

Apache Apex: Stream Processing Architecture and Applications

Slides from http://www.meetup.com/Hadoop-User-Group-Munich/events/230313355/ This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.

Introduction to Apache Apex and writing a big data streaming application

Introduction to Apache Apex - The next generation native Hadoop platform, and writing a native Hadoop big data Apache Apex streaming application. This talk will cover details about how Apex can be used as a powerful and versatile platform for big data. Apache Apex is being used in production by customers for both streaming and batch use cases. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch, alerts, real-time actions, threat detection, etc. Presenter: Pramod Immaneni, Apache Apex PPMC member and senior architect at DataTorrent Inc, where he works on Apex and specializes in big data applications. Prior to DataTorrent he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs. Before that he was a technical co-founder of a mobile startup where he was an architect of a dynamic content rendering engine for mobile devices. This is a video of the webcast of an Apache Apex meetup event organized by Guru Virtues at 267 Boston Rd no. 9, North Billerica, MA, on May 7th 2016 and broadcast from San Jose, CA. If you are interested in helping organize the Apache Apex community (i.e., hosting, presenting, community leadership), please email apex-meetup@datatorrent.com

Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex

This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.

Introduction to Apache Apex

Apache Apex Kafka Input Operator

Architectual Comparison of Apache Apex and Spark Streaming

This presentation discusses architectural differences between Apache Apex and Spark Streaming. It discusses how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large scale ingestion. It also covers fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. Further, it discusses how these features affect time to market and total cost of ownership.

Introduction to Apache Apex - CoDS 2016

Bhupesh Chawda

Intro to Apache Apex @ Women in Big Data

Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming

Presenter: Devendra Tagare - DataTorrent Engineer, Contributor to Apex, Data Architect experienced in building high scalability big data platforms. Apache Apex is a next generation native Hadoop big data platform. This talk will cover details about how it can be used as a powerful and versatile platform for big data. Apache Apex is a native Hadoop data-in-motion platform. We will discuss architectural differences between Apache Apex and Spark Streaming. We will discuss how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large scale ingestion. We will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. We will also discuss how these features affect time to market and total cost of ownership.

Apache Apex connector with Kafka 0.9 consumer API

Capital One's Next Generation Decision in less than 2 ms

IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform

Internet of Things (IoT) devices are becoming more ubiquitous in consumer, business and industrial landscapes. They are being widely used in applications ranging from home automation to the industrial internet. They pose a unique challenge in terms of the volume of data they produce, the velocity with which they produce it, and the variety of sources that need to be handled. The challenge is to ingest and process this data at the speed at which it is being produced, in a real-time and fault tolerant fashion. Apache Apex is an industrial grade, scalable and fault tolerant big data processing platform that runs natively on Hadoop. In this deck, you will see how Apex is being used in IoT applications and also see how enterprise features such as dimensional analytics, real-time dashboards and monitoring play a key role. Presented by Pramod Immaneni, Principal Architect at DataTorrent and Apache Apex PPMC member, in a BrightTALK webinar on Apr 6th, 2016.

Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex

Stream data processing is becoming increasingly important to support business needs for faster time to insight and action with growing volume of information from more sources. Apache Apex (http://apex.apache.org/) is a unified big data in motion processing platform for the Apache Hadoop ecosystem. Apex supports demanding use cases with:
* Architecture for high throughput, low latency and exactly-once processing semantics.
* Comprehensive library of building blocks including connectors for Kafka, Files, Cassandra, HBase and many more.
* Java based with unobtrusive API to build real-time and batch applications and implement custom business logic.
* Advanced engine features for auto-scaling, dynamic changes, compute locality.
Apex has been in development since 2012 and is used in production in various industries like online advertising, Internet of Things (IoT) and financial services.

Integrating Apache NiFi and Apache Apex

DataTorrent Presentation @ Big Data Application Meetup

Extending The Yahoo Streaming Benchmark to Apache Apex

Intro to Apache Apex - Next Gen Platform for Ingest and Transform

Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch, alerts, real-time actions, threat detection, etc. Bio: Pramod Immaneni is an Apache Apex PMC member and senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platform and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs.

Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application

This webinar will be a hands-on demonstration of how to clone and build the Apache Apex source code repositories, how to run the Maven archetype to create a new Apex project, how to enhance it to build a word counting application and finally, how to run it and view results. We will also do a brief code walkthrough. Bio: Dr. Munagala V. Ramanath is a Committer for Apache Apex and a Software Engineer at DataTorrent. He has many years of experience working for a variety of companies in California and a Ph.D. in Computer Science from the University of Wisconsin, Madison.
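The core logic of a word counting application can be sketched in a few lines. This is plain Java, assuming whitespace and punctuation separate words and that case is ignored; it is not the code from the webinar or the Apex archetype, and `WordCount` is a hypothetical name.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: the counting step of a word count, outside any Apex operator.
class WordCount {
    // Count case-insensitive occurrences of each word across all lines.
    static Map<String, Integer> count(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Split on runs of non-word characters (spaces, punctuation).
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

In a streaming setting the same counting step would typically live inside an operator that receives lines one at a time and emits updated counts downstream.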

Ingesting Data from Kafka to JDBC with Transformation and Enrichment

Presenter: Dr Sandeep Deshmukh, Apache Apex committer and DataTorrent engineer. Abstract: Ingesting and extracting data from Hadoop can be a frustrating, time consuming activity for many enterprises. Apache Apex Data Ingestion is a standalone big data application that simplifies the collection, aggregation and movement of large amounts of data to and from Hadoop for a more efficient data processing pipeline. Apache Apex Data Ingestion makes configuring and running Hadoop data ingestion and data extraction a point and click process, enabling a smooth, easy path to your Hadoop-based big data project. In this series of talks, we will cover how Hadoop ingestion is made easy using Apache Apex. The third talk in this series will focus on ingesting unbounded data from Kafka to JDBC with a couple of processing operators: transform and enrichment.

February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...

Yahoo Developer Network

Presentation on Apache Apex, the enterprise-grade big data analytics platform and how it is used in production use cases. In this talk you will learn about:
• Architecture highlights: high throughput, low-latency, operability with stateful fault tolerance, strong processing guarantees, auto-scaling etc
• Application development model, unified approach for real-time and batch use cases
• Tools for ease of use, ease of operability and ease of management
• How customers use Apache Apex in production
Speakers: Pramod Immaneni is Apache Apex (incubating) PPMC member, committer and senior architect at DataTorrent Inc, where he works on Apex and specializes in big data applications. Prior to DataTorrent he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs. Prior to that he was a technical co-founder of a mobile startup where he was an architect of a dynamic content rendering engine for mobile devices.

Apache Apex - Hadoop Users Group

Pramod Immaneni

What's hot

Developing streaming applications with apache apex (strata + hadoop world)

Apache Apex: Stream Processing Architecture and Applications

Introduction to Apache Apex and writing a big data streaming application

Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex

Introduction to Apache Apex

Apache Apex Kafka Input Operator

Architectual Comparison of Apache Apex and Spark Streaming

Introduction to Apache Apex - CoDS 2016

Bhupesh Chawda

Intro to Apache Apex @ Women in Big Data

Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming

Apache Apex connector with Kafka 0.9 consumer API

Capital One's Next Generation Decision in less than 2 ms

IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform

Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex

Integrating Apache NiFi and Apache Apex

DataTorrent Presentation @ Big Data Application Meetup

Extending The Yahoo Streaming Benchmark to Apache Apex

Intro to Apache Apex - Next Gen Platform for Ingest and Transform

Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application

Ingesting Data from Kafka to JDBC with Transformation and Enrichment

In-Memory Computing Summit

Similar to Apache Apex Meetup at Cask

February 2016 HUG: Apache Apex (incubating): Stream Processing Architecture a...

Yahoo Developer Network

Apache Apex - Hadoop Users Group

Pramod Immaneni

IMCSummit 2015 - Day 1 IT Business Track - Designing a Big Data Analytics Pla...

With the advent of new open source platforms around Hadoop, NoSQL databases & in-memory databases, the data management stack in the enterprise is undergoing complete re-platforming. Batch and stream processing are two distinct data processing paradigms that need to be supported over this new stack. In this session I will talk about the importance of having a unified batch and stream processing engine and share my learning around:
- Sample use cases that bring out the need for a unified stream & batch processing engine
- Important features needed in the unified platform to tackle the above use cases

Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations

Artur Borycki - Beyond Lambda - how to get from logical to physical - code.ta...

AboutYouGmbH

Capital One: Using Cassandra In Building A Reporting Platform

DataStax Academy

As a leader in the financial industry, Capital One applications generate huge amounts of data that require fast and accurate handling, storage and analysis. We are transforming how we report operational data to our internal users so that they can make quick and precise business decisions to serve our customers. As part of this transformation, we are building a new Go-based data processing framework that will enable us to transfer data from multiple data stores (RDBMS, files, etc.) to a single NoSQL database - Cassandra. This new NoSQL store will act as a reporting database that will receive data on a near real-time basis and serve the data through scorecards and reports. We would like to share our experience in defining this fast data platform and the methodologies used to model financial data in Cassandra.

LLAP: long-lived execution in Hive

Apache Tez – Present and Future

Rajesh Balamohan

Apache Tez – Present and Future

Jianfeng Zhang

What's New in Apache Hive 3.0 - Tokyo

Apache Hive is a rapidly evolving project that is widely adopted in the big data ecosystem. Hive continues to expand its support for analytics, reporting, and interactive queries, and the community is working to improve it along many other dimensions and use cases. This talk introduces the latest and greatest features and optimizations that have appeared in the project over the last year. This includes benchmarks covering LLAP, materialized views and Apache Druid integration, workload management, ACID improvements, using Hive in the cloud, and performance improvements. It will also give a brief look at what to expect in the future.

What's New in Apache Hive 3.0?

Apache Flink: Past, Present and Future

Gyula Fóra

Pivotal deep dive_on_pivotal_hd_world_class_hdfs_platform

EMC

Hive 3 New Horizons DataWorks Summit Melbourne February 2019

alanfgates

What is New in Apache Hive 3.0?

Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work in improving it along with many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features. We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future. Speaker: Alan Gates, Co-Founder, Hortonworks

Boost Performance with Scala – Learn From Those Who’ve Done It!

Cécile Poyet

Scalding is a scala DSL for Cascading. Run on Hadoop, it’s a concise, functional, and very efficient way to build big data applications. One significant benefit of Scalding is that it allows easy porting of Scalding apps from MapReduce to newer, faster execution fabrics. In this webinar, Cyrille Chépélov, of Transparency Rights Management, will share how his organization boosted the performance of their Scalding apps by over 50% by moving away from MapReduce to Cascading 3.0 on Apache Tez. Dhruv Kumar, Hortonworks Partner Solution Engineer, will then explain how you can interact with data on HDP using Scala and leverage Scala as a programming language to develop Big Data applications.

Boost Performance with Scala – Learn From Those Who’ve Done It!

Cécile Poyet

Boost Performance with Scala – Learn From Those Who’ve Done It!

Hortonworks

SAP HANA SPS10- Enterprise Information Management

SAP Technology

What is new in Apache Hive 3.0?

More from Apache Apex

Low Latency Polyglot Model Scoring using Apache Apex

Data science is fast becoming a complementary approach and process to solve business challenges today. The explosion of frameworks to help data scientists build models bears testimony to this. However, when a model needs to be turned into a production version in very low latency and enterprise grade environments, there are very few choices, each with its own strengths and weaknesses. Adding to this is the current disconnect between a data scientist's world, which is all about modelling, and an engineer's world, which is about SLAs and service guarantees. A framework like Apache Apex can complement each of these roles and provide constructs for both these worlds. This would help enterprises to drastically cut down the cost of model deployment to production environments.

From Batch to Streaming with Apache Apex Dataworks Summit 2017

Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare

The presentation covers how Apache Apex is used to deliver actionable insights in real-time for Ad-tech. It includes a reference architecture to provide dimensional aggregates on TB scale for billions of events per day. The reference architecture covers concepts around Apache Apex, with Kafka as source and dimensional compute. Slides from Devendra Tagare at Apache Big Data North America in Miami 2017.

Apache Big Data EU 2016: Building Streaming Applications with Apache Apex

Stream processing applications built on Apache Apex run on Hadoop clusters and typically power analytics use cases where availability, flexible scaling, high throughput, low latency and correctness are essential. These applications consume data from a variety of sources, including streaming sources like Apache Kafka, Kinesis or JMS, file based sources or databases. Processing results often need to be stored in external systems (sinks) for downstream consumers (pub-sub messaging, real-time visualization, Hive and other SQL databases etc.). Apex has the Malhar library with a wide range of connectors and other operators that are readily available to build applications. We will cover key characteristics like partitioning and processing guarantees, generic building blocks for new operators (write-ahead-log, incremental state saving, windowing etc.) and APIs for application specification.
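Windowing, one of the building blocks listed above, can be illustrated with a fixed-size (tumbling) event-time window counter. This is a plain Java sketch under the assumption that event timestamps are epoch milliseconds; it is not the Malhar windowing implementation, and it ignores watermarks, lateness and triggering. `TumblingWindowCounter` is a hypothetical name.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: counts events per fixed-size event-time window.
class TumblingWindowCounter {
    private final long windowMillis;
    private final Map<Long, Integer> countsByWindowStart = new TreeMap<>();

    TumblingWindowCounter(long windowMillis) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("windowMillis must be positive");
        }
        this.windowMillis = windowMillis;
    }

    // Start of the window that contains the given event time.
    long windowStart(long eventTimeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, windowMillis);
    }

    // Assign the event to its window by event time, regardless of arrival order.
    void add(long eventTimeMillis) {
        countsByWindowStart.merge(windowStart(eventTimeMillis), 1, Integer::sum);
    }

    Map<Long, Integer> counts() {
        return countsByWindowStart;
    }
}
```

With a window of 1000 ms, events at 100 and 999 fall into the window starting at 0, an event at 1000 into the window starting at 1000, and an event at 2500 into the window starting at 2000.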

Deep Dive into Apache Apex App Development

Hadoop Interacting with HDFS

Introduction to Real-Time Data Processing

Introduction to Yarn

Introduction to Map Reduce

HDFS Internals

Intro to Big Data Hadoop

Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)

Big Data Berlin v8.0 Stream Processing with Apache Apex

Ingestion and Dimensions Compute and Enrich using Apache Apex

Apache Beam (incubating)

Presenter: Kenn Knowles, Software Engineer, Google & Apache Beam (incubating) PPMC member. Apache Beam (incubating) is a programming model and library for unified batch & streaming big data processing. This talk will cover the Beam programming model broadly, including its origin story and vision for the future. We will dig into how Beam separates concerns for authors of streaming data processing pipelines, isolating what you want to compute from where your data is distributed in time and when you want to produce output. Time permitting, we might dive deeper into what goes into building a Beam runner, for example atop Apache Apex.

Java High Level Stream API

Presenter: Siyuan Hua, Apache Apex PMC Member & DataTorrent Engineer. Apache Apex provides a DAG construction API that gives developers full control over the logical plan. Some use cases don't require all of that flexibility, or at least so it may appear initially. Also, a large part of the audience may be more familiar with an API that has more of a functional programming flavor, such as the new Java 8 Stream interfaces and the Apache Flink and Spark Streaming APIs. Thus, to help Apex beginners get a simple first app running with a familiar API, we are now providing the Stream API on top of the existing DAG API. The Stream API is designed to be easy to use yet flexible to extend and compatible with the native Apex API. This means developers can construct their application in a way similar to Flink or Spark but also have the power to fine tune the DAG at will. Per our roadmap, the Stream API will closely follow the Apache Beam (aka Google Dataflow) model. In the future, you should be able to either easily run Beam applications with the Apex engine or express an existing application in a more declarative style.
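For readers unfamiliar with the functional style referred to above, the following sketch uses the standard Java 8 `java.util.stream` interfaces. It is not the Apex Stream API, whose method names are not given here; `FunctionalPipeline` and `averageAbove` are hypothetical names.

```java
import java.util.List;

// Illustrative sketch only: standard Java 8 streams, not the Apex Stream API.
class FunctionalPipeline {
    // Keep readings at or above the threshold and return their average,
    // or 0.0 when no reading qualifies.
    static double averageAbove(List<Double> readings, double threshold) {
        return readings.stream()
                .filter(reading -> reading >= threshold)
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
    }
}
```

The pipeline is declared as a chain of transformations rather than as explicitly wired operators, which is the contrast the abstract draws with a DAG construction API.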

Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac

Making sense of Apache Bigtop's role in ODPi and how it matters to Apache Apex

Apache Apex & Bigtop

Chinmay Kolhatkar: Engineer, DataTorrent & Committer, Apache Apex. For ease of use and deployment, Apache Apex leverages Apache Bigtop. Apex, being part of the Bigtop stack, can be easily deployed on both Debian- and RPM-based cluster systems, and validation tests can be run against the installation. This talk will include a demo of how to install apex-bigtop and use it. It also covers a test sandbox Docker environment, with bigtop-hadoop and bigtop-apex pre-installed, for quickly getting started with Apex.

Building Your First Apache Apex Application