Extending The Yahoo Streaming Benchmark to Apache Apex

Extending the Yahoo streaming computation benchmark to Apache Apex:
• Application topology
• Comparison of results between Storm, Flink and Apex
• Variation of the Apex benchmarking app with event time and 'results query' support

Extending the Yahoo Streaming Benchmark for Apache Apex
San Jose Apache Apex Meetup
May 4th, 2016
Sandesh Hegde
sandesh@apache.org

Background
• Yahoo created a benchmark for stream processing systems and used it to
compare Storm, Flink and Spark Streaming [1]
• data Artisans extended the benchmark by comparing Flink and Storm under
different scenarios [2]
• No stream processing benchmark comparison is complete without
Apache Apex.

Yahoo Streaming Benchmark
Simple advertisement application: count how many times an ad
campaign has been seen in a window.
• Read ads from Kafka
• Deserialize the JSON string
• Filter out unnecessary ads
• Project fields (remove non-essential fields)
• Join ad id with campaign id from Redis
• Windowed count per campaign, output to Redis
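The steps above can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not the Apex application: a dict stands in for the Redis lookup table, and the window length (10 seconds), the filter predicate (keep "view" events) and all ids are assumptions that the slides do not state.

```python
import json
from collections import Counter

# Hypothetical stand-in for the Redis ad_id -> campaign_id lookup table.
AD_TO_CAMPAIGN = {"ad-1": "campaign-A", "ad-2": "campaign-A", "ad-3": "campaign-B"}

WINDOW_MS = 10_000          # assumed window length; the slides do not state one
WANTED_EVENT_TYPE = "view"  # assumed filter predicate


def process(raw_messages):
    """Return {(campaign_id, window_start_ms): count} for the given JSON strings."""
    counts = Counter()
    for raw in raw_messages:
        event = json.loads(raw)                       # deserialize JSON string
        if event["event_type"] != WANTED_EVENT_TYPE:  # filter unnecessary ads
            continue
        ad_id = event["ad_id"]                        # projection: keep two fields
        event_time = int(event["event_time"])
        campaign_id = AD_TO_CAMPAIGN.get(ad_id)       # join ad id with campaign id
        if campaign_id is None:
            continue
        window_start = event_time - event_time % WINDOW_MS
        counts[(campaign_id, window_start)] += 1      # windowed count per campaign
    return dict(counts)


messages = [
    json.dumps({"ad_id": "ad-1", "event_type": "view", "event_time": "1000"}),
    json.dumps({"ad_id": "ad-2", "event_type": "view", "event_time": "9999"}),
    json.dumps({"ad_id": "ad-3", "event_type": "click", "event_time": "2000"}),
    json.dumps({"ad_id": "ad-3", "event_type": "view", "event_time": "12000"}),
]
result = process(messages)
```

In the benchmark each step is a separate operator, so the steps can be partitioned and placed independently; the single loop here only shows what happens to one event.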

Application - with Kafka
[Topology diagram: Kafka → Kafka Input → Deserialize → Filter → Filter Fields → Redis Join → Redis Output]

Setup
• Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
• 10GigE between compute nodes
• 4 Kafka brokers (2 partitions each, 1 replica)
• Kafka version: 0.8.2
• Apex (3.4-SNAPSHOT and 3.3) and Flink (1.0.2)
• YARN container size: 16 GB
• 1 ZooKeeper
• Message size: 218 bytes
• Sample message:
{"user_id":"e5e0db4b-05ea-4ac5-af7a-4bba5ed27c4c","page_id":"80f60d0a-b02b-40e2-a667-5548a1120dda","ad_id":"600589859","ad_type":"banner78","event_type":"purchase","event_time":"1462374087774","ip_address":"1.2.3.4"}

Quick Primer on Locality
• CONTAINER_LOCAL
■ Deployed in the same process, different threads
■ No serialization
■ Queue between the operators
• THREAD_LOCAL
■ Same thread
■ No serialization
■ Use it only when operators do light work
Note: [New feature] Anti Affinity is not covered here.
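The difference between the two localities can be mimicked in plain Python. This is an analogy under stated assumptions, not Apex code: `double` and `add_one` are hypothetical operators, a `queue.Queue` plays the role of the queue between operators, and a direct function call plays the role of THREAD_LOCAL hand-over.

```python
import queue
import threading


def double(x):
    """Hypothetical upstream operator."""
    return x * 2


def add_one(x):
    """Hypothetical downstream operator."""
    return x + 1


def run_thread_local(items):
    # THREAD_LOCAL analogue: both operators run in the caller's thread and the
    # tuple is handed over by a direct function call.
    return [add_one(double(x)) for x in items]


def run_container_local(items):
    # CONTAINER_LOCAL analogue: each operator has its own thread in the same
    # process; tuples cross an in-memory queue as object references, so nothing
    # is serialized.
    stream = queue.Queue()
    done = object()  # end-of-stream marker
    out = []

    def upstream():
        for x in items:
            stream.put(double(x))
        stream.put(done)

    def downstream():
        while True:
            item = stream.get()
            if item is done:
                break
            out.append(add_one(item))

    threads = [threading.Thread(target=upstream), threading.Thread(target=downstream)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


data = [1, 2, 3]
thread_local_result = run_thread_local(data)
container_local_result = run_container_local(data)
```

Both variants produce the same output. The queue version lets the two operators work concurrently at the cost of a hand-off per tuple, which is why the same-thread variant suits operators that do light work.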

Benchmarking Against Previous Releases
https://www.datatorrent.com/blog/blog-apex-performance-benchmark/
Part of Release Certification

Application : with Kafka
https://github.com/sandeshh/streaming-benchmarks

Application - With Generator
[Topology diagram. Labels: Generator, Kafka, Kafka Input, Deserialize, Filter, Filter Fields, Redis Join, Redis Output]
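A generator of this kind can be sketched as follows. This is a minimal illustration, assuming that the generator emits events with the same field names as the sample message in the setup; the value lists for `ad_type` and `event_type` and the second ad id are assumptions, not taken from the slides.

```python
import json
import random
import time
import uuid

# Assumed value sets; the slides only show "banner78" and "purchase".
AD_TYPES = ["banner", "modal", "mail"]
EVENT_TYPES = ["view", "click", "purchase"]


def generate_event(ad_ids, rng, now_ms=None):
    """Build one JSON event with the same field names as the sample message."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    event = {
        "user_id": str(uuid.uuid4()),
        "page_id": str(uuid.uuid4()),
        "ad_id": rng.choice(ad_ids),
        "ad_type": rng.choice(AD_TYPES),
        "event_type": rng.choice(EVENT_TYPES),
        "event_time": str(now_ms),      # milliseconds, as a string like the sample
        "ip_address": "1.2.3.4",
    }
    return json.dumps(event, separators=(",", ":"))


rng = random.Random(42)
ad_ids = ["600589859", "600589860"]  # first id is from the sample, second is made up
events = [generate_event(ad_ids, rng, now_ms=1462374087774) for _ in range(3)]
```

Producing events inside the application takes the message broker out of the measurement, so the numbers reflect the processing engine rather than the ingest path.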

Application - With Generator
https://github.com/sandeshh/streaming-benchmarks
Setup: Single Partition

State of the Art & Streaming
[Topology diagram: Generator → Filter → Filter Fields → Redis Join → Redis Output]
What is our recommendation for querying the state?
An in-memory key-value store in the operators?

Application - State Store & Query
[Topology diagram: Generator → Filter → Filter Fields → Redis Join → Dimensional Computation → Store (HDHT), with Query and Result]
1. Durable state (HDHT is a key-value store native to Hadoop) [4]
2. Single System, scales with your application
3. Easy integration with external Consoles [7]
4. Low operability cost
5. Complex Dimensional Computation [5][6]
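The idea of aggregating into a store and answering queries from it can be sketched as below. This is an illustration only: `InMemoryStore` is a hypothetical dict-backed stand-in and is not durable, unlike HDHT; the one-minute bucket size, the function names and the query shape are assumptions.

```python
from collections import defaultdict

BUCKET_MS = 60_000  # assumed one-minute event-time buckets


class InMemoryStore:
    """Hypothetical stand-in for the HDHT-backed store; a plain dict, not durable."""

    def __init__(self):
        self._data = defaultdict(int)

    def add(self, key, delta):
        self._data[key] += delta

    def get(self, key):
        return self._data.get(key, 0)


def aggregate(store, events):
    """Dimensional computation: count events per (campaign, event-time bucket)."""
    for campaign_id, event_time_ms in events:
        bucket = event_time_ms - event_time_ms % BUCKET_MS
        store.add((campaign_id, bucket), 1)


def query(store, campaign_id, from_ms, to_ms):
    """Answer a query for one campaign over [from_ms, to_ms), one row per bucket."""
    start = from_ms - from_ms % BUCKET_MS
    return [(b, store.get((campaign_id, b))) for b in range(start, to_ms, BUCKET_MS)]


store = InMemoryStore()
aggregate(store, [
    ("campaign-A", 5_000),
    ("campaign-A", 59_999),
    ("campaign-A", 60_000),
    ("campaign-B", 61_000),
])
rows_a = query(store, "campaign-A", 0, 120_000)
rows_b = query(store, "campaign-B", 0, 120_000)
```

Because buckets are keyed by event time rather than arrival time, a late event still lands in the bucket it belongs to, and a query simply reads the current aggregate for each bucket in the requested range.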

References
1. https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
2. http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
3. https://www.datatorrent.com/blog/blog-apex-performance-benchmark/
4. https://www.datatorrent.com/blog/data-store-for-scalable-stream-processing/
5. https://www.datatorrent.com/blog/blog-dimensions-computation-aggregate-navigator-part-1-intro/
6. https://www.datatorrent.com/blog/dimensions-computation-aggregate-navigator-part-2-implementation/
7. http://docs.datatorrent.com/app_data_framework/

© 2016 DataTorrent
Resources
• Apache Apex website - http://apex.apache.org/
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
• Facebook - https://www.facebook.com/ApacheApex/
• Meetup - http://www.meetup.com/topics/apache-apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/

We Are Hiring
• jobs@datatorrent.com
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders

What's hot

From Batch to Streaming with Apache Apex Dataworks Summit 2017

Apache Apex

Introduction to Apache Apex

Apache Apex

Building your first aplication using Apache Apex

Yogi Devendra Vyavahare

Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac

Apache Apex

Introduction to Apache Apex - CoDS 2016

Bhupesh Chawda

Apex as yarn application

Chinmay Kolhatkar

Presenter: Devendra Tagare - DataTorrent Engineer, Contributor to Apex, Data Architect experienced in building high scalability big data platforms. Apache Apex is a next generation native Hadoop big data platform. This talk will cover details about how it can be used as a powerful and versatile platform for big data. Apache Apex is a native Hadoop data-in-motion platform. We will discuss architectural differences between Apache Apex features with Spark Streaming. We will discuss how these differences effect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large scale ingestion. We will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. We will also discuss how these features affect time to market and total cost of ownership.

Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming

Apache Apex

Introduction to Apache Apex - The next generation native Hadoop platform, and writing a native Hadoop big data Apache Apex streaming application. This talk will cover details about how Apex can be used as a powerful and versatile platform for big data. Apache apex is being used in production by customers for both streaming and batch use cases. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch. alerts, real-time actions, threat detection, etc. Presenter : <b>Pramod Immaneni</b> Apache Apex PPMC member and senior architect at DataTorrent Inc, where he works on Apex and specializes in big data applications. Prior to DataTorrent he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs. Before that he was a technical co-founder of a mobile startup where he was an architect of a dynamic content rendering engine for mobile devices. This is a video of the webcast of an Apache Apex meetup event organized by Guru Virtues at 267 Boston Rd no. 9, North Billerica, MA, on <b>May 7th 2016</b> and broadcasted from San Jose, CA. If you are interested in helping organize i.e., hosting, presenting, community leadership Apache Apex community, please email apex-meetup@datatorrent.com

Introduction to Apache Apex and writing a big data streaming application

Apache Apex

DataTorrent Presentation @ Big Data Application Meetup

Thomas Weise

Presenter - Siyuan Hua, Apache Apex PMC Member & DataTorrent Engineer Apache Apex provides a DAG construction API that gives the developers full control over the logical plan. Some use cases don't require all of that flexibility, at least so it may appear initially. Also a large part of the audience may be more familiar with an API that exhibits more functional programming flavor, such as the new Java 8 Stream interfaces and the Apache Flink and Spark-Streaming API. Thus, to make Apex beginners to get simple first app running with familiar API, we are now providing the Stream API on top of the existing DAG API. The Stream API is designed to be easy to use yet flexible to extend and compatible with the native Apex API. This means, developers can construct their application in a way similar to Flink, Spark but also have the power to fine tune the DAG at will. Per our roadmap, the Stream API will closely follow Apache Beam (aka Google Data Flow) model. In the future, you should be able to either easily run Beam applications with the Apex Engine or express an existing application in a more declarative style.

Java High Level Stream API

Apache Apex

The presentation covers how Apache Apex is used to deliver actionable insights in real-time for Ad-tech. It includes a reference architecture to provide dimensional aggregates on TB scale for billions of events per day. The reference architecture covers concepts around Apache Apex, with Kafka as source and dimensional compute. Slides from Devendra Tagare at Apache Big Data North America in Miami 2017.

Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare

Apache Apex

This webinar will be a hands-on demonstration of how to clone and build the Apache Apex source code repositories, how to run the maven archetype to create a new Apex project, how to enhance it to build a word counting application and finally, how to run it and view results. We will also do a brief code walkthrough. Bio: Dr. Munagala V. Ramanath is a Committer for Apache Apex and a Software Engineer at DataTorrent. He has many years experience working for a variety of companies in California and a Ph.D. in Computer Science from the University of Wisconsin, Madison.

Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application

Apache Apex

Slides from http://www.meetup.com/Hadoop-User-Group-Munich/events/230313355/ This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.

Apache Apex: Stream Processing Architecture and Applications

Thomas Weise

Processing big data often requires running the same computations parallelly in multiple processes or threads, called partitions, with each partition handling a subset of the data. This becomes all the more necessary when processing live data streams where maintaining SLA is paramount. Furthermore, multiple different computations make up an application and each of them may have different partitioning needs. Partitioning also needs to adapt to changing data rates, input sources and other application requirements like SLA. In this talk, we will introduce how Apache Apex, a distributed stream processing platform on Hadoop, handles partitioning. We will look at different partitioning schemes provided by Apex some of which are unique in this space. We will also look at how Apex does dynamic partitioning, a feature unique to and pioneered by Apex to handle varying data needs with examples. We will also talk about the different utilities and libraries that Apex provides for users to be able to affect their own custom partitioning.

Smart Partitioning with Apache Apex (Webinar)

Apache Apex

This presentation discusses architectural differences between Apache Apex features with Spark Streaming. It discusses how these differences effect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLA, high throughput and large scale ingestion. Also, it will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. Further, it will discuss how these features affect time to market and total cost of ownership.

Architectual Comparison of Apache Apex and Spark Streaming

Apache Apex

Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations

Apache Apex

David Yan offers an overview of Apache Apex, a stream processing engine used in production by several large companies for real-time data analytics. Apache Apex uses a programming paradigm based on a directed acyclic graph (DAG). Each node in the DAG represents an operator, which can be data input, data output, or data transformation. Each directed edge in the DAG represents a stream, which is the flow of data from one operator to another. As part of Apex, the Malhar library provides a suite of connector operators so that Apex applications can read from or write to various data sources. It also includes utility operators that are commonly used in streaming applications, such as parsers, deduplicators and join, and generic building blocks that facilitate scalable state management and checkpointing. In addition to processing based on ingression time and processing time, Apex supports event-time windows and session windows. It also supports windowing, watermarks, allowed lateness, accumulation mode, triggering, and retraction detailed by Apache Beam as well as feedback loops in the DAG for iterative processing and at-least-once and “end-to-end” exactly-once processing guarantees. Apex provides various ways to fine-tune applications, such as operator partitioning, locality, and affinity. Apex is integrated with several open source projects, including Apache Beam, Apache Samoa (distributed machine learning), and Apache Calcite (SQL-based application specification). Users can choose Apex as the backend engine when running their application model based on these projects. David explains how to develop fault-tolerant streaming applications with low latency and high throughput using Apex, presenting the programming model with examples and demonstrating how custom business logic can be integrated using both the declarative high-level API and the compositional DAG-level API.

Developing streaming applications with apache apex (strata + hadoop world)

Apache Apex

Ingestion and Dimensions Compute and Enrich using Apache Apex

Apache Apex

Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)

Apache Apex

This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.

Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex

Apache Apex

What's hot (20)

From Batch to Streaming with Apache Apex Dataworks Summit 2017

Introduction to Apache Apex

Building your first aplication using Apache Apex

Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac

Introduction to Apache Apex - CoDS 2016

Apex as yarn application

Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming

Introduction to Apache Apex and writing a big data streaming application

DataTorrent Presentation @ Big Data Application Meetup

Java High Level Stream API

Actionable Insights with Apache Apex at Apache Big Data 2017 by Devendra Tagare

Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application

Apache Apex: Stream Processing Architecture and Applications

Smart Partitioning with Apache Apex (Webinar)

Architectual Comparison of Apache Apex and Spark Streaming

Kafka to Hadoop Ingest with Parsing, Dedup and other Big Data Transformations

Developing streaming applications with apache apex (strata + hadoop world)

Ingestion and Dimensions Compute and Enrich using Apache Apex

Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)

Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex

Viewers also liked

Apache Apex (http://apex.incubator.apache.org/) is an open source stream processing and next generation analytics platform incubating at the Apache Software Foundation. Apex is Hadoop native and was built from ground up for scalability, low-latency processing, high availability and operability. In this webinar, you will learn about Apache Apex fault tolerance, high availability and processing guarantees. From the users perspective, fault tolerance of a stream processing platform should cover the state of the application/processor and the in-flight data. In the event of failure, the platform should recover, restore state and resume processing with no loss of data. We will cover: * Components of an Apex application and how they are made fault tolerant * How native YARN support is leveraged for fault tolerance * How operator checkpointing works and how the user can tune it * Failure scenarios, recovery from failures, incremental recovery * Processing guarantees and which option is appropriate for your application * Sample topology for highly available, low latency real-time processing * How is fault-tolerance in Apex different from similar platforms such as Storm, Spark Streaming and Flink. Presented by Thomas Weise, Architect & Co-founder; Pramod Immaneni, Architect on BrightTALK on Mar 24th, 2016

Apache Apex Fault Tolerance and Processing Semantics

Apache Apex

Extending the Yahoo Streaming Benchmark

Jamie Grier

Windowing in Apache Apex

Apache Apex

Apache Apex Fault Tolerance and Processing Semantics

Apache Apex

Stream Processing use cases and applications with Apache Apex by Thomas Weise

Big Data Spain

Deep Dive into Apache Apex App Development

Apache Apex

This presentation will introduce usage of Apache Apex for Time Series & Data Ingestion Service by General Electric Internet of things Predix platform. Apache Apex is a native Hadoop data in motion platform that is being used by customers for both streaming as well as batch processing. Common use cases include ingestion into Hadoop, streaming analytics, ETL, database off-loads, alerts and monitoring, machine model scoring, etc. Abstract: Predix is an General Electric platform for Internet of Things. It helps users develop applications that connect industrial machines with people through data and analytics for better business outcomes. Predix offers a catalog of services that provide core capabilities required by industrial internet applications. We will deep dive into Predix Time Series and Data Ingestion services leveraging fast, scalable, highly performant, and fault tolerant capabilities of Apache Apex. Speakers: - Venkatesh Sivasubramanian, Sr Staff Software Engineer, GE Predix & Committer of Apache Apex - Pramod Immaneni, PPMC member of Apache Apex, and DataTorrent Architect

GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)

Apache Apex

Capital One's Next Generation Decision in less than 2 ms

Apache Apex

最近のストリーム処理事情振り返り

Sotaro Kimura

Viewers also liked (9)

Apache Apex Fault Tolerance and Processing Semantics

Extending the Yahoo Streaming Benchmark

Windowing in Apache Apex

Apache Apex Fault Tolerance and Processing Semantics

Stream Processing use cases and applications with Apache Apex by Thomas Weise

Deep Dive into Apache Apex App Development

GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop)

Capital One's Next Generation Decision in less than 2 ms

最近のストリーム処理事情振り返り

Similar to Extending The Yahoo Streaming Benchmark to Apache Apex

DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way

smalltown

Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14

p6academy

Fabian Hueske – Cascading on Flink

Flink Forward

Stream processing applications built on Apache Apex run on Hadoop clusters and typically power analytics use cases where availability, flexible scaling, high throughput, low latency and correctness are essential. These applications consume data from a variety of sources, including streaming sources like Apache Kafka, Kinesis or JMS, file based sources or databases. Processing results often need to be stored in external systems (sinks) for downstream consumers (pub-sub messaging, real-time visualization, Hive and other SQL databases etc.). Apex has the Malhar library with a wide range of connectors and other operators that are readily available to build applications. We will cover key characteristics like partitioning and processing guarantees, generic building blocks for new operators (write-ahead-log, incremental state saving, windowing etc.) and APIs for application specification.

Apache Big Data EU 2016: Building Streaming Applications with Apache Apex

Apache Apex

ansible_rhel_90.pdf

ssuserd254491

23.03.2015 Minsk .NET Meetup #15: Interop Between Scala and .NET

Is Antipov

Eugene Bova "Dapr (Distributed Application Runtime) in a Microservices Archit...

LogeekNightUkraine

1 extreme performance - part i

sqlserver.co.il

Web Scale Reasoning and the LarKC Project

Saltlux Inc.

Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Jamie Grier

Presented by Eren Avsarogullari and Pavel Hardak (ApacheCon 2020) https://www.linkedin.com/in/erenavsarogullari/ https://www.linkedin.com/in/pavelhardak/ Apache Spark is the backbone of Workday's Prism Analytics Platform, supporting various data processing use-cases such as Data Ingestion, Preparation(Cleaning, Transformation & Publishing) and Discovery. At Workday, we extend Spark OSS repo and build custom Spark releases covering our custom patches on the top of Spark OSS patches. Custom Spark release development introduces the challenges when supporting multiple Spark versions against to a single repo and dealing with large numbers of customers, each of which can execute their own long-running Spark Applications. When building the custom Spark releases and new Spark features, dedicated Benchmark pipeline is also important to catch performance regression by running the standard TPC-H & TPC-DS queries against to both Spark versions and monitoring Spark driver & executors' runtime behaviors before production. At deployment phase, we also follow progressive roll-out plan leveraged by Feature Toggles used to enable/disable the new Spark features at the runtime. As part of our development lifecycle, Feature Toggles help on various use cases such as selection of Spark compile-time and runtime versions, running test pipelines against to both Spark versions on the build pipeline and supporting progressive roll-out deployment when dealing with large numbers of customers and long-running Spark Applications. On the other hand, executed Spark queries' operation level runtime behaviors are important for debugging and troubleshooting. Incoming Spark release is going to introduce new SQL Rest API exposing executed queries' operation level runtime metrics and we transform them to queryable Hive tables in order to track operation level runtime behaviors per executed query. 
In the light of these, this session aims to cover Spark feature development lifecycle at Workday by covering custom Spark Upgrade model, Benchmark & Monitoring Pipeline and Spark Runtime Metrics Pipeline details through used patterns and technologies step by step.

Spark Development Lifecycle at Workday - ApacheCon 2020

Pavel Hardak

Presented by Pavel Hardak and Eren Avsarogullari (ApacheCon 2020) https://www.linkedin.com/in/pavelhardak/ https://www.linkedin.com/in/erenavsarogullari/ Title: Apache Spark Development Lifecycle at Workday Abstract: Apache Spark is the backbone of Workday's Prism Analytics Platform, supporting various data processing use-cases such as Data Ingestion, Preparation(Cleaning, Transformation & Publishing) and Discovery. At Workday, we extend Spark OSS repo and build custom Spark releases covering our custom patches on the top of Spark OSS patches. Custom Spark release development introduces the challenges when supporting multiple Spark versions against to a single repo and dealing with large numbers of customers, each of which can execute their own long-running Spark Applications. When building the custom Spark releases and new Spark features, dedicated Benchmark pipeline is also important to catch performance regression by running the standard TPC-H & TPC-DS queries against to both Spark versions and monitoring Spark driver & executors' runtime behaviors before production. At deployment phase, we also follow progressive roll-out plan leveraged by Feature Toggles used to enable/disable the new Spark features at the runtime. As part of our development lifecycle, Feature Toggles help on various use cases such as selection of Spark compile-time and runtime versions, running test pipelines against to both Spark versions on the build pipeline and supporting progressive roll-out deployment when dealing with large numbers of customers and long-running Spark Applications. On the other hand, executed Spark queries' operation level runtime behaviors are important for debugging and troubleshooting. Incoming Spark release is going to introduce new SQL Rest API exposing executed queries' operation level runtime metrics and we transform them to queryable Hive tables in order to track operation level runtime behaviors per executed query. 
In the light of these, this session aims to cover Spark feature development lifecycle at Workday by covering custom Spark Upgrade model, Benchmark & Monitoring Pipeline and Spark Runtime Metrics Pipeline details through used patterns and technologies step by step.

Apache Spark Development Lifecycle @ Workday - ApacheCon 2020

Eren Avşaroğulları

A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)

Spark Summit

Is your farm struggling to server your organization? How long is it taking between page requests? Where is your bottleneck in your farm? Is your SQL Server tuned properly? Worried about upgrading due to poor performance? We will look at various tools for analyzing and measuring performance of your farm. We will look at simple SharePoint and IIS configuration options to instantly improve performance.

SharePoint 2010 Boost your farm performance!

Brian Culver

Alex Zeltov - Intro to Big Data Analytics using Microsoft Machine Learning Server with Spark By combining enterprise-scale R analytics software with the power of Apache Hadoop and Apache Spark, Microsoft R Server for HDP or HDInsight gives you the scale and performance you need. Multi-threaded math libraries and transparent parallelization in R Server handle up to 1000x more data and up to 50x faster speeds than open-source R, which helps you to train more accurate models for better predictions. R Server works with the open-source R language, so all of your R scripts run without changes. Microsoft Machine Learning Server is your flexible enterprise platform for analyzing data at scale, building intelligent apps, and discovering valuable insights across your business with full support for Python and R. Machine Learning Server meets the needs of all constituents of the process – from data engineers and data scientists to line-of-business programmers and IT professionals. It offers a choice of languages and features algorithmic innovation that brings the best of open source and proprietary worlds together. R support is built on a legacy of Microsoft R Server 9.x and Revolution R Enterprise products. Significant machine learning and AI capabilities enhancements have been made in every release. In 9.2.1, Machine Learning Server adds support for the full data science lifecycle of your Python-based analytics. This meetup will NOT be a data science intro or R intro to programming. It is about working with data and big data on MLS . - How to Scale R - Work with R and Hadoop + Spark -Demo of MLS on HDP/HDInsight server with RStudio - How to operationalize deploying models using MLS Webservice operationalization features on MLS Server or on the cloud Azure ML (PaaS) offering. 
Speaker Bio: Alex Zeltov is Big Data Solutions Architect / Software Engineer / Programmer Analyst / Data Scientist with over 19 years of industry experience in Information Technology and most recently in Big Data and Predictive Analytics. He currently works as Global black belt Technical Specialist in Microsoft where he concentrates on Big Data and Advanced Analytics use cases. Previously to joining Microsoft he worked as a Sr. Solutions Engineer at Hortonworks where he specialized in HDP and HDF platforms.

Intro to big data analytics using microsoft machine learning server with spark

Alex Zeltov

Ml2

poovarasu maniandan

WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and More

WSO2

What's New in .Net 4.5

Malam Team

Testing in the Cloud using Panda

Tao Jiang

Supreet Oberoi's presentation on "Large scale log processing with Cascading & Elastic Search". Elasticsearch is becoming a popular platform for log analysis with its ELK stack: Elasticsearch for search, Logstash for centralized logging, and Kibana for visualization. Complemented with Cascading, the application development platform for building Data applications on Apache Hadoop, developers can correlate at scale multiple log and data streams to perform rich and complex log processing before making it available to the ELK stack.

Elasticsearch + Cascading for Scalable Log Processing

Cascading

Similar to Extending The Yahoo Streaming Benchmark to Apache Apex (20)

DevOpsDays Taipei 2019 - Mastering IaC the DevOps Way

Primavera gateway SAP provider - Oracle Primavera P6 Collaborate 14

Fabian Hueske – Cascading on Flink

Apache Big Data EU 2016: Building Streaming Applications with Apache Apex

ansible_rhel_90.pdf

23.03.2015 Minsk .NET Meetup #15: Interop Between Scala and .NET

Eugene Bova "Dapr (Distributed Application Runtime) in a Microservices Archit...

1 extreme performance - part i

Web Scale Reasoning and the LarKC Project

Extending the Yahoo Streaming Benchmark + MapR Benchmarks

Spark Development Lifecycle at Workday - ApacheCon 2020

Apache Spark Development Lifecycle @ Workday - ApacheCon 2020

A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)

SharePoint 2010 Boost your farm performance!

Intro to big data analytics using microsoft machine learning server with spark

Ml2

WSO2 Stream Processor: Graphical Editor, HTTP & Message Trace Analytics and More

What's New in .Net 4.5

Testing in the Cloud using Panda

Elasticsearch + Cascading for Scalable Log Processing

1. Extending the Yahoo Streaming Benchmark for Apache Apex
San Jose Apache Apex Meetup, May 4th 2016
Sandesh Hegde, sandesh@apache.org

2. Background
• Yahoo created a benchmark for stream processing systems and used it to compare Storm, Flink and Spark Streaming [1]
• dataArtisans extended the benchmark by comparing Flink and Storm under different scenarios [2]
• No stream processing benchmark comparison is complete without including Apache Apex.

3. Yahoo Streaming Benchmark
Simple advertisement application: count how many times an ad campaign has been seen in a window.
• Read ads from Kafka
• Deserialize JSON string
• Filter unnecessary ads
• Projection of fields (remove non-essential fields)
• Join ad id with campaign id from Redis
• Windowed count per campaign and output to Redis
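The six steps listed on this slide can be sketched in plain Python. This is only an illustration of the data flow, not the Apex application from the talk: a list of strings stands in for Kafka, the dictionary `AD_TO_CAMPAIGN` is a hypothetical stand-in for the Redis lookup, and both the 10-second window and the choice to keep only "view" events are assumptions that the slides do not state.

```python
import json
from collections import Counter

WINDOW_MS = 10_000  # assumed window length; the slides do not give one

# Hypothetical stand-in for the Redis ad_id -> campaign_id table.
AD_TO_CAMPAIGN = {"ad-1": "campaign-A", "ad-2": "campaign-A", "ad-3": "campaign-B"}


def process(raw_messages, wanted_event_type="view"):
    """Run the benchmark steps over an iterable of JSON strings.

    Returns a Counter keyed by (campaign_id, window_start_ms).
    """
    counts = Counter()
    for raw in raw_messages:                    # 1. read (a list stands in for Kafka)
        event = json.loads(raw)                 # 2. deserialize the JSON string
        if event["event_type"] != wanted_event_type:
            continue                            # 3. filter (kept event type is an assumption)
        ad_id = event["ad_id"]                  # 4. projection: keep only the needed fields
        event_time = int(event["event_time"])
        campaign = AD_TO_CAMPAIGN.get(ad_id)    # 5. join ad id with campaign id
        if campaign is None:
            continue                            # unknown ad: nothing to count
        window_start = event_time - event_time % WINDOW_MS
        counts[(campaign, window_start)] += 1   # 6. windowed count per campaign
    return counts


if __name__ == "__main__":
    sample = [
        json.dumps({"ad_id": "ad-1", "event_type": "view", "event_time": "1000"}),
        json.dumps({"ad_id": "ad-3", "event_type": "view", "event_time": "10000"}),
    ]
    for key, value in sorted(process(sample).items()):
        print(key, value)
```

In the real benchmark the final counts are written back to Redis; here they are simply returned so the result can be inspected.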

4. Application - with Kafka
[Diagram: application DAG with the labels Kafka, Kafka Input, Deserialize, Filter, Filter Fields, Redis Join, Redis Output]

5. Setup
• Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
• 10GigE between compute nodes
• 4 Kafka brokers (2 partitions each & 1 replica)
• Kafka version: 0.8.2
• Apex (3.4-SNAPSHOT & 3.3) & Flink (1.0.2)
• Yarn container size: 16GB
• 1 ZooKeeper
• Message size: 218 bytes
• Sample message: {"user_id":"e5e0db4b-05ea-4ac5-af7a-4bba5ed27c4c","page_id":"80f60d0a-b02b-40e2-a667-5548a1120dda","ad_id":"600589859","ad_type":"banner78","event_type":"purchase","event_time":"1462374087774","ip_address":"1.2.3.4"}

6. Apex Application

7. Physical Plan

8. Quick Primer on Locality
• CONTAINER_LOCAL
■ Deployed in the same process, different threads
■ No serialization
■ Queue between the operators
• THREAD_LOCAL
■ Same thread
■ No serialization
■ Use it only when operators do light work
Note: [New feature] Anti Affinity is not covered here.
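The difference between the two localities can be illustrated with a small Python analogy. It does not use the Apex API; the operators `parse` and `double` are hypothetical. `run_thread_local` calls both operators back to back on one thread, while `run_container_local` runs each operator on its own thread inside one process and connects them with an in-memory queue, so objects are handed over by reference without serialization.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def parse(value):
    """Hypothetical upstream operator."""
    return int(value)


def double(value):
    """Hypothetical downstream operator."""
    return value * 2


def run_thread_local(tuples):
    """Both operators run on the caller's thread; tuples pass as plain arguments."""
    return [double(parse(t)) for t in tuples]


def run_container_local(tuples):
    """One thread per operator in the same process, joined by an in-memory queue."""
    q = queue.Queue(maxsize=1024)
    out = []

    def upstream():
        for t in tuples:
            q.put(parse(t))  # handed over by reference, nothing is serialized
        q.put(_DONE)

    def downstream():
        while True:
            item = q.get()
            if item is _DONE:
                break
            out.append(double(item))

    workers = [threading.Thread(target=upstream), threading.Thread(target=downstream)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return out


if __name__ == "__main__":
    print(run_thread_local(["1", "2", "3"]))
    print(run_container_local(["1", "2", "3"]))
```

The queue lets the two operators overlap their work at the cost of a hand-off per tuple, which is in line with the slide's advice to reserve THREAD_LOCAL for operators that do light work.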

9. Benchmarking Against Previous Releases
Part of release certification
https://www.datatorrent.com/blog/blog-apex-performance-benchmark/

10. Application - with Kafka
https://github.com/sandeshh/streaming-benchmarks

11. Application - With Generator
[Diagram: application DAG with the labels Kafka, Kafka Input, Deserialize, Filter, Filter Fields, Redis Join, Redis Output, Generator]
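A generator-driven variant produces events inside the application rather than reading them from Kafka. The sketch below shows one way such a source could look in Python. It is not the generator used in the talk: the field names follow the sample message from the Setup slide, while the function name, the list of event types (apart from "purchase") and the one-millisecond spacing of event times are assumptions made for illustration.

```python
import json
import random
import uuid

# "purchase" and "banner78" appear in the sample message; the other values are assumed.
EVENT_TYPES = ["view", "click", "purchase"]
AD_TYPE = "banner78"


def generate_events(n, ad_ids, start_ms, seed=None):
    """Yield n JSON strings shaped like the benchmark's sample message."""
    rng = random.Random(seed)
    for i in range(n):
        yield json.dumps({
            "user_id": str(uuid.UUID(int=rng.getrandbits(128))),
            "page_id": str(uuid.UUID(int=rng.getrandbits(128))),
            "ad_id": rng.choice(ad_ids),
            "ad_type": AD_TYPE,
            "event_type": rng.choice(EVENT_TYPES),
            "event_time": str(start_ms + i),  # assumed: one event per millisecond
            "ip_address": "1.2.3.4",
        })


if __name__ == "__main__":
    for message in generate_events(3, ["600589859"], 1462374087774, seed=42):
        print(message)
```

Passing a seed makes the stream reproducible, which helps when comparing runs.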

12. Application - With Generator
Setup: single partition
https://github.com/sandeshh/streaming-benchmarks

13. State of the Art & Streaming
[Diagram: application DAG with the labels Generator, Filter, Filter Fields, Redis Join, Redis Output]
What’s our recommendation for querying the state? An in-memory key-value store in the operators?

14. Application - State Store & Query
[Diagram: application DAG with the labels Generator, Filter, Filter Fields, Redis Join, Dimensional Computation, Store (HDHT), QueryResult]
1. Durable state (HDHT is a key-value store native to Hadoop) [4]
2. Single system, scales with your application
3. Easy integration with external consoles [7]
4. Low operability cost
5. Complex dimensional computation [5][6]
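The idea of keeping queryable state inside the application, instead of pushing every result to an external store, can be sketched as follows. This is a toy illustration only: a plain in-memory dictionary stands in for the durable HDHT store, and the class and method names are hypothetical rather than part of any Apex or HDHT API.

```python
from collections import defaultdict


class CountingOperator:
    """Keeps per-(campaign, window) counts and answers range queries over them."""

    def __init__(self, window_ms):
        self.window_ms = window_ms
        # In-memory stand-in for a durable key-value store such as HDHT.
        self._counts = defaultdict(int)

    def process(self, campaign_id, event_time_ms):
        """Add one event to the window that contains event_time_ms."""
        window_start = event_time_ms - event_time_ms % self.window_ms
        self._counts[(campaign_id, window_start)] += 1

    def query(self, campaign_id, from_ms, to_ms):
        """Return {window_start: count} for windows starting in [from_ms, to_ms)."""
        return {
            window: count
            for (cid, window), count in sorted(self._counts.items())
            if cid == campaign_id and from_ms <= window < to_ms
        }


if __name__ == "__main__":
    op = CountingOperator(window_ms=10_000)
    for campaign, ts in [("A", 1_000), ("A", 2_000), ("A", 12_000), ("B", 500)]:
        op.process(campaign, ts)
    print(op.query("A", 0, 20_000))
```

Because the windows are keyed by event time, a query can be answered for any past range that the store still holds.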

15. Demo

16. Q&A

17. References
[1] https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
[2] http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
[3] https://www.datatorrent.com/blog/blog-apex-performance-benchmark/
[4] https://www.datatorrent.com/blog/data-store-for-scalable-stream-processing/
[5] https://www.datatorrent.com/blog/blog-dimensions-computation-aggregate-navigator-part-1-intro/
[6] https://www.datatorrent.com/blog/dimensions-computation-aggregate-navigator-part-2-implementation/
[7] http://docs.datatorrent.com/app_data_framework/

18. Resources
• Apache Apex website - http://apex.apache.org/
• Subscribe - http://apex.apache.org/community.html
• Download - http://apex.apache.org/downloads.html
• Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
• Facebook - https://www.facebook.com/ApacheApex/
• Meetup - http://www.meetup.com/topics/apache-apex
• Free Enterprise License for Startups - https://www.datatorrent.com/product/startup-accelerator/
© 2016 DataTorrent
